Hi Baahu,

That should not be a problem, as long as you allocate a sufficient buffer
for reading.
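
For reference, the per-file splitting step the thread describes (turn each
(filename, content) pair from wholeTextFiles() into (filename, record)
pairs) could be sketched roughly like this — the function name and the
regex-based splitting are my assumptions, based on the <RECORD> example
below, not tested code:

```python
import re

def split_records(filename, content):
    """Split one whole-text-file payload into (filename, record) pairs.

    Assumes records are delimited by literal <RECORD>...</RECORD> tags,
    as in the sample in this thread. (?s) lets '.' match newlines so
    multi-line records are captured.
    """
    records = re.findall(r"(?s)<RECORD>.*?</RECORD>", content)
    return [(filename, r) for r in records]

# Example payload, shaped like the sample in the original mail:
content = ("<RECORD>\n data of record 1\n</RECORD>\n"
           "<RECORD>\n data of record 2\n</RECORD>\n")
pairs = split_records("file1.xml", content)
```

In Spark this would typically be applied with a flatMap over the RDD
returned by wholeTextFiles(), so each big file fans out into many small
records.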

I was just working on a patch[1] to support reading whole text files in
SQL. That approach can actually be slightly better, because there we read
the data into off-heap memory (using the Unsafe interface).

1. https://github.com/apache/spark/pull/14151

Thanks,
--Prashant


On Tue, Jul 12, 2016 at 6:24 PM, Bahubali Jain <bahub...@gmail.com> wrote:

> Hi,
> We have a requirement wherein we need to process a set of XML files; each
> of the XML files contains several records, e.g.:
> <RECORD>
>      data of record 1......
> </RECORD>
>
> <RECORD>
>     data of record 2......
> </RECORD>
>
> Expected output is   <filename and individual records>
>
> Since we needed the file name as well in the output, we chose
> wholeTextFiles(). We decided against using StreamXmlRecordReader and
> StreamInputFormat since I could not find a way to retrieve the filename.
>
> These XML files can be pretty big; occasionally they reach a size of
> 1 GB. Since the contents of each file are put into a single partition,
> would such big files be an issue?
> The AWS cluster (50 nodes) that we use is fairly strong, with each
> machine having around 60 GB of memory.
>
> Thanks,
> Baahu
>
