Hi Baahu,

That should not be a problem, provided you allocate a sufficiently large buffer for reading.
I was just working on a patch [1] to support reading whole text files in SQL. That approach may actually be slightly better, because there the data is read into off-heap memory (using the Unsafe interface).

1. https://github.com/apache/spark/pull/14151

Thanks,
--Prashant

On Tue, Jul 12, 2016 at 6:24 PM, Bahubali Jain <bahub...@gmail.com> wrote:
> Hi,
> We have a requirement where we need to process a set of XML files, each
> of which contains several records, e.g.:
>
> <RECORD>
> data of record 1......
> </RECORD>
>
> <RECORD>
> data of record 2......
> </RECORD>
>
> The expected output is <filename and individual records>.
>
> Since we also need the file name in the output, we chose wholeTextFiles().
> We decided against StreamXmlRecordReader and StreamInputFormat since I
> could not find a way to retrieve the filename.
>
> These XML files can be pretty big; occasionally they reach a size of
> 1 GB. Since the contents of each file are put into a single partition,
> would such big files be an issue?
> The AWS cluster (50 nodes) that we use is fairly strong, with each
> machine having around 60 GB of memory.
>
> Thanks,
> Baahu
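For what it's worth, the per-file record splitting Baahu describes can be sketched in plain Python. This is only an illustration (the function name `split_records` and the regex-based parsing are my own, not from the thread or from Spark); in PySpark the same function would be applied with flatMap over the (filename, content) pairs that wholeTextFiles() returns:

```python
import re

# Matches one <RECORD>...</RECORD> block; DOTALL lets "." span newlines.
RECORD_RE = re.compile(r"<RECORD>.*?</RECORD>", re.DOTALL)

def split_records(filename, content):
    """Split one whole-file string into (filename, record) pairs.

    Hypothetical helper: this is the function you would apply to each
    element of the RDD produced by sc.wholeTextFiles(path), e.g.:

        rdd = sc.wholeTextFiles("s3://bucket/xml-dir")
        records = rdd.flatMap(lambda kv: split_records(kv[0], kv[1]))
    """
    return [(filename, rec) for rec in RECORD_RE.findall(content)]

if __name__ == "__main__":
    sample = ("<RECORD>\ndata of record 1......\n</RECORD>\n\n"
              "<RECORD>\ndata of record 2......\n</RECORD>\n")
    pairs = split_records("part-0001.xml", sample)
    print(len(pairs))  # 2
    print(pairs[0][0])  # part-0001.xml
```

Note this still requires the whole 1 GB file content to be materialized as a single string on one executor, so the memory concern in the question is unchanged; it only shows how to keep the filename attached to each record.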