Thanks for the quick response. It's a single XML file, and I am using the top-level rowTag, so it creates only one row in a DataFrame with 5 columns. One of these columns contains most of the data as a StructType. Is there a limit on how much data a single DataFrame cell can hold?
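A quick way to see whether a finer-grained rowTag is possible is to count the repeating children of the root element on a small sample of the file. This is only an illustrative sketch: the `<record>` tag and the toy document below are hypothetical stand-ins, not the actual structure of the GGL file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy stand-in for a file like the 1GB one in this thread: a single
# top-level element wrapping many small repeating children.
sample = """
<GGL>
  <record><id>1</id></record>
  <record><id>2</id></record>
  <record><id>3</id></record>
</GGL>
"""

# Count the direct children of the root. A tag that repeats many times
# is a better rowTag candidate than the root itself, because each parsed
# row then stays small instead of being one document-sized row.
root = ET.fromstring(sample)
counts = Counter(child.tag for child in root)
print(counts.most_common(1)[0])  # prints ('record', 3)
```

If such a repeating tag exists, passing it as `rowTag` (e.g. `.options(rowTag='record')` in the spark-xml reader) should spread the data across many small rows rather than one huge StructType cell.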
I will check with the new version, try different rowTags, and increase --executor-memory tomorrow. I will open a new issue as well.

On Tue, Nov 15, 2016 at 7:52 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Hi Arun,
>
> I have a few questions.
>
> Does your XML file contain a few huge documents? If a single row is very
> large (say 500MB), it would consume a lot of memory, because, if I
> remember correctly, at least one whole row has to be held in memory to
> iterate. I remember this happening to me before while processing a huge
> record for test purposes.
>
> How about trying to increase --executor-memory?
>
> Also, could you try selecting only a few fields with the latest version,
> to prune the data, just to be doubly sure, if you don't mind?
>
> Lastly, do you mind opening an issue at
> https://github.com/databricks/spark-xml/issues if you still face this
> problem? I will try my best to take a look.
>
> Thank you.
>
> 2016-11-16 9:12 GMT+09:00 Arun Patel <arunp.bigd...@gmail.com>:
>
>> I am trying to read an XML file which is 1GB in size. I am getting a
>> 'java.lang.OutOfMemoryError: Requested array size exceeds VM limit'
>> error after reading 7 partitions in local mode. In YARN mode, it throws
>> a 'java.lang.OutOfMemoryError: Java heap space' error after reading 3
>> partitions.
>>
>> Any suggestion?
>>
>> PySpark shell command: pyspark --master local[4] --driver-memory 3G --jars /tmp/spark-xml_2.10-0.3.3.jar
>>
>> DataFrame creation command: df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='GGL').load('GGL_1.2G.xml')
>>
>> 16/11/15 18:27:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 25978 ms on localhost (1/10)
>> 16/11/15 18:27:04 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:268435456+134217728
>> 16/11/15 18:27:55 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 2309 bytes result sent to driver
>> 16/11/15 18:27:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, partition 3,ANY, 2266 bytes)
>> 16/11/15 18:27:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
>> 16/11/15 18:27:55 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 51001 ms on localhost (2/10)
>> 16/11/15 18:27:55 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:402653184+134217728
>> 16/11/15 18:28:19 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 2309 bytes result sent to driver
>> 16/11/15 18:28:19 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, localhost, partition 4,ANY, 2266 bytes)
>> 16/11/15 18:28:19 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
>> 16/11/15 18:28:19 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 24336 ms on localhost (3/10)
>> 16/11/15 18:28:19 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:536870912+134217728
>> 16/11/15 18:28:40 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 2309 bytes result sent to driver
>> 16/11/15 18:28:40 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, localhost, partition 5,ANY, 2266 bytes)
>> 16/11/15 18:28:40 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
>> 16/11/15 18:28:40 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 20895 ms on localhost (4/10)
>> 16/11/15 18:28:40 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:671088640+134217728
>> 16/11/15 18:29:01 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 2309 bytes result sent to driver
>> 16/11/15 18:29:01 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, localhost, partition 6,ANY, 2266 bytes)
>> 16/11/15 18:29:01 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
>> 16/11/15 18:29:01 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 20793 ms on localhost (5/10)
>> 16/11/15 18:29:01 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:805306368+134217728
>> 16/11/15 18:29:22 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 2309 bytes result sent to driver
>> 16/11/15 18:29:22 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, localhost, partition 7,ANY, 2266 bytes)
>> 16/11/15 18:29:22 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
>> 16/11/15 18:29:22 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 21306 ms on localhost (6/10)
>> 16/11/15 18:29:22 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:939524096+134217728
>> 16/11/15 18:29:43 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7). 2309 bytes result sent to driver
>> 16/11/15 18:29:43 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, localhost, partition 8,ANY, 2266 bytes)
>> 16/11/15 18:29:43 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
>> 16/11/15 18:29:43 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 21130 ms on localhost (7/10)
>> 16/11/15 18:29:43 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:1073741824+134217728
>> 16/11/15 18:29:48 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
>> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>>     at java.util.Arrays.copyOf(Arrays.java:2271)
>>     at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>>     at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:122)
>>     at java.io.DataOutputStream.write(DataOutputStream.java:88)
>>     at com.databricks.spark.xml.XmlRecordReader.readUntilMatch(XmlInputFormat.scala:188)
>>     at com.databricks.spark.xml.XmlRecordReader.next(XmlInputFormat.scala:156)
>>     at com.databricks.spark.xml.XmlRecordReader.nextKeyValue(XmlInputFormat.scala:141)
>>     at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
>>     at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>     at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>     at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>>     at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>>     at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>>     at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>>     at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1142)
>>     at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1142)
>>     at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1143)
>>     at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1143)
>>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
>>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> 16/11/15 18:29:48 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
>> java.lang.OutOfMemoryError: Requested array size exceeds VM limit