Thanks for the quick response.

It's a single XML file, and I am using the top-level rowTag, so it creates
only one row in a DataFrame with 5 columns. One of these columns contains
most of the data as a StructType.  Is there a limit on how much data a
single DataFrame cell can hold?

I will check with the new version, try different rowTags, and increase
executor-memory tomorrow. I will open a new issue as well.
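Before rerunning, it may help to confirm which repeated element would make a sensible deeper rowTag. A minimal sketch using only the Python standard library; the `<record>` element name and the inline sample are made up for illustration, not taken from the real GGL file:

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical stand-in for the real 1GB XML file; in practice you
# would pass an open file handle instead of this in-memory sample.
sample = b"""<GGL>
  <record><id>1</id></record>
  <record><id>2</id></record>
  <record><id>3</id></record>
</GGL>"""

# iterparse streams the document, so the whole file is never held
# in memory at once; count how often each element name occurs.
counts = Counter()
for _, elem in ET.iterparse(io.BytesIO(sample), events=("end",)):
    counts[elem.tag] += 1
    elem.clear()  # free the subtree as soon as it has been counted

# An element that repeats many times is a better rowTag candidate
# than the single top-level tag, which forces one enormous row.
print(counts.most_common())
```

A tag with a high count and a bounded subtree size should keep each row small enough to fit in executor memory.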



On Tue, Nov 15, 2016 at 7:52 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Hi Arun,
>
>
> I have a few questions.
>
> Does your XML file contain a few huge documents? If a single row is very
> large (say 500MB), it would consume a lot of memory, because the reader
> has to hold at least one whole row to iterate, if I remember correctly.
> I remember this happening to me before while processing a huge record
> for test purposes.
>
>
> How about trying to increase --executor-memory?
>
>
> Also, if you don't mind, could you try selecting only a few fields with
> the latest version, to prune the data and doubly make sure?
>
>
> Lastly, do you mind opening an issue at
> https://github.com/databricks/spark-xml/issues if you still face this
> problem?
>
> I will do my best to take a look.
>
>
> Thank you.
>
>
> 2016-11-16 9:12 GMT+09:00 Arun Patel <arunp.bigd...@gmail.com>:
>
>> I am trying to read an XML file which is 1GB in size.  I am getting an
>> error 'java.lang.OutOfMemoryError: Requested array size exceeds VM
>> limit' after reading 7 partitions in local mode.  In Yarn mode, it
>> throws 'java.lang.OutOfMemoryError: Java heap space' error after reading
>> 3 partitions.
>>
>> Any suggestion?
>>
>> PySpark Shell Command:    pyspark --master local[4] --driver-memory 3G --jars /tmp/spark-xml_2.10-0.3.3.jar
>>
>>
>>
>> Dataframe Creation Command:   df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='GGL').load('GGL_1.2G.xml')
>>
>>
>>
>> 16/11/15 18:27:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 25978 ms on localhost (1/10)
>> 16/11/15 18:27:04 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:268435456+134217728
>> 16/11/15 18:27:55 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 2309 bytes result sent to driver
>> 16/11/15 18:27:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, partition 3,ANY, 2266 bytes)
>> 16/11/15 18:27:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
>> 16/11/15 18:27:55 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 51001 ms on localhost (2/10)
>> 16/11/15 18:27:55 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:402653184+134217728
>> 16/11/15 18:28:19 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 2309 bytes result sent to driver
>> 16/11/15 18:28:19 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, localhost, partition 4,ANY, 2266 bytes)
>> 16/11/15 18:28:19 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
>> 16/11/15 18:28:19 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 24336 ms on localhost (3/10)
>> 16/11/15 18:28:19 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:536870912+134217728
>> 16/11/15 18:28:40 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 2309 bytes result sent to driver
>> 16/11/15 18:28:40 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, localhost, partition 5,ANY, 2266 bytes)
>> 16/11/15 18:28:40 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
>> 16/11/15 18:28:40 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 20895 ms on localhost (4/10)
>> 16/11/15 18:28:40 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:671088640+134217728
>> 16/11/15 18:29:01 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 2309 bytes result sent to driver
>> 16/11/15 18:29:01 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, localhost, partition 6,ANY, 2266 bytes)
>> 16/11/15 18:29:01 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
>> 16/11/15 18:29:01 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 20793 ms on localhost (5/10)
>> 16/11/15 18:29:01 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:805306368+134217728
>> 16/11/15 18:29:22 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 2309 bytes result sent to driver
>> 16/11/15 18:29:22 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, localhost, partition 7,ANY, 2266 bytes)
>> 16/11/15 18:29:22 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
>> 16/11/15 18:29:22 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 21306 ms on localhost (6/10)
>> 16/11/15 18:29:22 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:939524096+134217728
>> 16/11/15 18:29:43 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7). 2309 bytes result sent to driver
>> 16/11/15 18:29:43 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, localhost, partition 8,ANY, 2266 bytes)
>> 16/11/15 18:29:43 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
>> 16/11/15 18:29:43 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 21130 ms on localhost (7/10)
>> 16/11/15 18:29:43 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:1073741824+134217728
>>
>> 16/11/15 18:29:48 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
>>
>> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>>         at java.util.Arrays.copyOf(Arrays.java:2271)
>>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:122)
>>         at java.io.DataOutputStream.write(DataOutputStream.java:88)
>>         at com.databricks.spark.xml.XmlRecordReader.readUntilMatch(XmlInputFormat.scala:188)
>>         at com.databricks.spark.xml.XmlRecordReader.next(XmlInputFormat.scala:156)
>>         at com.databricks.spark.xml.XmlRecordReader.nextKeyValue(XmlInputFormat.scala:141)
>>         at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
>>         at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>         at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>         at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>>         at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>>         at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>>         at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1142)
>>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1142)
>>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1143)
>>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1143)
>>         at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
>>         at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
>>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>
>> 16/11/15 18:29:48 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
>>
>> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>>
>>
>
