[ https://issues.apache.org/jira/browse/SPARK-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin updated SPARK-10538:
--------------------------------
    Target Version/s: 1.6.0  (was: 1.5.2)

> java.lang.NegativeArraySizeException during join
> ------------------------------------------------
>
>                 Key: SPARK-10538
>                 URL: https://issues.apache.org/jira/browse/SPARK-10538
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Maciej Bryński
>            Assignee: Davies Liu
>         Attachments: screenshot-1.png
>
>
> Hi,
> I've run into a problem when joining tables in PySpark (20 of them in my example).
> While the first partition of one of the consecutive joins is being computed, its shuffle read size is unusually large (294.7 MB / 146 records) compared to the other partitions (approx. 272.5 KB / 113 records).
> I can also observe that just before the crash the Python process grows to a few GB of RAM.
> After some time the following exception is thrown:
> {code}
> java.lang.NegativeArraySizeException
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>         at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
>         at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>         at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119)
>         at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         at org.apache.spark.scheduler.Task.run(Task.scala:88)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> I'm running this on a 2-node cluster (12 cores, 64 GB RAM).
> Config:
> {code}
> spark.driver.memory              10g
> spark.executor.extraJavaOptions  -XX:-UseGCOverheadLimit -XX:+UseParallelGC -Dfile.encoding=UTF8
> spark.executor.memory            60g
> spark.storage.memoryFraction     0.05
> spark.shuffle.memoryFraction     0.75
> spark.driver.maxResultSize       10g
> spark.cores.max                  24
> spark.kryoserializer.buffer.max  1g
> spark.default.parallelism        200
> {code}
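For reference, the same settings can also be applied programmatically through SparkConf. The sketch below is not part of the original report; note in particular that spark.driver.memory generally has to be set before the driver JVM starts (e.g. via spark-submit or spark-defaults.conf), so setting it from Python may have no effect:

{code}
# Sketch (not from the report): applying the reported settings via SparkConf
# instead of spark-defaults.conf.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        # Usually ignored if set after the driver JVM has started; in
        # practice this belongs in spark-submit or spark-defaults.conf.
        .set("spark.driver.memory", "10g")
        .set("spark.executor.extraJavaOptions",
             "-XX:-UseGCOverheadLimit -XX:+UseParallelGC -Dfile.encoding=UTF8")
        .set("spark.executor.memory", "60g")
        .set("spark.storage.memoryFraction", "0.05")
        .set("spark.shuffle.memoryFraction", "0.75")
        .set("spark.driver.maxResultSize", "10g")
        .set("spark.cores.max", "24")
        .set("spark.kryoserializer.buffer.max", "1g")
        .set("spark.default.parallelism", "200"))

sc = SparkContext(conf=conf)
{code}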
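The report does not include the code that triggers the failure. A minimal sketch of the described pattern (20 consecutive joins in PySpark) might look like the following; the table contents, column names, and row counts are invented for illustration:

{code}
# Hypothetical reproduction sketch of the reported pattern: 20 consecutive
# joins in PySpark 1.5. All data below is made up; the original tables are
# not described in the report.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="spark-10538-repro-sketch")
sqlContext = SQLContext(sc)

# 20 stand-in DataFrames sharing a join key "id".
tables = [
    sqlContext.createDataFrame(
        [(i, "v%d_%d" % (t, i)) for i in range(1000)],
        ["id", "col_%d" % t])
    for t in range(20)
]

# Join them one after another, as described in the report.
joined = tables[0]
for t in tables[1:]:
    joined = joined.join(t, "id")

# Forcing evaluation triggers the shuffle where the exception was observed.
joined.count()
{code}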