[ https://issues.apache.org/jira/browse/SPARK-17777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ameen Tayyebi updated SPARK-17777:
----------------------------------
    Attachment: jstack-dump.txt

> Spark Scheduler Hangs Indefinitely
> ----------------------------------
>
>                 Key: SPARK-17777
>                 URL: https://issues.apache.org/jira/browse/SPARK-17777
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>         Environment: AWS EMR 4.3, can also be reproduced locally
>            Reporter: Ameen Tayyebi
>         Attachments: jstack-dump.txt, repro.scala
>
> We've identified a problem with Spark scheduling. The issue manifests itself
> when an RDD calls SparkContext.parallelize within its getPartitions method.
> This seemingly "recursive" call causes the scheduler to hang. We have a repro
> case that can easily be run.
> Please advise on what the issue might be and how we can work around it in the
> meantime.
> I've attached repro.scala, which can simply be pasted into spark-shell to
> reproduce the problem.
> Why are we calling sc.parallelize in production within getPartitions? Well,
> we have an RDD that is composed of several thousand Parquet files. To compute
> the partitioning strategy for this RDD, we create an RDD that reads all file
> sizes from S3 in parallel, so that we can quickly determine the proper
> partitions. We do this to avoid executing the size lookups serially from the
> master node, which can slow execution down significantly. Pseudo-code:
> val splitInfo = sc.parallelize(filePaths).map(f => (f, s3.getObjectSummary)).collect()
> Similar logic is used by Spark itself in DataFrame:
> https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L902
>
> Thanks,
> -Ameen

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
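
A minimal, self-contained sketch of the pattern described in the report: a custom RDD whose getPartitions submits a nested job via sparkContext.parallelize. This is not the attached repro.scala; the names (ParquetFilesRDD, FileSplit, sizeOf) are illustrative stand-ins, and the S3 size lookup is replaced with a dummy function. It assumes the Spark 1.6-era RDD API.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Illustrative partition type carrying a file path and its size.
case class FileSplit(index: Int, path: String, size: Long) extends Partition

// A custom RDD that computes its partitioning by submitting a *nested* job
// from getPartitions -- the pattern the report says hangs the scheduler.
class ParquetFilesRDD(sc: SparkContext, filePaths: Seq[String])
  extends RDD[String](sc, Nil) {

  override def getPartitions: Array[Partition] = {
    // Stand-in for the real S3 size lookup (s3.getObjectSummary in the report).
    val sizeOf = (path: String) => path.length.toLong
    // This nested parallelize/collect is the suspect call: it submits a second
    // job while the scheduler is still resolving this RDD's partitions.
    val sizes = sparkContext.parallelize(filePaths)
      .map(p => (p, sizeOf(p)))
      .collect()
    sizes.zipWithIndex.map { case ((path, size), i) =>
      FileSplit(i, path, size): Partition
    }
  }

  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    val fs = split.asInstanceOf[FileSplit]
    Iterator(s"${fs.path} (${fs.size} bytes)")
  }
}

Pasted into spark-shell (where sc is already defined), something like
new ParquetFilesRDD(sc, Seq("s3://bucket/a.parquet", "s3://bucket/b.parquet")).collect()
should force getPartitions and trigger the nested job described above.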