[ https://issues.apache.org/jira/browse/SPARK-17777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15548784#comment-15548784 ]

Ameen Tayyebi commented on SPARK-17777:
---------------------------------------

Thanks for the comments, Sean.

Your point that the behavior of things which aren't supposed to work can vary 
is fair. In my experience, though, when something works it's usually 
intentional. Let me share some more data points with you:

- If you change the very last line of the repro steps to y.map(r => r).count(),
  the computation finishes correctly. In other words, the job hangs only when
  a shuffle occurs.
- If you do *not* specify spark.default.parallelism, the exact same repro code
  runs correctly to completion.
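
For reference, the shape of the repro is roughly as follows. This is a 
hand-written sketch, not the attached repro.scala verbatim; SimplePartition, 
the placeholder data, and the derived partition count are mine:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A minimal Partition carrying only its index.
case class SimplePartition(index: Int) extends Partition

// An RDD that runs a Spark job from inside getPartitions -- the
// seemingly "recursive" call this ticket is about.
class MyCustomRDD(sc: SparkContext) extends RDD[Int](sc, Nil) {
  override def getPartitions: Array[Partition] = {
    // Use the cluster itself to derive the partitioning information.
    val n = sc.parallelize(1 to 100).count().toInt
    Array.tabulate[Partition](n)(i => SimplePartition(i))
  }
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)
}

val y = new MyCustomRDD(sc)
y.map(r => (r, r)).reduceByKey(_ + _).count() // shuffle: hangs with spark.default.parallelism set
y.map(r => r).count()                         // no shuffle: finishes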

Let's set aside the DataFrame code I shared; I think it's distracting us from 
the issue. I'm not claiming that that piece of code executes in the same 
"spot" in the scheduler as the RDD code I'm providing. My intention was only 
to demonstrate that acquiring split information in parallel, using the cluster 
itself, has been done before.

Before we dismiss this as "by design", I would appreciate it if you'd explain 
why this is not supposed to work. I haven't found any documentation, or any 
references in the code, hinting that it isn't supported. It feels odd that not 
specifying spark.default.parallelism makes this work, or that things work fine 
as long as there's no shuffle.

My use case is this:
I'd like to acquire split information for my RDD lazily, in line with the 
Spark philosophy of doing work only when necessary (lazy evaluation). 
Computing it in getPartitions accomplishes exactly that. If this is not 
supported, is there an alternative you can suggest?
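
To make the use case concrete, the split lookup looks roughly like this. The 
AWS SDK's AmazonS3Client and its getObjectMetadata call are real, but the 
paths are placeholders, parseS3Path is a hypothetical helper, and our 
production code differs in the details:

import com.amazonaws.services.s3.AmazonS3Client

val filePaths = Seq("s3://bucket/a.parquet", "s3://bucket/b.parquet") // placeholders

val splitInfo = sc.parallelize(filePaths)
  .mapPartitions { paths =>
    val s3 = new AmazonS3Client() // built inside the task; the client isn't serializable
    paths.map { p =>
      val (bucket, key) = parseS3Path(p) // parseS3Path: hypothetical helper
      (p, s3.getObjectMetadata(bucket, key).getContentLength)
    }
  }
  .collect()

Using mapPartitions rather than map keeps it to one S3 client per task instead 
of one per file.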

Ideally, the following code executes "instantly":

val r = new MyCustomRDD(sc) // cheap: no partitions computed yet
r.map(a => a)               // transformations are lazy, so still no work

If I'm forced to pull the computation of partitions out of getPartitions, then 
construction itself has to do the work:

val r = new MyCustomRDD(sc) // computes partitions eagerly, even though no
                            // action has been performed against the RDD yet
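
One possible middle ground (my assumption; nothing in this ticket confirms it 
avoids the hang) is to keep the computation in getPartitions but force it from 
the calling thread before the first shuffle action, so the inner job never has 
to start from inside the scheduler:

val r = new MyCustomRDD(sc)
r.partitions // forces getPartitions here, on the caller's thread; RDDs memoize the result
r.map(a => (a, a)).reduceByKey(_ + _).count() // the scheduler now sees precomputed partitions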

Thanks for your time!

> Spark Scheduler Hangs Indefinitely
> ----------------------------------
>
>                 Key: SPARK-17777
>                 URL: https://issues.apache.org/jira/browse/SPARK-17777
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>         Environment: AWS EMR 4.3, can also be reproduced locally
>            Reporter: Ameen Tayyebi
>         Attachments: repro.scala
>
>
> We've identified a problem with Spark scheduling. The issue manifests itself 
> when an RDD calls SparkContext.parallelize within its getPartitions method. 
> This seemingly "recursive" call causes the problem. We have a repro case that 
> can easily be run.
> Please advise on what the issue might be and how we can work around it in 
> the meantime.
> I've attached repro.scala, which can simply be pasted into spark-shell to 
> reproduce the problem.
> Why are we calling sc.parallelize in production within getPartitions? Well, 
> we have an RDD that is composed of several thousand Parquet files. To 
> compute the partitioning strategy for this RDD, we create an RDD to read all 
> file sizes from S3 in parallel, so that we can quickly determine the proper 
> partitions. We do this to avoid executing the lookups serially from the 
> master node, which can result in significant slowness. Pseudo-code:
> val splitInfo = sc.parallelize(filePaths).map(f => (f, 
> s3.getObjectSummary)).collect()
> A similar logic is used in DataFrame by Spark itself:
> https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L902
>  
> Thanks,
> -Ameen


