I've posted a patch that I think produces the correct behavior at
https://github.com/kellrott/incubator-spark/commit/efe1102c8a7436b2fe112d3bece9f35fedea0dc8

It works fine in my own programs, but when I run the unit tests, I get
errors like:

[info] - large number of iterations *** FAILED ***
[info]   org.apache.spark.SparkException: Job aborted: Task 4.0:0 failed more than 0 times; aborting job java.lang.ClassCastException: scala.collection.immutable.StreamIterator cannot be cast to scala.collection.mutable.ArrayBuffer
[info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:818)
[info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:816)
[info]   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
[info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
[info]   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:816)
[info]   at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:431)
[info]   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:493)
[info]   at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:158)
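
For reference, the ClassCastException itself is easy to reproduce outside
Spark. Here's a minimal sketch (purely illustrative, not code from my patch;
assumes the Scala 2.10 collections): a Stream's iterator is a
scala.collection.immutable.StreamIterator, so a blind cast to ArrayBuffer
fails the checkcast at runtime:

import scala.collection.mutable.ArrayBuffer

object CastRepro {
  def main(args: Array[String]): Unit = {
    // Stream#iterator returns a scala.collection.immutable.StreamIterator
    val values: Any = Stream(1, 2, 3).iterator
    // This checkcast throws the same exception as above:
    //   java.lang.ClassCastException: scala.collection.immutable.StreamIterator
    //   cannot be cast to scala.collection.mutable.ArrayBuffer
    val buf = values.asInstanceOf[ArrayBuffer[Int]]
    println(buf)
  }
}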


I can't figure out the line number where the original error occurred, or
why I can't replicate it in my various test programs.
Any help would be appreciated.

Kyle

On Tue, Nov 12, 2013 at 11:35 AM, Alex Boisvert <[email protected]> wrote:

> On Tue, Nov 12, 2013 at 11:07 AM, Stephen Haberman <[email protected]> wrote:
>
> > Huge disclaimer that this is probably a big pita to implement, and
> > may not be as worthwhile as I naively think it would be.
> >
>
> My perspective on this is that it's already a big pita for Spark users today.
>
> In the absence of explicit directions/hints, Spark should be able to make
> ballpark estimates and conservatively pick # of partitions, storage
> strategies (e.g., memory vs disk) and other runtime parameters that fit the
> deployment architecture/capacities. If this requires code and extra
> runtime resources for sampling/measuring data, guesstimating job size, and
> so on, so be it.
>
> Users want working jobs first.  Optimal performance / resource utilization
> follow from that.
>
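
To make the sampling idea above concrete, here's the kind of
back-of-the-envelope estimate I imagine (an entirely hypothetical sketch,
not an existing Spark API): sample a few serialized record sizes,
extrapolate the job's total size, and pick a partition count that keeps
each partition under a target size.

// Hypothetical helper, not an existing Spark API.
object PartitionEstimate {
  def estimateNumPartitions(sampledRecordBytes: Seq[Long],
                            estimatedTotalRecords: Long,
                            targetPartitionBytes: Long = 64L << 20): Int = {
    // Extrapolate total data size from the sampled record sizes...
    val avgBytes = sampledRecordBytes.sum.toDouble / sampledRecordBytes.size
    val totalBytes = avgBytes * estimatedTotalRecords
    // ...and keep each partition under the target size.
    math.max(1, math.ceil(totalBytes / targetPartitionBytes).toInt)
  }

  def main(args: Array[String]): Unit = {
    // ~1 KB records, 10M of them, 64 MB target -> ~155 partitions
    println(estimateNumPartitions(Seq(980L, 1100L, 1024L), 10000000L))
  }
}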
