GitHub user squito commented on the pull request:

    https://github.com/apache/spark/pull/7028#issuecomment-117692804
  
    Hi @aarondav @andrewor14, I would like to discuss this a little more.  I think I didn't understand your proposal earlier, and I didn't express my concerns properly -- sorry, I am going to backpedal on what I said before.
    
    My concern is not so much about the implementation details, but about how this looks to the end user.  Certainly this change is better than the status quo.  But I'd rather we add this the right way the first time than put something in now, make users learn this new exception format, and then update it down the line so users have to learn yet another format.  Understanding exceptions from Spark is already a big source of user confusion imo, so it's worth spending a bit of time on this.
    
    There are 3 stack traces here: (a) from the executor where the original error occurred, (b) from where the exception was handled in `DAGScheduler.handleTaskSetFailed` as part of the event loop, and (c) from where the user code triggered the job with an action like `rdd.count()`.  (b) is probably totally useless for most users -- it's only occasionally useful for a Spark developer -- but right now we give the user (a) + (b).
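    
    To make the three parts concrete, here is a toy driver program (my own illustration, not from either PR) that would produce all three.  The `map` closure throws on an executor, which is (a); the failure is surfaced through the DAGScheduler event loop, which is (b); and `count()` on the driver is the user call site, which is (c):
    
    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    
    object UhOh {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("uh-oh").setMaster("local"))
        val rdd = sc.parallelize(1 to 10).map[Int] { _ =>
          // (a): this closure runs on an executor; its trace is the original error
          throw new RuntimeException("uh-oh!")
        }
        // (c): the action below is the user call site on the driver; the failure
        // is delivered back to this thread through the DAGScheduler event loop (b)
        rdd.count()
      }
    }
    ```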
    
    This PR puts (b) + (c) together as if they are one stack trace.  I definitely think that is an improvement -- users get to see (c), and mostly they'll just ignore (b) anyway, so it doesn't much matter.  But there will definitely be times when a curious user tries to understand the rest of the stack trace.  Maybe they hit some Spark bug and want to understand it more before filing a JIRA, or they think that perhaps they are misusing Spark and the stack trace will help them understand better, etc.  In those cases, I think it'll cause a lot of confusion to have it appear that it's all one stack trace.
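    
    (For what it's worth, presenting (b) + (c) as one trace basically amounts to splicing stack frames together.  Here's a rough sketch of that idea -- my own illustration, not the actual patch:)
    
    ```scala
    // Illustration only, not the actual patch: splice the user call site (c)
    // onto the scheduler-side exception (b), with a synthetic frame marking the seam.
    def spliceCallSite(e: Throwable, callSite: Array[StackTraceElement]): Throwable = {
      val seam = new StackTraceElement("==== Job Submission ====", "", null, -1)
      e.setStackTrace(e.getStackTrace ++ (seam +: callSite))
      e
    }
    ```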
    
    If we did have them separated as three different stack traces, it would be much easier for a user to understand what happened in that case, and I think they'd also be much more likely to look for an explanation of what the different parts are.  (As I said earlier, I wouldn't think to consult documentation to understand what looks like one simple stack trace.)
    
    I put together an alternative implementation here: https://github.com/apache/spark/pull/7156 (sadly the tests don't run b/c the Scala compiler crashes running MiMa ... but it's just meant for discussion, not to actually merge in any case, so just pretend the tests pass ...).  That is an alternate way to get all the stack traces -- it makes (a) + (b) the cause of (c), so you get a stack trace like:
    
    ```
    org.apache.spark.SparkException: job failed
            at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:558)
            at org.apache.spark.SparkContext.runJob(SparkContext.scala:1741)
            at org.apache.spark.SparkContext.runJob(SparkContext.scala:1759)
            at org.apache.spark.SparkContext.runJob(SparkContext.scala:1774)
            at org.apache.spark.SparkContext.runJob(SparkContext.scala:1788)
            at org.apache.spark.rdd.RDD.count(RDD.scala:1095)
            at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$37$$anonfun$38.apply$mcJ$sp(DAGSchedulerSuite.scala:883)
    ...[snip]
            at java.lang.Thread.run(Thread.java:745)
    Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.RuntimeException: uh-oh!
            at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$37$$anonfun$38$$anonfun$apply$mcJ$sp$2.apply(DAGSchedulerSuite.scala:883)
            at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$37$$anonfun$38$$anonfun$apply$mcJ$sp$2.apply(DAGSchedulerSuite.scala:883)
            at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
            at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1627)
            at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1095)
            at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1095)
            at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1774)
            at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1774)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
            at org.apache.spark.scheduler.Task.run(Task.scala:70)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    
    Driver stacktrace:
            at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
            at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1285)
    ...[snip]
    ```
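    
    (The core of that alternative is just rethrowing from the calling thread with the original failure attached as the cause.  A minimal sketch of the idea -- not the actual diff in that PR:)
    
    ```scala
    import org.apache.spark.SparkException
    
    // Minimal sketch, not the actual diff: throw a *new* exception from the
    // thread that called runJob, so its stack trace is the user call site (c),
    // and attach the scheduler/executor failure ((a) + (b)) as the cause.
    def rethrowFromCaller(schedulerFailure: Throwable): Nothing =
      throw new SparkException("job failed", schedulerFailure)
    ```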
    
    I don't love that version either -- we're still doing a bit of magic that could confuse somebody if they look too closely.  Though I do think a separate causing exception is at least a cleaner break than something that just looks like part of the call stack.  I'd really prefer it if we could keep all three separate.
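    
    (If we kept all three, the natural shape would be a two-level cause chain, so that each trace prints as its own `Caused by:` section.  Again, just a sketch under my own naming, not code from either PR:)
    
    ```scala
    import org.apache.spark.SparkException
    
    // Sketch: keep (a), (b), and (c) as three distinct stack traces by chaining
    // causes -- (c) wraps (b), which wraps (a).
    def failJob(executorError: Throwable): Nothing = {
      // (b): created inside the scheduler event loop, with (a) as its cause
      val driverSide = new SparkException("Job aborted due to stage failure", executorError)
      // (c): rethrown from the user's calling thread, with (b) as its cause
      throw new SparkException("job failed", driverSide)
    }
    ```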
    
    Sorry to backtrack on my earlier comment.  I don't mean to create an unnecessary obstacle here.  I do think the patch as-is is still an improvement -- I suppose the curious user could just search the code for `==== Job Submission ====`, and we could add a clear comment in the code explaining what is going on.  I haven't really thought through the implementation in any great detail, so maybe it isn't worth doing something more complicated.  (Also, I really haven't thought at all about what is going on w/ streaming -- perhaps that also makes a bigger change impossible.)
    
    thanks for working on this and taking the time to discuss w/ me and read 
through my rambling :)

