Github user squito commented on the pull request: https://github.com/apache/spark/pull/7028#issuecomment-117692804

Hi @aarondav @andrewor14, I would like to discuss this a little more. I think I didn't understand your proposal earlier and I didn't really express my concerns properly -- sorry, I am going to backpedal on what I said before. My concern is not so much about the implementation details, but how this looks to the end user. Certainly this change is better than the status quo. But I'd rather we add this the right way the first time, rather than put in something now, make users learn this new exception format, and then update it down the line and force users to learn yet another format. Understanding exceptions from Spark is already a big source of user confusion imo, so it's worth spending a bit of time on this.

There are 3 stack traces here: (a) from the executor where the original error occurred, (b) where the exception was handled in `DAGScheduler.handleTaskSetFailed` as part of the event loop, and (c) where the user code triggered the job with an action like `rdd.count()`. (b) is probably totally useless for most users -- it's only occasionally useful for a Spark developer -- but right now we give the user (a) + (b). This PR puts (b) + (c) together as if they were one stack trace. I definitely think that is an improvement: users get to see (c), and mostly they'll just ignore (b) anyway, so it mostly doesn't matter. But there will definitely be times when a curious user tries to understand the rest of the stack trace. Maybe they hit some Spark bug and want to understand it more before filing a jira, or they think that perhaps they are misusing Spark and the stack trace will help them understand better, etc. In those cases, I think it'll cause a lot of confusion to have it appear that it's all one stack trace.
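To make the concern concrete, here is a minimal sketch (class and method names are my invention, not Spark's actual code) of what "splicing" the scheduler-side trace (b) onto the user-side exception (c) looks like with plain `java.lang.Throwable` APIs -- roughly the effect of the approach in this PR, with a synthetic marker frame as the only seam between the two traces:

```java
// Hypothetical sketch, not Spark's implementation: fuse two stack traces
// into one so they print as a single call stack.
public class StackSplice {
    public static Throwable splice(Throwable userSide, Throwable schedulerSide) {
        StackTraceElement[] user = userSide.getStackTrace();
        StackTraceElement[] sched = schedulerSide.getStackTrace();
        // A fake frame acting as a visual divider between the two traces;
        // to a reader it just looks like one more line of the call stack.
        StackTraceElement marker =
            new StackTraceElement("==== Job Submission ====", "", "", -1);
        StackTraceElement[] combined =
            new StackTraceElement[sched.length + 1 + user.length];
        System.arraycopy(sched, 0, combined, 0, sched.length);
        combined[sched.length] = marker;
        System.arraycopy(user, 0, combined, sched.length + 1, user.length);
        userSide.setStackTrace(combined);
        return userSide;
    }
}
```

The result is a single exception whose `printStackTrace()` shows both traces as one list of frames -- exactly the "looks like one stack trace" property being debated here.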
If we did have them separated as three different stack traces, it would be much easier for a user to understand in that case, and I think they'd also be much more likely to look for an explanation of what the different parts are. (As I said earlier, I would not think I needed to consult documentation to understand one simple stack trace.) I put together an alternative implementation here: https://github.com/apache/spark/pull/7156 (sadly the tests don't run b/c the scala compiler crashes running mima ... but it's just meant for discussion, not to actually merge in any case, so just pretend the tests pass ...). That is an alternate way to get all the stack traces -- it makes (a) + (b) the cause of (c), so you get a stack trace like:

```
org.apache.spark.SparkException: job failed
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:558)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1741)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1759)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1774)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1788)
    at org.apache.spark.rdd.RDD.count(RDD.scala:1095)
    at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$37$$anonfun$38.apply$mcJ$sp(DAGSchedulerSuite.scala:883)
    ...[snip]
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.RuntimeException: uh-oh!
    at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$37$$anonfun$38$$anonfun$apply$mcJ$sp$2.apply(DAGSchedulerSuite.scala:883)
    at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$37$$anonfun$38$$anonfun$apply$mcJ$sp$2.apply(DAGSchedulerSuite.scala:883)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1627)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1095)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1095)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1774)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1774)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1285)
    ...[snip]
```

I don't love that version either -- we're still doing a bit of magic that could confuse somebody if they look too closely. Though I do think a separate causing exception is at least a clearer break, rather than something that just looks like part of the call stack. I'd really prefer if we could keep all three separate. Sorry to backtrack on my earlier comment; I don't mean to create an unnecessary obstacle here. I do think the patch as-is is still an improvement -- I suppose the curious user could just search the code for `==== Job Submission ====`, and we could add a clear comment in the code explaining what is going on.
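For comparison, the cause-based layout amounts to ordinary exception chaining (again a simplified illustration with invented names, not the actual patch): construct a fresh exception at the job-submission site so its own stack trace captures (c), and attach the scheduler/executor failure carrying (a) + (b) as the cause.

```java
// Hypothetical sketch of the alternative: the top-level exception owns
// trace (c); the cause owns (a) + (b). printStackTrace() then renders a
// distinct "Caused by:" section rather than one fused call stack.
public class CauseChaining {
    public static RuntimeException jobFailure(Throwable schedulerSide) {
        // The exception constructed here captures the driver-side call
        // stack at the point the user's action triggered the job.
        return new RuntimeException("job failed", schedulerSide);
    }
}
```

Because the JVM's default `printStackTrace()` already knows how to render causes, no marker frames or trace surgery are needed to keep the parts visually separate.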
I haven't really thought about the implementation in any great detail, so maybe it isn't worth it to do something more complicated. (Also, I really haven't thought at all about what is going on w/ streaming -- perhaps that also makes it impossible to make a bigger change.) Thanks for working on this and taking the time to discuss w/ me and read through my rambling :)