Re: eager execution and debuggability

2018-05-21 Thread Ryan Blue
+1 to job and stage info in the SQL visualization. This is one of the most difficult places for both users and our data platform team to understand. We've resorted to logging the plan that is compiled in `WholeStageCodegenExec` so at least we can go from a stage to what the plan was, but there's no…

Re: eager execution and debuggability

2018-05-14 Thread Tomasz Gawęda
Hi, > I agree, it would be great if we could make the errors more clear about where the error happened (user code or in Spark code) and what assumption was violated. The problem is that this is a really hard thing to do generally, like Reynold said. I think we should look for individual cases…

Re: eager execution and debuggability

2018-05-10 Thread Ryan Blue
> it would be fantastic if we could make it easier to debug Spark programs without needing to rely on eager execution. I agree, it would be great if we could make the errors more clear about where the error happened (user code or in Spark code) and what assumption was violated. The problem is that this is a really hard thing to do generally, like Reynold said…

Re: eager execution and debuggability

2018-05-10 Thread Lalwani, Jayesh
> Marco, There is understanding how Spark works, and there is finding bugs early in their own program. One can perfectly understand how Spark works and still find it valuable to get feedback asap, and that's why we built eager analysis in the first place…

Re: eager execution and debuggability

2018-05-09 Thread Tim Hunter
The repr() trick is neat when working in a notebook. When working in a library, I used to use an evaluate(dataframe) -> DataFrame function that simply forces the materialization of a dataframe. As Reynold mentions, this is very convenient when working on a lot of chained UDFs, and it is a standard…
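
A minimal sketch of the evaluate() helper Tim describes, assuming a count() action is how it forces the plan to run (his actual implementation may have cached or checkpointed instead):

```python
from pyspark.sql import DataFrame

def evaluate(df: DataFrame) -> DataFrame:
    """Force materialization of a DataFrame and return it unchanged.

    Useful between chained UDFs: any error in the lineage built so far
    surfaces at this call rather than at a distant downstream action.
    """
    df.count()  # an action that evaluates the full plan built so far
    return df
```

In a pipeline this reads as df = evaluate(df.withColumn(...)), so each step is checked as it is built.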

Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
Yes, it would be great if possible, but it's non-trivial (it might be impossible to do in general; we already have stack traces that point to line numbers when an error occurs in UDFs, but clearly that's not sufficient). Also, in environments like a REPL it's still more useful to show the error as soon as it occurs…

Re: eager execution and debuggability

2018-05-08 Thread Nicholas Chammas
This may be technically impractical, but it would be fantastic if we could make it easier to debug Spark programs without needing to rely on eager execution. Sprinkling .count() and .checkpoint() at various points in my code is still a debugging technique I use, but it always makes me wish Spark could…
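
The sprinkled-actions technique Nicholas mentions might look like this sketch; the input path, column names, and transformation are hypothetical:

```python
# Hedged sketch: force evaluation mid-pipeline so a failure surfaces
# near the transformation that caused it, not at the final write.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/events")            # hypothetical input
df = df.withColumn("day", F.to_date(F.col("ts")))  # suspect step
df.count()  # debugging probe: a bad "ts" column fails on this line
            # instead of much later at the write

daily = df.groupBy("day").count()
daily.write.mode("overwrite").parquet("/data/daily_counts")
```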

Re: eager execution and debuggability

2018-05-08 Thread Ryan Blue
I've opened SPARK-24215 to track this. On Tue, May 8, 2018 at 3:58 PM, Reynold Xin wrote: > Yup. Sounds great. This is something simple Spark can do and provide huge value to the end users. On Tue, May 8, 2018 at 3:53 PM Ryan Blue wrote: >> Would be great if it is something more turn-key…

Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
Yup. Sounds great. This is something simple Spark can do and provide huge value to the end users. On Tue, May 8, 2018 at 3:53 PM Ryan Blue wrote: > Would be great if it is something more turn-key. We can easily add the __repr__ and _repr_html_ methods and behavior to PySpark classes. We could…

Re: eager execution and debuggability

2018-05-08 Thread Ryan Blue
Would be great if it is something more turn-key. We can easily add the __repr__ and _repr_html_ methods and behavior to PySpark classes. We could also add a configuration property to determine whether the dataset evaluation is eager or not. That would make it turn-key for anyone running PySpark in…
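
A sketch of what the configuration-gated behavior Ryan proposes could look like. The property name spark.sql.repl.eagerEval.enabled is the one SPARK-24215 eventually shipped in Spark 2.4; the rest is illustrative, not the actual implementation:

```python
from pyspark.sql import DataFrame

def _repr_html_(self):
    # Evaluate eagerly only when the user opted in via configuration.
    # self.sparkSession is the PySpark 3.x accessor; older versions
    # reached the session through self.sql_ctx instead.
    enabled = self.sparkSession.conf.get(
        "spark.sql.repl.eagerEval.enabled", "false")
    if enabled != "true":
        return None  # Jupyter falls back to the default lazy __repr__
    rows = self.limit(10).collect()  # small, bounded evaluation
    header = "".join(f"<th>{c}</th>" for c in self.columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{v}</td>" for v in row) + "</tr>"
        for row in rows)
    return f"<table><tr>{header}</tr>{body}</table>"

DataFrame._repr_html_ = _repr_html_
```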

Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
s/underestimated/overestimated/ On Tue, May 8, 2018 at 3:44 PM Reynold Xin wrote: > Marco, There is understanding how Spark works, and there is finding bugs early in their own program. One can perfectly understand how Spark works and still find it valuable to get feedback asap, and that's why we built eager analysis in the first place…

Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
Marco, There is understanding how Spark works, and there is finding bugs early in their own program. One can perfectly understand how Spark works and still find it valuable to get feedback asap, and that's why we built eager analysis in the first place. Also, I'm afraid you've significantly underestimated…

Re: eager execution and debuggability

2018-05-08 Thread Marco Gaido
I am not sure how this is useful. For students, it is important to understand how Spark works. This can be critical in many decisions they have to make (whether and what to cache, for instance) in order to have a performant Spark application. Creating an eager execution mode can probably help them have some…

Re: eager execution and debuggability

2018-05-08 Thread Ryan Blue
At Netflix, we use Jupyter notebooks and consoles for interactive sessions. For anyone interested, this mode of interaction is really easy to add in Jupyter and PySpark. You would just define a different _repr_html_ or __repr__ method for Dataset that runs a take(10) or take(100) and formats the result…
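
A plain-text variant of the same trick, closer to what Ryan describes for consoles; the helper name and tab-separated formatting are assumptions:

```python
from pyspark.sql import DataFrame

def _eager_repr(self):
    rows = self.take(10)  # bounded eager evaluation of the plan
    lines = ["\t".join(self.columns)]
    lines += ["\t".join(str(v) for v in row) for row in rows]
    return "\n".join(lines)

# Monkey-patch: now simply typing a DataFrame's name in a console or
# notebook cell evaluates it and prints its first rows.
DataFrame.__repr__ = _eager_repr
```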

Re: eager execution and debuggability

2018-05-08 Thread Koert Kuipers
Yeah, we run into this all the time with new hires. They will send emails explaining there is an error in the .write operation, and they are debugging the writing to disk, focusing on that piece of code :) Unrelated, but another frequent cause for confusion is cascading errors, like the FetchFailedException…

eager execution and debuggability

2018-05-08 Thread Reynold Xin
Similar to the thread yesterday about improving ML/DL integration, I'm sending another email on what I've learned recently from Spark users. I recently talked to some educators who have been teaching Spark in their (top-tier) university classes. They are some of the most important users for adoption…