Re: eager execution and debuggability

2018-05-21 Thread Ryan Blue
+1 to job and stage info in the SQL visualization. This is one of the hardest areas for both users and our data platform team to understand. We've resorted to logging the plan that is compiled in `WholeStageCodegenExec` so that at least we can go from a stage back to the plan it came from, but there's
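Netflix's workaround patches Spark itself; a rough user-level approximation of the same idea, relying on the internal `_jdf.queryExecution()` accessor (not a stable API) and a hypothetical run_logged helper, might look like:

    import logging

    def run_logged(spark, df, job_group):
        # Tag the jobs so stages in the Spark UI carry this group id, then
        # log the physical plan alongside it for later correlation.
        spark.sparkContext.setJobGroup(job_group, "plan-logged query")
        logging.info("plan for %s:\n%s", job_group,
                     df._jdf.queryExecution().executedPlan().toString())
        return df.collect()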

Re: eager execution and debuggability

2018-05-10 Thread Ryan Blue
> it would be fantastic if we could make it easier to debug Spark programs without needing to rely on eager execution.
I agree, it would be great if we could make the errors clearer about where the failure happened (in user code or in Spark code) and what assumption was violated. The problem is

Re: eager execution and debuggability

2018-05-10 Thread Lalwani, Jayesh
> Marco, There is understanding how Spark works, and there is finding bugs early in their

Re: eager execution and debuggability

2018-05-09 Thread Tim Hunter
The repr() trick is neat when working in a notebook. When working in a library, I used to use an evaluate(dataframe) -> DataFrame function that simply forces the materialization of a dataframe. As Reynold mentions, this is very convenient when working on a lot of chained UDFs, and it is a standard
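A minimal sketch of such an evaluate() helper, assuming PySpark; cache-then-count is one common way to force materialization, not necessarily Tim's exact implementation:

    from pyspark.sql import DataFrame

    def evaluate(df: DataFrame) -> DataFrame:
        """Force materialization so errors surface at this call site."""
        df = df.cache()
        df.count()  # triggers execution of the full lineage up to this point
        return df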

Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
Yes, would be great if possible, but it's non-trivial (might be impossible to do in general; we already have stacktraces that point to line numbers when an error occurs in UDFs, but clearly that's not sufficient). Also, in environments like the REPL it's still more useful to show the error as soon as it

Re: eager execution and debuggability

2018-05-08 Thread Nicholas Chammas
This may be technically impractical, but it would be fantastic if we could make it easier to debug Spark programs without needing to rely on eager execution. Sprinkling .count() and .checkpoint() at various points in my code is still a debugging technique I use, but it always makes me wish Spark
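In practice that technique looks something like the sketch below; the paths, columns, and the dims DataFrame are illustrative, and checkpoint() additionally requires sc.setCheckpointDir() to have been called:

    from pyspark.sql.functions import to_date

    df = spark.read.parquet("/data/events")   # illustrative path
    df = df.withColumn("day", to_date("ts"))
    df.count()   # force execution: a failure here implicates the lines above
    df = df.join(dims, "key")
    df = df.checkpoint()  # materializes and truncates lineage for later steps
    df.write.parquet("/data/out")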

Re: eager execution and debuggability

2018-05-08 Thread Ryan Blue
I've opened SPARK-24215 to track this. On Tue, May 8, 2018 at 3:58 PM, Reynold Xin wrote: > Yup. Sounds great. This is something simple Spark can do and provide huge value to the end users.

Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
Yup. Sounds great. This is something simple Spark can do and provide huge value to the end users. On Tue, May 8, 2018 at 3:53 PM Ryan Blue wrote: > Would be great if it is something more turn-key. We can easily add the __repr__ and _repr_html_ methods and behavior to

Re: eager execution and debuggability

2018-05-08 Thread Ryan Blue
Would be great if it is something more turn-key. We can easily add the __repr__ and _repr_html_ methods and behavior to PySpark classes. We could also add a configuration property to determine whether the dataset evaluation is eager or not. That would make it turn-key for anyone running PySpark
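A rough sketch of what a config-gated repr could look like; the property name matches what SPARK-24215 eventually shipped as spark.sql.repl.eagerEval.enabled in Spark 2.4, but the wiring below is illustrative and _jdf.showString is an internal accessor:

    from pyspark.sql import DataFrame, SparkSession

    def _df_repr(self):
        spark = SparkSession.builder.getOrCreate()
        if spark.conf.get("spark.sql.repl.eagerEval.enabled", "false") == "true":
            # Eager path: run the query and render the first rows as text.
            return self._jdf.showString(10, 20, False)
        # Lazy path: the familiar schema-only repr.
        return "DataFrame[%s]" % ", ".join(
            "%s: %s" % (f.name, f.dataType.simpleString())
            for f in self.schema.fields)

    DataFrame.__repr__ = _df_repr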

Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
s/underestimated/overestimated/ On Tue, May 8, 2018 at 3:44 PM Reynold Xin wrote: > Marco, There is understanding how Spark works, and there is finding bugs early in their own program. One can perfectly understand how Spark works and still find it valuable to get

Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
Marco, There is understanding how Spark works, and there is finding bugs early in their own program. One can perfectly understand how Spark works and still find it valuable to get feedback asap, and that's why we built eager analysis in the first place. Also I'm afraid you've significantly

Re: eager execution and debuggability

2018-05-08 Thread Marco Gaido
I am not sure how this is useful. For students, it is important to understand how Spark works. This can be critical in many decisions they have to take (whether and what to cache, for instance) in order to have performant Spark applications. Creating an eager execution probably can help them having

Re: eager execution and debuggability

2018-05-08 Thread Ryan Blue
At Netflix, we use Jupyter notebooks and consoles for interactive sessions. For anyone interested, this mode of interaction is really easy to add in Jupyter and PySpark. You would just define a different _repr_html_ or __repr__ method for Dataset that runs a take(10) or take(100) and formats the
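A hedged sketch of that Jupyter hook; the HTML formatting is illustrative, and Jupyter picks up _repr_html_ automatically when displaying an object:

    from pyspark.sql import DataFrame

    def _df_repr_html(self):
        rows = self.take(10)  # eager: runs the query for a small sample
        header = "".join("<th>%s</th>" % c for c in self.columns)
        body = "".join(
            "<tr>%s</tr>" % "".join("<td>%s</td>" % v for v in row)
            for row in rows)
        return "<table><tr>%s</tr>%s</table>" % (header, body)

    DataFrame._repr_html_ = _df_repr_html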

Re: eager execution and debuggability

2018-05-08 Thread Koert Kuipers
Yeah, we run into this all the time with new hires. They will send emails explaining there is an error in the .write operation and they are debugging the writing to disk, focusing on that piece of code :) Unrelated, but another frequent cause of confusion is cascading errors, like the

eager execution and debuggability

2018-05-08 Thread Reynold Xin
Similar to the thread yesterday about improving ML/DL integration, I'm sending another email on what I've learned recently from Spark users. I recently talked to some educators who have been teaching Spark in their (top-tier) university classes. They are some of the most important users for