The repr() trick is neat when working in a notebook. When working in a
library, I used to use an evaluate(dataframe) -> DataFrame function that
simply forces the materialization of a dataframe. As Reynold mentions, this
is very convenient when working with a lot of chained UDFs, and it is a
standard trick in lazy environments and languages.
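
A minimal sketch of such a helper (assuming PySpark; count() is just one
convenient action for forcing the whole plan to run):

    from pyspark.sql import DataFrame

    def evaluate(df: DataFrame) -> DataFrame:
        # Run an action so the full plan executes now; any error in the
        # upstream transformations surfaces here rather than much later.
        df.count()
        return df

Dropping evaluate(...) between steps pins a failure to the step that produced
it, at the cost of running extra jobs.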

Tim

On Wed, May 9, 2018 at 3:26 AM, Reynold Xin <r...@databricks.com> wrote:

> Yes, would be great if possible, but it’s non-trivial (might be impossible
> to do in general; we already have stack traces that point to line numbers
> when an error occurs in UDFs, but clearly that’s not sufficient). Also, in
> environments like the REPL it’s still more useful to show the error as soon
> as it occurs, rather than showing it potentially 30 lines later.
>
> On Tue, May 8, 2018 at 7:22 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> This may be technically impractical, but it would be fantastic if we
>> could make it easier to debug Spark programs without needing to rely on
>> eager execution. Sprinkling .count() and .checkpoint() at various points
>> in my code is still a debugging technique I use, but it always makes me
>> wish Spark could point more directly to the offending transformation when
>> something goes wrong.
>>
>> Is it somehow possible to have each individual operator (is that the
>> correct term?) in a DAG include metadata pointing back to the line of code
>> that generated the operator? That way when an action triggers an error, the
>> failing operation can point to the relevant line of code — even if it’s a
>> transformation — and not just the action on the tail end that triggered the
>> error.
>>
>> I don’t know how feasible this is, but addressing it would directly solve
>> the issue of linking failures to the responsible transformation, as opposed
>> to leaving the user to break up a chain of transformations with several
>> debug actions. And this would benefit new and experienced users alike.
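
For illustration only, a driver-side sketch in PySpark of the bookkeeping this
would need (the wrapper and the _creation_site attribute are hypothetical; real
support would have to carry the call site into the query plan so that
executor-side errors could report it):

    import functools
    import traceback

    from pyspark.sql import DataFrame

    def _with_callsite(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            result = method(self, *args, **kwargs)
            if isinstance(result, DataFrame):
                # The frame just below this wrapper is the user's call site.
                frame = traceback.extract_stack(limit=2)[0]
                result._creation_site = f"{frame.filename}:{frame.lineno}"
            return result
        return wrapper

    # Tag a couple of common transformations with the line that created them.
    DataFrame.filter = _with_callsite(DataFrame.filter)
    DataFrame.select = _with_callsite(DataFrame.select)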
>>
>> Nick
>>
>> On Tue, May 8, 2018 at 7:09 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>>> I've opened SPARK-24215 to track this.
>>>
>>> On Tue, May 8, 2018 at 3:58 PM, Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Yup, sounds great. This is something simple Spark can do that would
>>>> provide huge value to end users.
>>>>
>>>>
>>>> On Tue, May 8, 2018 at 3:53 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> Would be great if it is something more turn-key.
>>>>>
>>>>> We can easily add the __repr__ and _repr_html_ methods and behavior
>>>>> to PySpark classes. We could also add a configuration property to
>>>>> determine whether the dataset evaluation is eager or not. That would
>>>>> make it turn-key for anyone running PySpark in Jupyter.
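
As a rough sketch of that idea in PySpark (the property name below is
illustrative rather than an existing setting, and the HTML is deliberately
bare-bones):

    from pyspark.sql import DataFrame, SparkSession

    def _df_repr_html(self):
        spark = SparkSession.builder.getOrCreate()
        # Hypothetical flag: only render eagerly when the user opts in.
        if spark.conf.get("spark.sql.repl.eagerEval.enabled", "false") != "true":
            return None  # Jupyter then falls back to the plain repr
        rows = self.take(10)  # eager, but bounded
        header = "".join("<th>%s</th>" % c for c in self.columns)
        body = "".join(
            "<tr>" + "".join("<td>%s</td>" % v for v in row) + "</tr>"
            for row in rows)
        return "<table><tr>%s</tr>%s</table>" % (header, body)

    DataFrame._repr_html_ = _df_repr_html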
>>>>>
>>>>> For JVM languages, we could also add a dependency on jvm-repr and do
>>>>> the same thing.
>>>>>
>>>>> rb
>>>>>
>>>>>
>>>>> On Tue, May 8, 2018 at 3:47 PM, Reynold Xin <r...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> s/underestimated/overestimated/
>>>>>>
>>>>>> On Tue, May 8, 2018 at 3:44 PM Reynold Xin <r...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Marco,
>>>>>>>
>>>>>>> There is understanding how Spark works, and there is finding bugs
>>>>>>> early in their own program. One can perfectly understand how Spark works
>>>>>>> and still find it valuable to get feedback asap, and that's why we built
>>>>>>> eager analysis in the first place.
>>>>>>>
>>>>>>> Also I'm afraid you've significantly underestimated the level of
>>>>>>> technical sophistication of users. In many cases they struggle to get
>>>>>>> anything to work, and performance optimization of their programs is
>>>>>>> secondary to getting things working. As John Ousterhout says, "the
>>>>>>> greatest performance improvement of all is when a system goes from
>>>>>>> not-working to working".
>>>>>>>
>>>>>>> I really like Ryan's approach. Would be great if it is something
>>>>>>> more turn-key.
>>>>>>>
>>>>>>> On Tue, May 8, 2018 at 2:35 PM Marco Gaido <marcogaid...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I am not sure how useful this is. For students, it is important to
>>>>>>>> understand how Spark works. This can be critical in many decisions they
>>>>>>>> have to make (whether and what to cache, for instance) in order to write
>>>>>>>> performant Spark applications. An eager execution mode can probably help
>>>>>>>> them get something running more easily, but it also lets them use Spark
>>>>>>>> while knowing less about how it works, so they are likely to write worse
>>>>>>>> applications and to have more trouble debugging any kind of problem that
>>>>>>>> occurs later (in production), which ultimately hurts their experience
>>>>>>>> with the tool.
>>>>>>>>
>>>>>>>> Moreover, as Ryan also mentioned, there are tools/ways to force
>>>>>>>> execution that help in the debugging phase. So they can achieve the same
>>>>>>>> result without much effort, but with a big difference: they are aware of
>>>>>>>> what is really happening, which may help them later.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Marco
>>>>>>>>
>>>>>>>> 2018-05-08 21:37 GMT+02:00 Ryan Blue <rb...@netflix.com.invalid>:
>>>>>>>>
>>>>>>>>> At Netflix, we use Jupyter notebooks and consoles for interactive
>>>>>>>>> sessions. For anyone interested, this mode of interaction is really
>>>>>>>>> easy to add in Jupyter and PySpark. You would just define a different
>>>>>>>>> _repr_html_ or __repr__ method for Dataset that runs a take(10) or
>>>>>>>>> take(100) and formats the result.
>>>>>>>>>
>>>>>>>>> That way, the output of a cell or console execution always causes
>>>>>>>>> the dataframe to run and get displayed for that immediate feedback.
>>>>>>>>> But there is no change to Spark’s behavior, because the action is run
>>>>>>>>> by the REPL, and only when a dataframe is the result of an execution,
>>>>>>>>> in order to display it. Intermediate results wouldn’t be run, which
>>>>>>>>> gives users a way to avoid too many executions and would still support
>>>>>>>>> method chaining in the dataframe API (which would be horrible with an
>>>>>>>>> aggressive execution model).
>>>>>>>>>
>>>>>>>>> There are ways to do this in JVM languages as well if you are
>>>>>>>>> using a Scala or Java interpreter (see jvm-repr
>>>>>>>>> <https://github.com/jupyter/jvm-repr>). This is actually what we
>>>>>>>>> do in our Spark-based SQL interpreter to display results.
>>>>>>>>>
>>>>>>>>> rb
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, May 8, 2018 at 12:05 PM, Koert Kuipers <ko...@tresata.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Yeah, we run into this all the time with new hires. They will send
>>>>>>>>>> emails explaining there is an error in the .write operation, and they
>>>>>>>>>> are debugging the writing to disk, focusing on that piece of code :)
>>>>>>>>>>
>>>>>>>>>> Unrelated, but another frequent cause of confusion is cascading
>>>>>>>>>> errors, like the FetchFailedException. They will be debugging the
>>>>>>>>>> reducer task, not realizing the error happened before that and that
>>>>>>>>>> the FetchFailedException is not the root cause.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, May 8, 2018 at 2:52 PM, Reynold Xin <r...@databricks.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Similar to the thread yesterday about improving ML/DL
>>>>>>>>>>> integration, I'm sending another email on what I've learned recently
>>>>>>>>>>> from Spark users. I recently talked to some educators who have been
>>>>>>>>>>> teaching Spark in their (top-tier) university classes. They are some
>>>>>>>>>>> of the most important users for adoption because of the
>>>>>>>>>>> multiplicative effect they have on the future generation.
>>>>>>>>>>>
>>>>>>>>>>> To my surprise, the single biggest ask is to enable an eager
>>>>>>>>>>> execution mode on all operations, for teaching and debuggability:
>>>>>>>>>>>
>>>>>>>>>>> (1) Most of the students are relatively new to programming, and
>>>>>>>>>>> they need multiple iterations to even get the most basic operation
>>>>>>>>>>> right. In these cases, in order to trigger an error, they would need
>>>>>>>>>>> to explicitly add actions, which is non-intuitive.
>>>>>>>>>>>
>>>>>>>>>>> (2) If they don't add explicit actions to every operation and
>>>>>>>>>>> there is a mistake, the error pops up somewhere later, where an
>>>>>>>>>>> action is triggered. This is in a different position from the code
>>>>>>>>>>> that causes the problem, which makes it difficult for students to
>>>>>>>>>>> correlate the two.
>>>>>>>>>>>
>>>>>>>>>>> I suspect a lot of Spark users in the real world also struggle in
>>>>>>>>>>> ways similar to these students. While eager execution is really not
>>>>>>>>>>> practical in big data, it can be pretty helpful in learning
>>>>>>>>>>> environments or in development against small, sampled datasets.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>
