The repr() trick is neat when working in a notebook. When working in a
library, I used to use an evaluate(dataframe) -> DataFrame function that
simply forces the materialization of a dataframe. As Reynold mentions, this
is very convenient when working with a lot of chained UDFs, and it is a
standard trick in lazy environments and languages.

Tim
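For reference, a minimal PySpark sketch of such an evaluate() helper. The
thread does not show the original implementation, so the cache-then-count
strategy below is an assumption; any action that touches every partition
would do.

    from pyspark.sql import DataFrame


    def evaluate(df: DataFrame) -> DataFrame:
        """Force materialization of a DataFrame so that errors in its
        lineage surface here instead of at some later action (a sketch;
        the original helper's internals are not shown in the thread)."""
        df.cache()   # keep the materialized result so it is not recomputed
        df.count()   # an action that computes every partition
        return df

Calling evaluate(df) between steps of a long chain of UDFs makes each step
fail fast at the call site instead of at the final action.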
On Wed, May 9, 2018 at 3:26 AM, Reynold Xin <r...@databricks.com> wrote:

> Yes, it would be great if possible, but it's non-trivial (it might be
> impossible to do in general; we already have stack traces that point to
> line numbers when an error occurs in UDFs, but clearly that's not
> sufficient). Also, in environments like a REPL it's still more useful to
> show the error as soon as it occurs, rather than potentially 30 lines
> later.
>
> On Tue, May 8, 2018 at 7:22 PM Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
>
>> This may be technically impractical, but it would be fantastic if we
>> could make it easier to debug Spark programs without needing to rely on
>> eager execution. Sprinkling .count() and .checkpoint() at various points
>> in my code is still a debugging technique I use, but it always makes me
>> wish Spark could point more directly to the offending transformation
>> when something goes wrong.
>>
>> Is it somehow possible to have each individual operator (is that the
>> correct term?) in a DAG include metadata pointing back to the line of
>> code that generated the operator? That way, when an action triggers an
>> error, the failing operation can point to the relevant line of code,
>> even if it's a transformation, and not just the action on the tail end
>> that triggered the error.
>>
>> I don't know how feasible this is, but addressing it would directly
>> solve the issue of linking failures to the responsible transformation,
>> as opposed to leaving the user to break up a chain of transformations
>> with several debug actions. And this would benefit new and experienced
>> users alike.
>>
>> Nick
>>
>> On Tue, May 8, 2018 at 7:09 PM, Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> I've opened SPARK-24215 to track this.
>>>
>>> On Tue, May 8, 2018 at 3:58 PM, Reynold Xin <r...@databricks.com>
>>> wrote:
>>>
>>>> Yup. Sounds great. This is something simple Spark can do that would
>>>> provide huge value to the end users.
>>>>
>>>> On Tue, May 8, 2018 at 3:53 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> Would be great if it is something more turn-key.
>>>>>
>>>>> We can easily add the __repr__ and _repr_html_ methods and behavior
>>>>> to PySpark classes. We could also add a configuration property to
>>>>> determine whether dataset evaluation is eager or not. That would
>>>>> make it turn-key for anyone running PySpark in Jupyter.
>>>>>
>>>>> For JVM languages, we could also add a dependency on jvm-repr and do
>>>>> the same thing.
>>>>>
>>>>> rb
>>>>>
>>>>> On Tue, May 8, 2018 at 3:47 PM, Reynold Xin <r...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> s/underestimated/overestimated/
>>>>>>
>>>>>> On Tue, May 8, 2018 at 3:44 PM Reynold Xin <r...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Marco,
>>>>>>>
>>>>>>> There is understanding how Spark works, and there is finding bugs
>>>>>>> early in one's own program. One can perfectly understand how Spark
>>>>>>> works and still find it valuable to get feedback asap, and that's
>>>>>>> why we built eager analysis in the first place.
>>>>>>>
>>>>>>> Also, I'm afraid you've significantly underestimated the level of
>>>>>>> technical sophistication of users. In many cases they struggle to
>>>>>>> get anything to work, and performance optimization of their
>>>>>>> programs is secondary to getting things working. As John Ousterhout
>>>>>>> says, "the greatest performance improvement of all is when a system
>>>>>>> goes from not-working to working."
>>>>>>>
>>>>>>> I really like Ryan's approach. Would be great if it is something
>>>>>>> more turn-key.
>>>>>>>
>>>>>>> On Tue, May 8, 2018 at 2:35 PM Marco Gaido <marcogaid...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I am not sure how useful this is. For students, it is important to
>>>>>>>> understand how Spark works. This can be critical in many decisions
>>>>>>>> they have to make (whether and what to cache, for instance) in
>>>>>>>> order to write performant Spark applications. An eager execution
>>>>>>>> mode can probably help them get something running more easily, but
>>>>>>>> it also lets them use Spark while knowing less about how it works,
>>>>>>>> so they are likely to write worse applications and to have more
>>>>>>>> trouble debugging any kind of problem that occurs later (in
>>>>>>>> production), which would hurt their experience with the tool.
>>>>>>>>
>>>>>>>> Moreover, as Ryan also mentioned, there are tools/ways to force
>>>>>>>> execution that help in the debugging phase. So they can achieve
>>>>>>>> the same result without much effort, but with a big difference:
>>>>>>>> they are aware of what is really happening, which may help them
>>>>>>>> later.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Marco
>>>>>>>>
>>>>>>>> 2018-05-08 21:37 GMT+02:00 Ryan Blue <rb...@netflix.com.invalid>:
>>>>>>>>
>>>>>>>>> At Netflix, we use Jupyter notebooks and consoles for interactive
>>>>>>>>> sessions. For anyone interested, this mode of interaction is
>>>>>>>>> really easy to add in Jupyter and PySpark. You would just define
>>>>>>>>> a different _repr_html_ or __repr__ method for Dataset that runs
>>>>>>>>> a take(10) or take(100) and formats the result.
>>>>>>>>>
>>>>>>>>> That way, the output of a cell or console execution always causes
>>>>>>>>> the dataframe to run and get displayed for that immediate
>>>>>>>>> feedback. But there is no change to Spark's behavior, because the
>>>>>>>>> action is run by the REPL, and only when a dataframe is the
>>>>>>>>> result of an execution, in order to display it. Intermediate
>>>>>>>>> results wouldn't be run, which gives users a way to avoid too
>>>>>>>>> many executions and would still support method chaining in the
>>>>>>>>> dataframe API (which would be horrible with an aggressive
>>>>>>>>> execution model).
>>>>>>>>>
>>>>>>>>> There are ways to do this in JVM languages as well if you are
>>>>>>>>> using a Scala or Java interpreter (see jvm-repr
>>>>>>>>> <https://github.com/jupyter/jvm-repr>). This is actually what we
>>>>>>>>> do in our Spark-based SQL interpreter to display results.
>>>>>>>>>
>>>>>>>>> rb
>>>>>>>>>
>>>>>>>>> On Tue, May 8, 2018 at 12:05 PM, Koert Kuipers
>>>>>>>>> <ko...@tresata.com> wrote:
>>>>>>>>>
>>>>>>>>>> Yeah, we run into this all the time with new hires. They will
>>>>>>>>>> send emails explaining there is an error in the .write
>>>>>>>>>> operation, and they are debugging the writing to disk, focusing
>>>>>>>>>> on that piece of code. :)
>>>>>>>>>>
>>>>>>>>>> Unrelated, but another frequent cause of confusion is cascading
>>>>>>>>>> errors, like the FetchFailedException. They will be debugging
>>>>>>>>>> the reducer task, not realizing the error happened before that
>>>>>>>>>> and the FetchFailedException is not the root cause.
>>>>>>>>>>
>>>>>>>>>> On Tue, May 8, 2018 at 2:52 PM, Reynold Xin
>>>>>>>>>> <r...@databricks.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Similar to the thread yesterday about improving ML/DL
>>>>>>>>>>> integration, I'm sending another email on what I've learned
>>>>>>>>>>> recently from Spark users. I recently talked to some educators
>>>>>>>>>>> who have been teaching Spark in their (top-tier) university
>>>>>>>>>>> classes. They are some of the most important users for adoption
>>>>>>>>>>> because of the multiplicative effect they have on the future
>>>>>>>>>>> generation.
>>>>>>>>>>>
>>>>>>>>>>> To my surprise, the single biggest ask they have is to enable
>>>>>>>>>>> an eager execution mode on all operations for teaching and
>>>>>>>>>>> debuggability:
>>>>>>>>>>>
>>>>>>>>>>> (1) Most of the students are relatively new to programming, and
>>>>>>>>>>> they need multiple iterations to get even the most basic
>>>>>>>>>>> operation right. In these cases, in order to trigger an error,
>>>>>>>>>>> they would need to explicitly add actions, which is
>>>>>>>>>>> non-intuitive.
>>>>>>>>>>>
>>>>>>>>>>> (2) If they don't add explicit actions to every operation and
>>>>>>>>>>> there is a mistake, the error pops up somewhere later, where an
>>>>>>>>>>> action is triggered. This is in a different place from the code
>>>>>>>>>>> that causes the problem, and it is difficult for students to
>>>>>>>>>>> correlate the two.
>>>>>>>>>>>
>>>>>>>>>>> I suspect a lot of Spark users in the real world also struggle
>>>>>>>>>>> in similar ways as these students. While eager execution is
>>>>>>>>>>> really not practical in big data, in learning environments or
>>>>>>>>>>> in development against small, sampled datasets it can be pretty
>>>>>>>>>>> helpful.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
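A minimal sketch of the _repr_html_ hook Ryan describes, for PySpark in
Jupyter. The take(10) limit follows his description; the function name and
the HTML rendering details are assumptions.

    import html

    from pyspark.sql import DataFrame


    def _df_repr_html_(self):
        """Render the first rows of a DataFrame as an HTML table.

        Jupyter calls _repr_html_ on the value a cell returns, so the
        take() action runs only when a DataFrame is actually displayed.
        """
        rows = self.take(10)
        header = "".join(f"<th>{html.escape(c)}</th>" for c in self.columns)
        body = "".join(
            "<tr>"
            + "".join(f"<td>{html.escape(str(v))}</td>" for v in row)
            + "</tr>"
            for row in rows
        )
        return f"<table><tr>{header}</tr>{body}</table>"


    # Attach the hook to PySpark's DataFrame class.
    DataFrame._repr_html_ = _df_repr_html_

Because the action runs in the REPL rather than inside Spark, intermediate
dataframes in a method chain stay lazy, exactly as Ryan notes.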
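Nicholas's idea of tagging each operator with the code that created it
could be prototyped on the Python side. This is a hypothetical sketch, not
a Spark API: the track_call_site decorator and the _call_site attribute are
inventions for illustration, and they cover only the Python layer, not the
DAG itself.

    import functools
    import traceback

    from pyspark.sql import DataFrame


    def track_call_site(method):
        """Wrap a DataFrame transformation so its result remembers the
        user-level line of code that created it (hypothetical helper)."""
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            result = method(self, *args, **kwargs)
            if isinstance(result, DataFrame):
                # The frame just below this wrapper is the caller.
                caller = traceback.extract_stack(limit=2)[0]
                result._call_site = f"{caller.filename}:{caller.lineno}"
            return result
        return wrapper


    # Patch a few transformations; an error handler around an action
    # could then report df._call_site instead of only the action's line.
    for name in ("select", "filter", "withColumn"):
        setattr(DataFrame, name, track_call_site(getattr(DataFrame, name)))

Spark's JVM side already records call sites for RDD operations and shows
them in the web UI, so a complete solution would probably live there rather
than in a Python wrapper.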