Re: Diffing execution plans to understand an optimizer bug

2016-11-08 Thread Herman van Hövell tot Westerflier
Replied in the ticket.

Re: Diffing execution plans to understand an optimizer bug

2016-11-08 Thread Nicholas Chammas
SPARK-18367: limit() makes the lame walk again

Re: Diffing execution plans to understand an optimizer bug

2016-11-08 Thread Nicholas Chammas
Hmm, it doesn’t seem like I can access the output of df._jdf.queryExecution().hiveResultString() from Python, and until I can boil the issue down a bit, I’m stuck with using Python. I’ll have a go at using regexes to strip some stuff from the printed plans. The one that’s working for me to strip
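A minimal sketch of that regex-and-diff approach from Python (df_before and df_after are hypothetical DataFrames for the same query built with and without the change; reaching the plan text through df._jdf relies on py4j, and the #\d+ pattern follows Reynold's suggestion below):

    import re
    from difflib import unified_diff

    def normalized_plan(df):
        # queryExecution().toString() on the underlying Java Dataset includes
        # the parsed, analyzed, optimized, and physical plans. Strip expression
        # IDs like #123 so they don't show up as spurious differences.
        plan = df._jdf.queryExecution().toString()
        return re.sub(r"#\d+", "#x", plan)

    # df_before / df_after: the same query with and without the change.
    diff = unified_diff(normalized_plan(df_before).splitlines(),
                        normalized_plan(df_after).splitlines(),
                        lineterm="")
    print("\n".join(diff))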

Re: Diffing execution plans to understand an optimizer bug

2016-11-08 Thread Reynold Xin
If you want to peek into the internals and do crazy things, it is much easier to do it in Scala with df.queryExecution. For explain string output, you can work around the comparison simply by doing replaceAll("#\\d+", "#x") similar to the patch here:
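Reynold's suggestion is Scala, but a rough equivalent reachable from Python via py4j might look like the sketch below (the accessor names on QueryExecution are assumptions based on the Catalyst API, so treat this as illustrative rather than definitive):

    import re

    # Reach the underlying QueryExecution object through py4j.
    qe = df._jdf.queryExecution()

    analyzed  = qe.analyzed().toString()       # analyzed logical plan
    optimized = qe.optimizedPlan().toString()  # plan after optimizer rules
    physical  = qe.executedPlan().toString()   # physical plan

    # The comparison workaround: replace expression IDs like #123 with a
    # stable token, the Python analogue of replaceAll("#\\d+", "#x").
    comparable = re.sub(r"#\d+", "#x", optimized)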

Diffing execution plans to understand an optimizer bug

2016-11-08 Thread Nicholas Chammas
I’m trying to understand what I think is an optimizer bug. To do that, I’d like to compare the execution plans for a certain query with and without a certain change, to understand how that change is impacting the plan. How would I do that in PySpark? I’m working with 2.0.1, but I can use master
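For reference, the simplest way to get at the plans from PySpark 2.0.1 is probably explain(); the two DataFrame names below are placeholders for the query built with and without the change:

    # Prints the parsed, analyzed, optimized, and physical plans to stdout.
    df_with_change.explain(True)
    df_without_change.explain(True)

    # Or capture the full plan text as a string (via py4j) for saving/diffing.
    plan_text = df_with_change._jdf.queryExecution().toString()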