Hmm, it doesn’t seem like I can access the output of df._jdf.queryExecution().hiveResultString() from Python, and until I can boil the issue down a bit, I’m stuck with using Python.
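One thing that does seem workable from Python: since explain() only prints to stdout, the printed plan can be captured by redirecting stdout. A minimal sketch (capture_stdout is a made-up helper name; a real DataFrame df with a plan is assumed where noted):

```python
import contextlib
import io

def capture_stdout(thunk):
    """Run thunk() and return everything it printed as a string.
    (capture_stdout is a made-up helper; with a real DataFrame you
    would pass e.g. lambda: df.explain(True).)"""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        thunk()
    return buf.getvalue()

# Demo with a plain print, since no SparkSession is assumed here:
plan = capture_stdout(lambda: print("== Physical Plan =="))
print(repr(plan))
# -> '== Physical Plan ==\n'
```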
I’ll have a go at using regexes to strip some stuff from the printed plans. The one that’s working for me to strip the IDs is #\d+L?.

Nick

On Tue, Nov 8, 2016 at 4:47 PM Reynold Xin <r...@databricks.com> wrote:

> If you want to peek into the internals and do crazy things, it is much
> easier to do it in Scala with df.queryExecution.
>
> For explain string output, you can work around the comparison simply by
> doing replaceAll("#\\d+", "#x")
>
> similar to the patch here:
> https://github.com/apache/spark/commit/fd90541c35af2bccf0155467bec8cea7c8865046#diff-432455394ca50800d5de508861984ca5R217
>
> On Tue, Nov 8, 2016 at 1:42 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
> I’m trying to understand what I think is an optimizer bug. To do that, I’d
> like to compare the execution plans for a certain query with and without a
> certain change, to understand how that change is impacting the plan.
>
> How would I do that in PySpark? I’m working with 2.0.1, but I can use
> master if it helps.
>
> explain()
> <http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.explain>
> is helpful but is limited in two important ways:
>
> 1. It prints to screen and doesn’t offer another way to access the
>    plan or capture it.
> 2. The printed plan includes auto-generated IDs that make diffing
>    impossible, e.g.:
>
> == Physical Plan ==
> *Project [struct(primary_key#722, person#550, dataset_name#671)
>
> Any suggestions on what to do? Any relevant JIRAs I should follow?
>
> Nick
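For reference, the ID-stripping approach discussed in this thread can be sketched in Python like so (normalize_plan is a made-up helper name; the regex #\d+L? is the one quoted above, and the sample plan line is the one from the message):

```python
import re

def normalize_plan(plan):
    """Replace auto-generated expression IDs like #722 or #722L with #x
    so two plan strings can be diffed. (normalize_plan is a made-up
    helper name; the regex is the one from this thread.)"""
    return re.sub(r"#\d+L?", "#x", plan)

# Sample plan line from the message above:
line = "*Project [struct(primary_key#722, person#550, dataset_name#671)"
print(normalize_plan(line))
# -> *Project [struct(primary_key#x, person#x, dataset_name#x)
```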