If you want to peek into the internals and do crazy things, it is much
easier to do that in Scala with df.queryExecution.
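
For example, a minimal sketch from the Scala shell (the example DataFrame
and column names are just for illustration; exact plan contents vary by
query and Spark version):

    val df = spark.range(10).selectExpr("id", "id * 2 AS doubled")

    // QueryExecution exposes each planning stage as a programmatic
    // object, not just printed text.
    val qe = df.queryExecution
    val analyzed  = qe.analyzed      // resolved logical plan
    val optimized = qe.optimizedPlan // logical plan after optimizer rules
    val physical  = qe.executedPlan  // final physical plan

    // Each plan is a tree you can walk or render however you like.
    println(optimized.numberedTreeString)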

For the explain string output, you can work around the comparison problem
simply by doing replaceAll("#\\d+", "#x") to normalize the auto-generated
expression IDs, similar to the patch here:
https://github.com/apache/spark/commit/fd90541c35af2bccf0155467bec8cea7c8865046#diff-432455394ca50800d5de508861984ca5R217
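
A minimal sketch of that workaround, capturing the plan as a string and
masking the IDs before diffing (the helper name and the two DataFrames are
hypothetical):

    import org.apache.spark.sql.DataFrame

    // Render the physical plan and replace every auto-generated ID
    // (e.g. primary_key#722) with a stable placeholder (primary_key#x).
    def normalizedPlan(df: DataFrame): String =
      df.queryExecution.executedPlan.toString.replaceAll("#\\d+", "#x")

    // With the IDs masked, the two plans can be compared directly.
    val samePlan = normalizedPlan(dfBefore) == normalizedPlan(dfAfter)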



On Tue, Nov 8, 2016 at 1:42 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> I’m trying to understand what I think is an optimizer bug. To do that, I’d
> like to compare the execution plans for a certain query with and without a
> certain change, to understand how that change is impacting the plan.
>
> How would I do that in PySpark? I’m working with 2.0.1, but I can use
> master if it helps.
>
> explain()
> <http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.explain>
> is helpful but is limited in two important ways:
>
>    1. It prints to screen and doesn’t offer another way to access the
>    plan or capture it.
>    2. The printed plan includes auto-generated IDs that make diffing
>    impossible. e.g.
>
>     == Physical Plan ==
>     *Project [struct(primary_key#722, person#550, dataset_name#671)
>
>
> Any suggestions on what to do? Any relevant JIRAs I should follow?
>
> Nick
>
