Hmm, it doesn’t seem like I can access the output of df._jdf.queryExecution().hiveResultString() from Python, and until I can boil the issue down a bit, I’m stuck with using Python.
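One thing that does seem workable from Python: since explain() only prints to stdout, the printed plan can be captured by redirecting stdout. A minimal sketch (capture_stdout is a made-up helper name; a real DataFrame df with a plan is assumed where noted):

```python
import contextlib
import io

def capture_stdout(thunk):
    """Run thunk() and return everything it printed as a string.
    (capture_stdout is a made-up helper; with a real DataFrame you
    would pass e.g. lambda: df.explain(True).)"""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        thunk()
    return buf.getvalue()

# Demo with a plain print, since no SparkSession is assumed here:
plan = capture_stdout(lambda: print("== Physical Plan =="))
print(repr(plan))
# -> '== Physical Plan ==\n'
```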
I’ll have a go at using regexes to strip some stuff from the printed plans. The one that’s working for me to strip the IDs is #\d+L?.

Nick

On Tue, Nov 8, 2016 at 4:47 PM Reynold Xin <r...@databricks.com> wrote:

> If you want to peek into the internals and do crazy things, it is much
> easier to do it in Scala with df.queryExecution.
>
> For explain string output, you can work around the comparison simply by
> doing replaceAll("#\\d+", "#x")
>
> similar to the patch here:
> https://github.com/apache/spark/commit/fd90541c35af2bccf0155467bec8cea7c8865046#diff-432455394ca50800d5de508861984ca5R217
>
> On Tue, Nov 8, 2016 at 1:42 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
> I’m trying to understand what I think is an optimizer bug. To do that, I’d
> like to compare the execution plans for a certain query with and without a
> certain change, to understand how that change is impacting the plan.
>
> How would I do that in PySpark? I’m working with 2.0.1, but I can use
> master if it helps.
>
> explain()
> <http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.explain>
> is helpful but is limited in two important ways:
>
> 1. It prints to screen and doesn’t offer another way to access the
>    plan or capture it.
> 2. The printed plan includes auto-generated IDs that make diffing
>    impossible, e.g.:
>
> == Physical Plan ==
> *Project [struct(primary_key#722, person#550, dataset_name#671)
>
> Any suggestions on what to do? Any relevant JIRAs I should follow?
>
> Nick
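For reference, the ID-stripping approach discussed in this thread can be sketched in Python like so (normalize_plan is a made-up helper name; the regex #\d+L? is the one quoted above, and the sample plan line is the one from the message):

```python
import re

def normalize_plan(plan):
    """Replace auto-generated expression IDs like #722 or #722L with #x
    so two plan strings can be diffed. (normalize_plan is a made-up
    helper name; the regex is the one from this thread.)"""
    return re.sub(r"#\d+L?", "#x", plan)

# Sample plan line from the message above:
line = "*Project [struct(primary_key#722, person#550, dataset_name#671)"
print(normalize_plan(line))
# -> *Project [struct(primary_key#x, person#x, dataset_name#x)
```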