Dear Sparkers,

I'm loading data from 5 sources into DataFrames (using the official
connectors): Parquet, MongoDB, Cassandra, MySQL and CSV. I'm then joining
those DataFrames in two different orders (a rough sketch follows the list):
- mongo * cassandra * jdbc * parquet * csv (random order).
- parquet * csv * cassandra * jdbc * mongodb (optimized order).
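
For concreteness, here's roughly how I build and join them in Scala (a
sketch, not my exact code: the connector formats/options and the single
join key `id` are placeholders, and the Mongo input URI is assumed to be
set in the Spark config):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-order-test").getOrCreate()

// The five sources (paths, options and table names are placeholders):
val parquetDF   = spark.read.parquet("/data/table.parquet")
val csvDF       = spark.read.option("header", "true").csv("/data/table.csv")
val mongoDF     = spark.read.format("com.mongodb.spark.sql").load()
val cassandraDF = spark.read.format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "tbl")).load()
val jdbcDF      = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://host/db")
  .option("dbtable", "tbl").load()

// Random order:
val joinedRandom = mongoDF.join(cassandraDF, "id")
  .join(jdbcDF, "id").join(parquetDF, "id").join(csvDF, "id")

// Optimized order:
val joinedOptimized = parquetDF.join(csvDF, "id")
  .join(cassandraDF, "id").join(jdbcDF, "id").join(mongoDF, "id")
```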

The first follows an arbitrary order, whereas the second I decide based on
some optimization techniques I developed (I can provide details for
interested readers, or here if needed).

After evaluating on increasing data sizes, the optimization techniques I
developed didn't improve performance very noticeably. I inspected the
logical/physical plans of the final joined DataFrame (using
`explain(true)`). The first order was respected; the second, it turned
out, was not, and MongoDB was queried first.
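
(In case it helps: besides `explain(true)`, the optimized plan can also be
inspected programmatically; `joined` below stands for either final
DataFrame. The order of the leaf relations in the tree is what I'm reading
as the execution order.)

```scala
// Prints the parsed, analyzed, optimized and physical plans:
joined.explain(true)

// Same information, accessible as a tree; the left-most/deepest
// leaf relation is the one joined first:
println(joined.queryExecution.optimizedPlan.numberedTreeString)
```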

However, that is just how it seemed to me; I'm not very confident reading
the plans returned by `explain(true)`. Could someone help explain the
`explain(true)` output? (pasted in this gist
<https://gist.github.com/mnmami/387c24de5dca86c3b8efe170965e9dcf>). Is
there a way to enforce the given join order?
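
One workaround I'm considering, in case it's relevant to the discussion:
materializing intermediate joins so Catalyst cannot reorder across them,
e.g. with the eager `Dataset.checkpoint()` added in 2.1 (a sketch using
the same placeholder names as above; I haven't verified this preserves
the order):

```scala
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

// checkpoint() is eager by default and truncates the logical plan,
// so the optimizer cannot reorder joins across the boundary:
val step1  = parquetDF.join(csvDF, "id").checkpoint()
val step2  = step1.join(cassandraDF, "id").checkpoint()
val joined = step2.join(jdbcDF, "id").join(mongoDF, "id")
```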

I'm using Spark 2.1, so I think it doesn't include the new cost-based
optimizations (introduced in Spark 2.2).
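
(For reference, my understanding is that the 2.2 cost-based join
reordering is opt-in and needs collected statistics, roughly as below; the
table name is a placeholder, and I'm not sure it would even apply to
non-catalog sources like the Mongo/Cassandra DataFrames.)

```scala
// Both flags are off by default in Spark 2.2:
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

// CBO relies on statistics collected into the catalog:
spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS id")
```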

Regards, Grüße, Cordialement, Recuerdos, Saluti, προσρήσεις, 问候, تحياتي.
Mohamed Nadjib Mami
Research Associate @ Fraunhofer IAIS - PhD Student @ Bonn University
About me! <http://mohamednadjibmami.com>
LinkedIn <http://fr.linkedin.com/in/mohamednadjibmami/>
