Hi, I'm looking for benchmarks on joining data frames where most of the data is in HDFS (e.g. in Parquet) and some "reference" data or "metadata" still lives in an RDBMS. I am only interested in the very first join, before any caching happens. I assume there will be a loss of parallelism, because JDBCRDD is probably bottlenecked by the maximum number of parallel connections the database server can handle.
Has anyone run measurements or benchmarks for this?