I believe what I am looking for is DataFrameWriter.bucketBy, which would allow bucketing into physical Parquet files by the desired columns. My question then would be: can DataFrames/Datasets take advantage of this physical bucketing when reading the Parquet files back, for something like a self-join on the bucketed columns?
Hello All,

I'm trying to improve join efficiency within a data set (self-join) and across data sets loaded from different Parquet files, primarily due to a multi-stage data ingestion environment.

Are there specific benefits to shuffling efficiency (e.g. no network transmission) if the Parquet files are written bucketed on the join columns?
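To make concrete what I mean by bucketing avoiding the shuffle, here is a minimal plain-Python sketch of the principle (not Spark's implementation; Spark internally uses Murmur3 hashing rather than Python's `hash`, and persists buckets via `df.write.bucketBy(n, "col").saveAsTable(...)`). The point is just that when two tables are bucketed with the same hash function and bucket count on the join key, bucket i of one table only ever needs to meet bucket i of the other, so no cross-bucket (cross-node) data movement is required:

```python
# Illustrative sketch: co-bucketed tables can be joined bucket-by-bucket.
# All names here (write_bucketed, bucketwise_join) are hypothetical helpers
# for illustration, not Spark APIs.

NUM_BUCKETS = 4

def bucket_of(key, num_buckets=NUM_BUCKETS):
    # Stand-in for Spark's internal hash partitioning (Spark uses Murmur3;
    # the co-location argument is the same for any consistent hash).
    return hash(key) % num_buckets

def write_bucketed(rows, key_fn):
    """Route each row to a bucket 'file' by the hash of its join key."""
    buckets = {i: [] for i in range(NUM_BUCKETS)}
    for row in rows:
        buckets[bucket_of(key_fn(row))].append(row)
    return buckets

def bucketwise_join(buckets_a, buckets_b, key_a, key_b):
    """Join bucket i of A only against bucket i of B.

    Because matching keys always hash to the same bucket index, no
    row ever needs to move between buckets (i.e. no shuffle)."""
    out = []
    for i in range(NUM_BUCKETS):
        index = {}
        for row in buckets_b[i]:
            index.setdefault(key_b(row), []).append(row)
        for row in buckets_a[i]:
            for match in index.get(key_a(row), []):
                out.append((row, match))
    return out

orders = [("u1", "order-1"), ("u2", "order-2"), ("u1", "order-3")]
users = [("u1", "Alice"), ("u2", "Bob")]

b_orders = write_bucketed(orders, key_fn=lambda r: r[0])
b_users = write_bucketed(users, key_fn=lambda r: r[0])

joined = bucketwise_join(b_orders, b_users, lambda r: r[0], lambda r: r[0])
# Every order finds its user even though each bucket was joined in isolation.
assert len(joined) == 3
```

The same reasoning applies to a self-join: both "sides" share one bucketing, so each bucket can be self-joined locally.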