Re: Equivalent Parquet File Repartitioning Benefits for Join/Shuffle?

2016-10-20 Thread adam kramer

I believe what I am looking for is DataFrameWriter.bucketBy, which would allow bucketing into physical Parquet files by the desired columns. My question then is: can DataFrames/Datasets take advantage of this physical bucketing when the Parquet files are read, for something like a self-join on the
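
To make the idea behind the question concrete, here is a minimal plain-Python sketch of why bucketing by the join key could help. It is not Spark code and says nothing about what Spark's planner actually does on read. The bucket count, the sample rows, and the helper names (`bucket_rows`, `join_bucketed`, `join_naive`) are all hypothetical. The point it illustrates: if both sides are split into the same number of buckets by the same hash of the join key, matching rows always land in the same bucket number, so each bucket pair can be joined on its own without moving rows between buckets.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, must be the same on both sides


def bucket_rows(rows, num_buckets=NUM_BUCKETS):
    """Group (key, value) rows into buckets by hashing the join key."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[hash(key) % num_buckets].append((key, value))
    return buckets


def join_bucketed(left, right, num_buckets=NUM_BUCKETS):
    """Join bucket i of left only with bucket i of right (no cross-bucket traffic)."""
    out = []
    for b in range(num_buckets):
        # Build a small lookup for the right-hand rows of this bucket only.
        index = defaultdict(list)
        for key, value in right.get(b, []):
            index[key].append(value)
        for key, lvalue in left.get(b, []):
            for rvalue in index.get(key, []):
                out.append((key, lvalue, rvalue))
    return out


def join_naive(left_rows, right_rows):
    """Reference join that compares every left row with every right row."""
    return [(lk, lv, rv)
            for lk, lv in left_rows
            for rk, rv in right_rows
            if lk == rk]


# Hypothetical sample data: (join key, payload)
events = [(1, "a"), (2, "b"), (5, "c"), (1, "d")]
users = [(1, "x"), (5, "y"), (7, "z")]

bucketed = join_bucketed(bucket_rows(events), bucket_rows(users))
# The per-bucket join finds exactly the same matches as the full comparison.
assert sorted(bucketed) == sorted(join_naive(events, users))
```

In Spark itself, DataFrameWriter.bucketBy is used together with saveAsTable rather than a plain path-based save, so the bucketing metadata is recorded with the table. Whether a given read and join then avoids the shuffle is exactly the open question of this thread.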

Equivalent Parquet File Repartitioning Benefits for Join/Shuffle?

2016-10-19 Thread adam kramer

Hello All, I’m trying to improve join efficiency both within a data set (self-join) and across data sets loaded from different Parquet files, which exist primarily because of a multi-stage data ingestion environment. Are there specific benefits to shuffling efficiency (e.g. no network transmission) if the Parquet files are