Persisting sorted parquet tables for future sort merge joins

Jason Tue, 25 Aug 2015 17:12:08 -0700

I want to persist a large _sorted_ table to Parquet on S3 and then read
this in and join it using the Sorted Merge Join strategy against another
large sorted table.


The problem is: even though I sort these tables on the join key beforehand,
once I persist them to Parquet, they lose the information about their
sortedness. Is there anyway to hint to Spark that they do not need to be
resorted the next time I read them in?

I've been trying this on 1.5 and I keep getting plans looking like:
[== Physical Plan ==]
[TungstenProject [pos#28400,workf...#28399]]
[ SortMergeJoin [CHROM#28403,pos#28400], [CHROM#28399,pos#28332]]
[  TungstenSort [CHROM#28403 ASC,pos#28400 ASC], false, 0]
[   TungstenExchange hashpartitioning(CHROM#28403,pos#28400)]
[    ConvertToUnsafe]
[     Scan ParquetRelation[file:/....sorted.parquet][pos#284....8424]]
[  TungstenSort [CHROM#28399 ASC,pos#28332 ASC], false, 0]
[   TungstenExchange hashpartitioning(CHROM#28399,pos#28332)]
[    ConvertToUnsafe]
[     Scan ParquetRelation[file:....exploded_sorted.parquet][pos#2.....399]]

Thanks,
Jason

Persisting sorted parquet tables for future sort merge joins

Reply via email to