RDD-level partitioning information is not used to decide when to shuffle
for queries planned using Catalyst, since we have better information about
distribution from the query plan itself. Instead, look at the logic in
EnsureRequirements
<https://github.com/apache/spark/blob/06f0df6df204c4722ff8a6bf909abaa32a715c41/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala#L272>.
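The core idea is roughly this: each physical operator declares a required distribution for each of its children, and an Exchange (shuffle) is inserted only when a child's output partitioning does not already satisfy that requirement. Here is a minimal sketch of that check; the class and method names below are simplified stand-ins, not Spark's actual API:

```scala
// Simplified model of the EnsureRequirements idea (not Spark's real classes):
// an operator requires a Distribution; a child reports a Partitioning; we
// shuffle only when the partitioning does not satisfy the requirement.

sealed trait Distribution
case object UnspecifiedDistribution extends Distribution
case class ClusteredDistribution(keys: Seq[String]) extends Distribution

sealed trait Partitioning {
  def satisfies(required: Distribution): Boolean
}
case object UnknownPartitioning extends Partitioning {
  def satisfies(required: Distribution): Boolean =
    required == UnspecifiedDistribution
}
case class HashPartitioning(keys: Seq[String]) extends Partitioning {
  def satisfies(required: Distribution): Boolean = required match {
    case UnspecifiedDistribution => true
    // Hash keys must all appear among the required clustering keys.
    case ClusteredDistribution(reqKeys) =>
      keys.nonEmpty && keys.forall(reqKeys.contains)
  }
}

// Decide whether a shuffle must be inserted above one child.
def needsShuffle(childOutput: Partitioning, required: Distribution): Boolean =
  !childOutput.satisfies(required)
```

So, for example, a child already hash-partitioned on the join keys is left alone, while a child with unknown partitioning gets an Exchange inserted above it.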

We don't yet reason about equivalence classes for attributes when deciding
whether a given partitioning is valid, but #10844
<https://github.com/apache/spark/pull/10844> is a start at building that
infrastructure.
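To see why equivalence classes matter, consider a join condition like a = b: after the join, the two columns are interchangeable for partitioning purposes, but a purely syntactic check cannot see that and will insert an unnecessary shuffle. A toy illustration, with hypothetical names that are not Spark's API:

```scala
// Toy illustration of attribute-equivalence-aware partitioning checks.
// All names here are hypothetical, not Spark's actual classes.

case class ClusteredDist(keys: Set[String])
case class HashPart(keys: Set[String])

// Syntactic check: partitioning keys must literally appear among the
// required clustering keys.
def satisfiesSyntactic(p: HashPart, d: ClusteredDist): Boolean =
  p.keys.nonEmpty && p.keys.subsetOf(d.keys)

// Equivalence-aware check: map every key to a canonical representative of
// its equivalence class first, then compare.
def satisfiesWithEquiv(p: HashPart, d: ClusteredDist,
                       canon: Map[String, String]): Boolean = {
  def c(k: String): String = canon.getOrElse(k, k)
  val pk = p.keys.map(c)
  pk.nonEmpty && pk.subsetOf(d.keys.map(c))
}
```

With canon = Map("b" -> "a") (recording that a = b), data hash-partitioned on b fails the syntactic check against a requirement clustered on a, so it would be re-shuffled, while the equivalence-aware check recognizes the partitioning as valid.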
