How do you find the partitioner for a Dataset? I have a Dataset (om) that I created and repartitioned on one of its fields (docId). From the documentation, I would expect the om Dataset to be hash partitioned. But how can I verify this?
When I run om.rdd.partitioner I get Option[org.apache.spark.Partitioner] = None, but perhaps that is not equivalent. I ask because when I use this cached Dataset in a join with another Dataset (partitioned on the same column and also cached), I see lines like the following in the explain output, which makes me think the Dataset may have lost its partitioner:

Exchange hashpartitioning(_1#6062.docId, 8)

I also see a couple of stages in the job where each Dataset in the join appears to be read in and shuffled out again (I assume for the hash partitioning the join requires).

Any thoughts/ideas would be appreciated.

Thanks,
Darin.
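For context on what "hash partitioned on docId" means in the question above, here is a minimal, self-contained sketch in plain Scala (an assumption on my part, not Spark's actual Dataset code path) of how a HashPartitioner-style scheme deterministically maps a key to one of N partitions; two rows with the same docId always land in the same partition, which is why a join on that column can avoid a shuffle when both sides already share the same partitioning:

```scala
// Sketch of HashPartitioner-style key-to-partition assignment.
// This mirrors the general idea (non-negative hashCode modulo numPartitions);
// it is illustrative only, not Spark's internal implementation.
object HashPartitionSketch {
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    // hashCode can be negative, so shift negative remainders into [0, numPartitions)
    if (raw < 0) raw + numPartitions else raw
  }

  def main(args: Array[String]): Unit = {
    val docIds = Seq("doc-1", "doc-2", "doc-3")
    // Same docId always maps to the same partition for a fixed partition count
    docIds.foreach { id =>
      println(s"$id -> partition ${partitionFor(id, 8)}")
    }
  }
}
```

Note that the Exchange hashpartitioning(...) node in the explain output indicates Spark is (re)doing exactly this kind of assignment at query time, i.e. shuffling rather than reusing an existing layout.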