I have a Dataset (om) which I created, repartitioned on one of its fields
(docId), and cached.  Reading the Spark documentation, I would assume the om
Dataset should be hash partitioned.  But how can I verify this?

When I do om.rdd.partitioner I get 

Option[org.apache.spark.Partitioner] = None

I thought I would have seen a HashPartitioner here.  But perhaps the two are
not equivalent.
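For what it's worth, one way I've seen to inspect this (a sketch, assuming a
SparkSession named spark and Spark 2.x) is to look at the physical plan's
output partitioning rather than the RDD partitioner, since converting a
Dataset to an RDD drops the catalyst-level partitioning information:

    import spark.implicits._

    // Hypothetical stand-in for the real om Dataset, keyed on docId.
    val om = spark.range(100)
      .map(i => (i.toString, i))
      .toDF("docId", "value")
      .repartition($"docId")
      .cache()

    // om.rdd.partitioner is None, but the plan still carries the partitioning:
    println(om.queryExecution.executedPlan.outputPartitioning)
    // e.g. something like hashpartitioning(docId#..., 200)
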

The reason I ask is that when I join this cached Dataset with another Dataset
(partitioned on the same column and cached), I see entries like the following
in the explain output, which makes me think the Dataset may have lost its
partitioner.  I also see a couple of stages in the job where each Dataset in
the join appears to be read in and shuffled out again (I assume for the hash
partitioning required by the join):

Exchange hashpartitioning(_1#6062.docId, 8)
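One thing I notice in that Exchange is that the partitioning expression is a
nested field (_1.docId) rather than a top-level column, which might be why the
planner doesn't recognize the existing partitioning.  A sketch of what I'd
expect to avoid the shuffle (assuming a second Dataset named other, which is
hypothetical here, and both sides keyed on the same top-level docId column):

    // Both sides repartitioned and cached on the exact join key.
    val left  = om.repartition($"docId").cache()
    val right = other.repartition($"docId").cache()

    val joined = left.join(right, Seq("docId"))
    joined.explain()  // check whether Exchange hashpartitioning still appears
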

Any thoughts/ideas would be appreciated.

Thanks.

Darin.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-find-the-partitioner-for-a-Dataset-tp27672.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
