Hi All, I want to hash-partition (and then cache) a SchemaRDD so that the partitioning is based on the hash of the values of one column (the "ID" column in my case).
For example, if my table's "ID" column has the values 1,2,3,4,5,6,7,8,9 and spark.sql.shuffle.partitions is set to 3, then there should be 3 partitions, and all tuples with ID=1 should sit together in one particular partition.

My actual use case: I always get queries that join two cached tables on the ID column, so Spark first shuffle-partitions both tables by ID and then applies the join. I want to avoid that repeated ID-based partitioning by doing it once as a preprocessing step (and then caching the result).

Thanks in advance.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-tp20350.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
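To illustrate the co-location guarantee being asked for, here is a minimal pure-Python sketch of the rule Spark's HashPartitioner applies (the target partition is the key's hash taken modulo the number of partitions). The function name `partition_for` and the sample ID list are illustrative assumptions, not Spark API; in Spark itself you would key the underlying RDD by the ID column, call `partitionBy` with a `HashPartitioner`, and cache the result.

```python
def partition_for(key, num_partitions):
    # Same rule as Spark's HashPartitioner: non-negative hash(key) mod numPartitions.
    # (Python's % already yields a non-negative result for a positive modulus.)
    return hash(key) % num_partitions

# Illustrative data: duplicate IDs included to show co-location of equal keys.
ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 4, 7]
num_partitions = 3

partitions = {}
for i in ids:
    partitions.setdefault(partition_for(i, num_partitions), []).append(i)

# Every occurrence of a given ID lands in the same partition, so a join keyed
# on ID needs no further shuffle once both tables are partitioned this way.
for p, members in sorted(partitions.items()):
    print(p, members)
```

The point of preprocessing both tables with the same partitioner and partition count is that equal IDs end up in the same partition of each table, which is exactly the property that lets a join proceed without re-shuffling.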