That seems like it could work, although `partitionByKey` isn't a thing, at least not on RDD. You might be able to merge step #2 and step #3 into one step by using the `reduceByKey` overload that takes in a Partitioner implementation:

    def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]

(Partitioner: http://spark.apache.org/docs/latest/api/scala/org/apache/spark/Partitioner.html
RDD: http://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html)

Per the docs: "Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a 'combiner' in MapReduce."

The tricky part might be getting the partitioner to know the number of partitions, which it needs to know upfront via `abstract def numPartitions: Int`. `HashPartitioner`, for example, takes that number as a constructor argument; you could use it with an upper-bound size if you don't mind empty partitions (or the occasional hash collision putting two keys in one partition). Otherwise you might have to do some extra work to extract the exact number of keys if it isn't readily available. A rough sketch of both pieces is below.
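To make it concrete, here's a Scala sketch under some assumptions: `mineRows` stands in for your already-filtered pair RDD from step #1, the Int key / Double value types and the sum reduce function are placeholders, and `ExactKeyPartitioner` is just one way to write a custom Partitioner that guarantees one key per partition (which a plain `HashPartitioner(n)` does not, since two keys can hash to the same partition):

    import org.apache.spark.Partitioner
    import org.apache.spark.rdd.RDD

    // Sends each distinct key to its own partition by looking it up in a
    // precomputed key -> partition index map.
    class ExactKeyPartitioner(keyIndex: Map[Int, Int]) extends Partitioner {
      override def numPartitions: Int = keyIndex.size
      override def getPartition(key: Any): Int =
        keyIndex(key.asInstanceOf[Int])
    }

    // mineRows: the filtered RDD of (mineId, value) pairs from step #1.
    def reduceByMine(mineRows: RDD[(Int, Double)]): RDD[(Int, Double)] = {
      // Extra pass to learn the key set upfront, since the Partitioner has
      // to report numPartitions before the shuffle runs.
      val keyIndex = mineRows.keys.distinct().collect().zipWithIndex.toMap

      // Steps #2 and #3 collapse into one shuffle: reduceByKey with an
      // explicit Partitioner both combines and places the data.
      mineRows.reduceByKey(new ExactKeyPartitioner(keyIndex), _ + _)
    }

The distinct/collect pass is the "extract the exact number of keys" work; if an upper bound is acceptable you could skip it and just pass `new HashPartitioner(upperBound)` instead.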
Aside: what is the requirement to have each partition only contain the data related to one key?

On Fri, Sep 4, 2015 at 11:06 AM, mmike87 <mwri...@snl.com> wrote:

> Hello, I am new to Apache Spark and this is my company's first Spark
> project. Essentially, we are calculating models dealing with mining data
> using Spark.
>
> I am holding all the source data in a persisted RDD that we will refresh
> periodically. When a "scenario" is passed to the Spark job (we're using
> Job Server), the persisted RDD is filtered to the relevant mines. For
> example, we may want all mines in Chile and the 1990-2015 data for each.
>
> Many of the calculations are cumulative: when we apply user-input
> "adjustment factors" to a value, we also need the "flexed" value we
> calculated for that mine previously.
>
> To ensure that this works, the idea is to:
>
> 1) Filter the superset to the relevant mines (done)
> 2) Group the subset by the unique identifier for the mine. So, a group
> may be all the rows for mine "A" for 1990-2015.
> 3) Ensure that the RDD is partitioned by the mine identifier (an
> Integer).
>
> It's step 3 that is confusing me. I suspect it's very easy ... do I
> simply use PartitionByKey?
>
> We're using Java if that makes any difference.
>
> Thanks!

--
Richard Marscher
Software Engineer
Localytics