That seems like it could work, although I don't think `partitionByKey` is a
thing, at least for RDDs (the closest I know of is `partitionBy` on a pair
RDD). You might be able to merge step #2 and step #3 into one step by using
the `reduceByKey` overload that takes a Partitioner implementation:

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
(Partitioner: http://spark.apache.org/docs/latest/api/scala/org/apache/spark/Partitioner.html)

Merge the values for each key using an associative reduce function. This
will also perform the merging locally on each mapper before sending results
to a reducer, similarly to a "combiner" in MapReduce.
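
For example, something along these lines (a rough, untested sketch; the
SparkContext setup, `mineRows`, the Double value type, and the 100-partition
bound are all stand-ins for whatever you actually have):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(
  new SparkConf().setAppName("mine-example").setMaster("local[*]"))

// Hypothetical (mineId, value) pairs after the filtering in step #1.
val mineRows: RDD[(Int, Double)] =
  sc.parallelize(Seq((1, 10.0), (1, 2.5), (2, 7.0)))

// Steps #2 and #3 in one pass: values for the same mineId are merged
// with the reduce function and land in the partition the partitioner
// assigns to that key.
val reduced: RDD[(Int, Double)] =
  mineRows.reduceByKey(new HashPartitioner(100), (a, b) => a + b)

(If you actually need the rows grouped rather than reduced, `groupByKey`
also has an overload that takes a Partitioner.)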

The tricky part might be getting the partitioner to know the number of
partitions, which I think it needs to know upfront via `abstract def
numPartitions: Int`. The `HashPartitioner`, for example, takes that number
as a constructor argument; you could use it with an upper-bound size if you
don't mind empty partitions. Otherwise you might have to do some extra work
to extract the exact number of keys if it's not readily available.
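
If you really do need exactly one key per partition, note that
`HashPartitioner` won't guarantee it even with a generous upper bound,
since two keys can hash to the same bucket. A custom Partitioner built
from the distinct keys would, at the cost of an extra job to collect
them upfront. A rough, untested sketch (reusing the hypothetical
`mineRows` from above):

import org.apache.spark.Partitioner

// Hypothetical partitioner: exactly one partition per distinct key.
// Needs the full key set before the shuffle runs.
class ExactKeyPartitioner(keys: Array[Int]) extends Partitioner {
  private val index: Map[Int, Int] = keys.zipWithIndex.toMap
  override def numPartitions: Int = keys.length
  override def getPartition(key: Any): Int = index(key.asInstanceOf[Int])
}

// One extra job to learn the keys, then partition exactly.
val distinctKeys = mineRows.keys.distinct().collect()
val exact =
  mineRows.reduceByKey(new ExactKeyPartitioner(distinctKeys), (a, b) => a + b)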

Aside: what is the requirement to have each partition only contain the data
related to one key?

On Fri, Sep 4, 2015 at 11:06 AM, mmike87 <mwri...@snl.com> wrote:

> Hello, I am new to Apache Spark and this is my company's first Spark
> project.
> Essentially, we are calculating models dealing with Mining data using
> Spark.
>
> I am holding all the source data in a persisted RDD that we will refresh
> periodically. When a "scenario" is passed to the Spark job (we're using Job
> Server) the persisted RDD is filtered to the relevant mines. For example,
> we
> may want all mines in Chile and the 1990-2015 data for each.
>
> Many of the calculations are cumulative; that is, when we apply user-input
> "adjustment factors" to a value, we also need the "flexed" value we
> calculated for that mine previously.
>
> To ensure that this works, the idea is to:
>
> 1) Filter the superset to relevant mines (done)
> 2) Group the subset by the unique identifier for the mine. So, a group may
> be all the rows for mine "A" for 1990-2015
> 3) I then want to ensure that the RDD is partitioned by the Mine Identifier
> (an Integer).
>
> It's step 3 that is confusing me. I suspect it's very easy ... do I simply
> use PartitionByKey?
>
> We're using Java if that makes any difference.
>
> Thanks!
>


-- 
*Richard Marscher*
Software Engineer
Localytics
Localytics.com <http://localytics.com/> | Our Blog <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> | Facebook <http://facebook.com/localytics> | LinkedIn <http://www.linkedin.com/company/1148792?trk=tyah>
