Hello, I am new to Apache Spark and this is my company's first Spark project. Essentially, we are using Spark to calculate models over mining data.
I am holding all the source data in a persisted RDD that we refresh periodically. When a "scenario" is passed to the Spark job (we're using Job Server), the persisted RDD is filtered down to the relevant mines. For example, we may want all mines in Chile and the 1990-2015 data for each. Many of the calculations are cumulative: when we apply user-input "adjustment factors" to a value, we also need the "flexed" value we calculated for that mine previously. To ensure that this works, the idea is to:

1) Filter the superset to the relevant mines (done).
2) Group the subset by the unique identifier for the mine. So, a group may be all the rows for mine "A" for 1990-2015.
3) Ensure that the RDD is partitioned by the mine identifier (an Integer).

It's step 3 that is confusing me. I suspect it's very easy ... do I simply use partitionBy? We're using Java if that makes any difference. Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/New-to-Spark-Paritioning-Question-tp24580.html
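[Editor's note] Steps 2 and 3 above can be sketched in the Java API. There is no `partitionByKey` in Spark; the closest fit is `JavaPairRDD.partitionBy(Partitioner)`, and in fact `groupByKey` accepts a `Partitioner` directly, which does the grouping and the partitioning in a single shuffle. A minimal sketch with hypothetical `(mineId, value)` data (the class name, data, and partition count are made up for illustration):

```java
import java.util.Arrays;

import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class MinePartitioning {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("MinePartitioning")
                .setMaster("local[2]"); // local mode for the sketch only
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical stand-in for the already-filtered subset:
        // (mineId, value) pairs keyed by the Integer mine identifier.
        JavaPairRDD<Integer, Double> rows = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>(1, 10.0),
                new Tuple2<>(1, 12.5),
                new Tuple2<>(2, 7.0)));

        // Steps 2 and 3 in one shuffle: groupByKey with an explicit
        // Partitioner hash-partitions by the mine id, so all rows for a
        // given mine land in the same partition.
        JavaPairRDD<Integer, Iterable<Double>> byMine =
                rows.groupByKey(new HashPartitioner(4));

        // The resulting RDD carries a partitioner, so later key-based
        // operations (joins, lookups) using the same partitioner avoid
        // another shuffle.
        System.out.println("partitioned by key: " + byMine.partitioner().isPresent());

        sc.stop();
    }
}
```

If the cumulative "flexed" calculation then walks each mine's rows in year order, each `Iterable` can be sorted and folded inside a single `mapValues` call, since all of a mine's rows are now co-located in one partition.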