Hello, I am new to Apache Spark and this is my company's first Spark project. Essentially, we are using Spark to calculate models over mining data.
I am holding all the source data in a persisted RDD that we refresh periodically. When a "scenario" is passed to the Spark job (we're using Job Server), the persisted RDD is filtered down to the relevant mines. For example, we may want all mines in Chile and the 1990-2015 data for each. Many of the calculations are cumulative: when we apply user-input "adjustment factors" to a value, we also need the "flexed" value we calculated for that mine previously. To ensure that this works, the idea is to:

1) Filter the superset to the relevant mines (done).
2) Group the subset by the unique identifier for the mine. So, a group may be all the rows for mine "A" for 1990-2015.
3) Ensure that the RDD is partitioned by the mine identifier (an Integer).

It's step 3 that is confusing me. I suspect it's very easy ... do I simply use partitionBy? We're using Java if that makes any difference. Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/New-to-Spark-Paritioning-Question-tp24580.html
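[Editor's note] Steps 2 and 3 above can be sketched in the Java API. There is no `partitionByKey` in Spark; the closest fit is `JavaPairRDD.partitionBy(Partitioner)`, and in fact `groupByKey` accepts a `Partitioner` directly, which does the grouping and the partitioning in a single shuffle. A minimal sketch with hypothetical `(mineId, value)` data (the class name, data, and partition count are made up for illustration):

```java
import java.util.Arrays;

import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class MinePartitioning {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("MinePartitioning")
                .setMaster("local[2]"); // local mode for the sketch only
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical stand-in for the already-filtered subset:
        // (mineId, value) pairs keyed by the Integer mine identifier.
        JavaPairRDD<Integer, Double> rows = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>(1, 10.0),
                new Tuple2<>(1, 12.5),
                new Tuple2<>(2, 7.0)));

        // Steps 2 and 3 in one shuffle: groupByKey with an explicit
        // Partitioner hash-partitions by the mine id, so all rows for a
        // given mine land in the same partition.
        JavaPairRDD<Integer, Iterable<Double>> byMine =
                rows.groupByKey(new HashPartitioner(4));

        // The resulting RDD carries a partitioner, so later key-based
        // operations (joins, lookups) using the same partitioner avoid
        // another shuffle.
        System.out.println("partitioned by key: " + byMine.partitioner().isPresent());

        sc.stop();
    }
}
```

If the cumulative "flexed" calculation then walks each mine's rows in year order, each `Iterable` can be sorted and folded inside a single `mapValues` call, since all of a mine's rows are now co-located in one partition.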