Thanks Ben for your answer. I’ll explore what happens under the hoods in a
data frame.
Regarding the ability to split a large RDD into n RDDs without requiring n
passes to the large RDD. Can partitionBy() help? If I partition by a key that
corresponds to the the split criteria (i..e client
Adding an example (very raw), to see if my understanding is correct:
val repartitioned = bidRdd.partitionBy(new Partitioner {
def numPartitions: Int = 100
def getPartition(clientId: Any): Int = hash(clientId) % 100
}
val cachedRdd = repartitioned.cache()
val client1Rdd =
Hi Maximo —
This is a relatively naive answer, but I would consider structuring the RDD
into a DataFrame, then saving the 'splits' using something like
DataFrame.write.parquet(hdfs_path, byPartition=('client')). You could then
read a DataFrame from each resulting parquet directory and do your
Hi,
I have a RDD that needs to be split (say, by client) in order to train n
models (i.e. one for each client). Since most of the classifiers that come with
ml-lib only can accept an RDD as input (and cannot build multiple models in one
pass - as I understand it), the only way to train n