Re: Partitioning a RDD for training multiple classifiers

2015-09-09 Thread Maximo Gurmendez
Thanks Ben for your answer. I’ll explore what happens under the hoods in a data frame. Regarding the ability to split a large RDD into n RDDs without requiring n passes to the large RDD. Can partitionBy() help? If I partition by a key that corresponds to the the split criteria (i..e client

Re: Partitioning a RDD for training multiple classifiers

2015-09-09 Thread Maximo Gurmendez
Adding an example (very raw), to see if my understanding is correct: val repartitioned = bidRdd.partitionBy(new Partitioner { def numPartitions: Int = 100 def getPartition(clientId: Any): Int = hash(clientId) % 100 } val cachedRdd = repartitioned.cache() val client1Rdd =

Re: Partitioning a RDD for training multiple classifiers

2015-09-08 Thread Ben Tucker
Hi Maximo — This is a relatively naive answer, but I would consider structuring the RDD into a DataFrame, then saving the 'splits' using something like DataFrame.write.parquet(hdfs_path, byPartition=('client')). You could then read a DataFrame from each resulting parquet directory and do your

Partitioning a RDD for training multiple classifiers

2015-09-08 Thread Maximo Gurmendez
Hi, I have a RDD that needs to be split (say, by client) in order to train n models (i.e. one for each client). Since most of the classifiers that come with ml-lib only can accept an RDD as input (and cannot build multiple models in one pass - as I understand it), the only way to train n