Re: Random Forest MLlib

2015-09-10 Thread Maximo Gurmendez
Hi Yasemin, We had the same question and found this: https://issues.apache.org/jira/browse/SPARK-6884 Thanks, Maximo On Sep 10, 2015, at 9:09 AM, Yasemin Kaya wrote: Hi, I am using the Random Forest algorithm for a recommendation system. I get users

Re: Partitioning a RDD for training multiple classifiers

2015-09-09 Thread Maximo Gurmendez
g' DataFrame? Basically, the equivalent of writing by partition and creating a DataFrame for each result, but skipping the HDFS step. On Tue, Sep 8, 2015 at 10:47 AM, Maximo Gurmendez <mgurmen...@dataxu.com> wrote: Hi, I have a RDD that needs to

Re: Partitioning a RDD for training multiple classifiers

2015-09-09 Thread Maximo Gurmendez
ly in bigRdd) 2) The caching happens in a way that preserves the partitioning by client Id (and the locality) Thanks, Maximo PS: I am aware that this might cause imbalance of data, but I can probably mitigate that with a smarter partitioner. On Sep 9, 2015, at 9:30 AM, Maximo Gurmendez <mg

Partitioning a RDD for training multiple classifiers

2015-09-08 Thread Maximo Gurmendez
Hi, I have a RDD that needs to be split (say, by client) in order to train n models (i.e. one for each client). Since most of the classifiers that come with MLlib can only accept an RDD as input (and, as I understand it, cannot build multiple models in one pass), the only way to train n

Hadoop Distributed Cache

2014-09-10 Thread Maximo Gurmendez
Hi, As part of SparkContext.newAPIHadoopRDD(), would Spark support an InputFormat that uses Hadoop’s distributed cache? Thanks, Máximo