Force Partitioner to use entire entry of PairRDD as key

2016-02-22 Thread jluan
I was wondering, is there a way to force something like the hash partitioner to use the entire entry of a PairRDD as the hash input rather than just the key? For example, if we have an RDD with values: PairRDD = [(1,4), (1, 3), (2, 3), (2,5), (2, 10)]. Rather than using keys 1 and 2, can we force the
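A sketch of the usual workaround (not from the thread itself): a Spark `Partitioner` only ever receives the key, so the whole pair has to be promoted into the key position first, e.g. `pairRDD.map(kv => (kv, ())).partitionBy(new HashPartitioner(4))`. The placement itself is plain `hashCode` arithmetic, mirrored here in Spark-free Scala:

```scala
// HashPartitioner-style placement: hashCode modulo numPartitions,
// with Java's possibly-negative % folded back into range.
def partitionFor(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

val data = Seq((1, 4), (1, 3), (2, 3), (2, 5), (2, 10))
val byKey  = data.map(e => partitionFor(e._1, 4)) // all key-2 pairs co-locate
val byPair = data.map(e => partitionFor(e, 4))    // pairs hash independently
```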

[MLLIB] Best way to extract RandomForest decision splits

2016-02-10 Thread jluan
I've trained a RandomForest classifier whose decisions I can print using model.toDebugString. However, I was wondering if there's a way to extract the trees programmatically, by traversing the nodes or in some other way, such that I can write my own decision file rather than just a debug
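A hedged sketch against the MLlib 1.x API (field names as of Spark 1.5): each tree exposes its root via `topNode`, and every internal `Node` carries `split` plus `leftNode`/`rightNode` children, so the structure can be walked recursively instead of parsed out of `toDebugString`:

```scala
import org.apache.spark.mllib.tree.model.{Node, RandomForestModel}

// Recursively describe one subtree. Continuous splits use `threshold`;
// categorical splits would use `split.get.categories` instead.
def describeNode(node: Node, depth: Int = 0): Seq[String] = {
  val pad = "  " * depth
  if (node.isLeaf)
    Seq(s"$pad leaf: predict ${node.predict.predict}")
  else {
    val s = node.split.get // internal nodes always carry a split
    Seq(s"$pad feature ${s.feature} <= ${s.threshold}") ++
      node.leftNode.toSeq.flatMap(describeNode(_, depth + 1)) ++
      node.rightNode.toSeq.flatMap(describeNode(_, depth + 1))
  }
}

def describeForest(model: RandomForestModel): Seq[String] =
  model.trees.toSeq.zipWithIndex.flatMap { case (tree, i) =>
    s"Tree $i:" +: describeNode(tree.topNode)
  }
```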

RangePartitioning skewed data

2016-01-25 Thread jluan
Let's say I have a dataset of (K, V) pairs where the keys are really skewed: myDataRDD = [(8, 1), (8, 13), (1,1), (2,4)] [(8, 12), (8, 15), (8, 7), (8, 6), (8, 4), (8, 3), (8, 4), (10,2)] If I applied a RangePartitioner to this set of data, say val rangePart = new RangePartitioner(4, myDataRDD) and
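A Spark-free sketch of why skew hurts here: RangePartitioner samples the keys to pick (numPartitions − 1) upper boundaries, then each record goes to the first range containing its key. Equal keys can never be split across partitions, so a dominant key keeps one partition hot regardless of the partition count. The boundary values below are hypothetical:

```scala
// Place a key into the first range whose upper bound covers it;
// anything past the last bound falls into the final partition.
def rangePartition(key: Int, upperBounds: Seq[Int]): Int = {
  val i = upperBounds.indexWhere(key <= _)
  if (i == -1) upperBounds.length else i
}

val keys = Seq(8, 8, 1, 2, 8, 8, 8, 8, 8, 8, 8, 10)
val bounds = Seq(1, 2, 8) // hypothetical sampled boundaries for 4 partitions
val placement = keys.map(rangePartition(_, bounds))
// every key 8 maps to partition 2; only key 10 reaches partition 3
```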

how garbage collection works on parallelize

2016-01-08 Thread jluan
Hi, I am curious about garbage collection of an object which gets parallelized. Say we have a really large array (say 40 GB in RAM) that we want to parallelize across our machines. I have the following function: def doSomething(): RDD[Double] = { val reallyBigArray = Array[Double[(some really
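A hedged sketch, not a verified fix: sc.parallelize stores a reference to the local collection inside the resulting RDD, so the driver-side array stays reachable for as long as that lineage is alive and cannot be garbage collected. Checkpointing after caching and materializing is the usual way to cut the lineage so the driver copy can go:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def doSomething(sc: SparkContext, n: Int): RDD[Double] = {
  val reallyBigArray = Array.fill(n)(scala.util.Random.nextDouble())
  val rdd = sc.parallelize(reallyBigArray).cache()
  rdd.checkpoint() // requires sc.setCheckpointDir(...) to have been set
  rdd.count()      // forces evaluation so the checkpoint is actually written
  rdd              // lineage now points at checkpoint files, not the array
}
```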

Re: Spark ClosureCleaner or java serializer OOM when trying to grow

2015-09-24 Thread jluan
With spark.serializer.objectStreamReset set to 1, I ran a sample Scala test case which still seems to crash at the same place. If someone could verify this independently, I would greatly appreciate it. Scala Code: -- import
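For completeness, the knobs involved (both are real Spark configuration keys; the values are illustrative, and the thread reports the first one did not help here):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Reset the JavaSerializer's object-reference table after every object,
  // so the ObjectOutputStream's backing buffers cannot grow without bound:
  .set("spark.serializer.objectStreamReset", "1")
  // Or sidestep the Java serializer entirely:
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```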

Spark ClosureCleaner or java serializer OOM when trying to grow

2015-09-23 Thread jluan
I have been stuck on this problem for the last few days: I am attempting to run random forest from MLlib; it gets through most of it but breaks when doing a mapPartitions operation. The following stack trace is shown: An error occurred while calling o94.trainRandomForestModel. :
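A hedged sketch of the knobs usually tried for this failure mode (API as of Spark 1.x; `trainingData` is an assumed RDD[LabeledPoint]): RandomForest's node-statistics aggregation is memory-hungry, and Strategy.maxMemoryInMB (default 256) caps how many nodes are trained per iteration; raising driver memory via `spark-submit --driver-memory 8g` is the other common mitigation:

```scala
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}

val strategy = Strategy.defaultStrategy(Algo.Classification)
strategy.maxDepth = 10        // shallower trees shrink per-node statistics
strategy.maxBins = 32
strategy.maxMemoryInMB = 512  // batch size of the node-training loop

// `trainingData` assumed: an RDD[LabeledPoint] prepared elsewhere.
val model = RandomForest.trainClassifier(
  trainingData, strategy, numTrees = 100,
  featureSubsetStrategy = "auto", seed = 42)
```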

DecisionTree hangs, then crashes

2015-09-17 Thread jluan
See my Stack Overflow question for better-formatted info: http://stackoverflow.com/questions/32621267/spark-1-5-0-hangs-running-randomforest I am trying to run a basic decision tree from MLlib. My spark
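For reference, the basic MLlib call in question (Spark 1.5 API; `trainingData` is an assumed RDD[LabeledPoint]). When training hangs, maxDepth and maxBins are the first parameters worth shrinking, since per-node statistics grow with both:

```scala
import org.apache.spark.mllib.tree.DecisionTree

// `trainingData` assumed: an RDD[LabeledPoint] prepared elsewhere.
val model = DecisionTree.trainClassifier(
  trainingData,
  numClasses = 2,
  categoricalFeaturesInfo = Map.empty[Int, Int], // no categorical features
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32)
```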