Scaling problem in RandomForest?

2015-03-11 Thread insperatum
Hi, the Random Forest implementation (1.2.1) is reproducibly crashing when I increase the depth to 20. I generate random synthetic data (36 workers, 1,000,000 examples per worker, 30 features per example) as follows: val data = sc.parallelize(1 to 36, 36).mapPartitionsWithIndex((i, _) => {
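
A minimal, self-contained sketch of the setup described above, assuming a Spark shell (sc is the SparkContext); only the partition count, examples per worker, feature count, and maxDepth come from the post, while the label scheme, feature distribution, and tree count are illustrative assumptions:

import scala.util.Random
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

// 36 partitions, 1,000,000 random examples each, 30 features per example.
val data = sc.parallelize(1 to 36, 36).mapPartitionsWithIndex((i, _) => {
  val rng = new Random(i)
  Iterator.fill(1000000)(LabeledPoint(
    rng.nextInt(2).toDouble,                         // binary label (assumed)
    Vectors.dense(Array.fill(30)(rng.nextDouble()))))
}).cache()

// Training reportedly succeeds with maxDepth = 15 but crashes at maxDepth = 20.
val model = RandomForest.trainClassifier(
  data,
  2,                // numClasses
  Map[Int, Int](),  // categoricalFeaturesInfo (no categorical features)
  50,               // numTrees (assumed; not stated in the post)
  "auto",           // featureSubsetStrategy
  "gini",           // impurity
  20,               // maxDepth
  32)               // maxBins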

Requested array size exceeds VM limit

2015-02-23 Thread insperatum
Hi, I'm using MLlib to train a random forest. It's working fine to depth 15, but if I use depth 20 I get a *java.lang.OutOfMemoryError: Requested array size exceeds VM limit* on the driver, from the collectAsMap operation in DecisionTree.scala, around line 642. It doesn't happen until a good hour
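
Not from the thread itself, but a rough back-of-the-envelope for why depth 20 behaves so differently from depth 15: maxDepth bounds a full binary tree, so the per-tree state gathered on the driver can grow roughly as 2^depth.

// A tree of depth d has at most 2^(d+1) - 1 nodes, so going from depth 15
// to depth 20 is roughly a 32x increase in per-tree bookkeeping.
def maxNodes(depth: Int): Long = (1L << (depth + 1)) - 1
maxNodes(15)  // 65535   (~64K nodes per tree)
maxNodes(20)  // 2097151 (~2M nodes per tree)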

Caching RDDs with shared memory - bug or feature?

2014-12-09 Thread insperatum
If all RDD elements within a partition contain pointers to a single shared object, Spark persists the RDD as expected when it is small. However, if the RDD has more than *200 elements* then Spark reports requiring much more memory than it actually does. This becomes a problem for large RDDs, as Spark
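
A minimal sketch of the pattern being described (class and field names are illustrative, not from the post): every element in a partition holds a reference to one large object that is built once per partition.

import org.apache.spark.storage.StorageLevel

case class Element(s: String)              // each element just points at the shared string

val rdd = sc.parallelize(1 to 1, 1).mapPartitions { _ =>
  val shared = "a" * 50000000              // one large (~100MB) string, built once per partition
  Iterator.fill(1000)(Element(shared))     // 1000 elements all referencing the same object
}
rdd.persist(StorageLevel.MEMORY_ONLY)
rdd.count()
// Only one copy of the string exists on the heap, but once the partition
// grows past a couple of hundred elements the reported storage size behaves
// as if every element carried its own copy, per the symptom described above.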

RDD with object shared across elements within a partition. Magic number 200?

2014-11-22 Thread insperatum
Hi all, I am trying to persist a Spark RDD in which the elements of each partition all share access to a single, large object. However, this object seems to get stored in memory several times. Reducing my problem down to the toy case of just a single partition with only 200 elements: *val*
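
A hypothetical reconstruction of the toy case (the original code is cut off above; names and sizes here are illustrative): a single partition whose 200 elements all point at one large array, cached and then inspected through the storage info.

case class Wrapper(data: Array[Byte])           // illustrative element type

val toy = sc.parallelize(1 to 1, 1).mapPartitions { _ =>
  val big = new Array[Byte](100000000)          // one ~100MB object per partition
  Iterator.fill(200)(Wrapper(big))              // 200 elements sharing it
}.cache()
toy.count()

// Reported in-memory size of the cached RDD; with up to 200 elements this tracks
// the true footprint, but reportedly balloons once the element count goes higher.
sc.getRDDStorageInfo.foreach(info => println(s"${info.name}: ${info.memSize} bytes"))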

Re: RDD with object shared across elements within a partition. Magic number 200?

2014-11-22 Thread insperatum
Some more details: Adding a println to the function reveals that it is indeed called only once. Furthermore, running rdd.map(_.s.hashCode).min == rdd.map(_.s.hashCode).max // returns true ...reveals that all 1000 elements do indeed point to the same object, and so the data structure
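
A compact way to run the same check described above (assuming the toy rdd from the earlier post, whose elements expose a shared field s): identical hash codes across the whole RDD are consistent with every element referencing one object, and a per-partition identity check makes the comparison stricter.

// All elements report the same hashCode for the shared field, which is
// consistent with them pointing at a single object in each executor JVM.
val hashes = rdd.map(_.s.hashCode)
val allSame = hashes.min == hashes.max           // true in the reported case

// Stricter per-partition check: compare object identity rather than hashCode.
val allIdentical = rdd.mapPartitions { it =>
  val elems = it.toSeq
  Iterator(elems.isEmpty || elems.forall(_.s eq elems.head.s))
}.reduce(_ && _)                                 // true when each partition shares one object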