Can you try reducing maxBins? That reduces communication (at the cost of coarser discretization of continuous features).
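To see the tradeoff concretely, here is a toy equal-width binning sketch. This is *not* Spark's actual discretization (MLlib picks candidate split thresholds from approximate quantiles); the values and the helper are made up purely to illustrate why fewer bins means coarser candidate thresholds:

```java
public class BinningDemo {
    // Assigns x (in [min, max]) to one of maxBins equal-width bins.
    // Values that fall in the same bin can no longer be separated by a split.
    static int bin(double x, double min, double max, int maxBins) {
        int b = (int) ((x - min) / (max - min) * maxBins);
        return Math.min(b, maxBins - 1); // clamp x == max into the last bin
    }

    public static void main(String[] args) {
        // With 32 bins, 0.50 and 0.56 are still distinguishable;
        // with 4 bins they collapse into the same bin.
        System.out.println(bin(0.50, 0, 1, 32) + " vs " + bin(0.56, 0, 1, 32));
        System.out.println(bin(0.50, 0, 1, 4) + " vs " + bin(0.56, 0, 1, 4));
    }
}
```

So lowering maxBins shrinks the per-feature statistics that workers ship around, at the cost of merging nearby feature values into the same candidate split.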
On Fri, Apr 1, 2016 at 11:32 AM, Joseph Bradley <jos...@databricks.com> wrote:

> In my experience, 20K is a lot but often doable; 2K is easy; 200 is
> small. Communication scales linearly in the number of features.
>
> On Thu, Mar 31, 2016 at 6:12 AM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote:
>
>> Joseph,
>>
>> Correction, there are 20k features. Is that still a lot?
>> What number of features can be considered normal?
>>
>> --
>> Be well!
>> Jean Morozov
>>
>> On Tue, Mar 29, 2016 at 10:09 PM, Joseph Bradley <jos...@databricks.com> wrote:
>>
>>> First thought: 70K features is *a lot* for the MLlib implementation (and
>>> any PLANET-like implementation).
>>>
>>> Using fewer partitions is a good idea.
>>>
>>> Which Spark version was this on?
>>>
>>> On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote:
>>>
>>>> The questions I have in mind:
>>>>
>>>> Is it something one might expect? From the stack trace itself it's
>>>> not clear where it comes from.
>>>> Is it an already known bug? (Although I haven't found anything like that.)
>>>> Is it possible to configure something to work around / avoid this?
>>>>
>>>> I'm not sure it's the right thing to do, but I've
>>>> - increased the thread stack size tenfold (to 80 MB),
>>>> - reduced the default parallelism tenfold (only 20 cores are available).
>>>>
>>>> Thank you in advance.
>>>>
>>>> --
>>>> Be well!
>>>> Jean Morozov
>>>>
>>>> On Tue, Mar 29, 2016 at 1:12 PM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a web service that provides a REST API to train a random forest.
>>>>> I train the random forest on a 5-node Spark cluster with enough memory:
>>>>> everything is cached (~22 GB).
>>>>> On small datasets of up to 100k samples everything is fine, but with
>>>>> the biggest one (400k samples and ~70k features) I'm stuck with a
>>>>> StackOverflowError.
>>>>>
>>>>> Additional options for my web service:
>>>>> spark.executor.extraJavaOptions="-XX:ThreadStackSize=8192"
>>>>> spark.default.parallelism = 200
>>>>>
>>>>> On the 400k-sample dataset:
>>>>> - with the default thread stack size, it took 4 hours of training to
>>>>> hit the error;
>>>>> - with the increased stack size, it took 60 hours to hit it.
>>>>> I can increase it further, but it's hard to say how much memory it
>>>>> needs, and since it applies to all threads it might waste a lot of
>>>>> memory.
>>>>>
>>>>> Looking at the event timeline for different stages, I see that task
>>>>> deserialization time gradually increases; by the end, task
>>>>> deserialization time is roughly the same as executor computing time.
>>>>>
>>>>> Code I use to train the model:
>>>>>
>>>>> int MAX_BINS = 16;
>>>>> int NUM_CLASSES = 0;
>>>>> double MIN_INFO_GAIN = 0.0;
>>>>> int MAX_MEMORY_IN_MB = 256;
>>>>> double SUBSAMPLING_RATE = 1.0;
>>>>> boolean USE_NODEID_CACHE = true;
>>>>> int CHECKPOINT_INTERVAL = 10;
>>>>> int RANDOM_SEED = 12345;
>>>>>
>>>>> int NODE_SIZE = 5;
>>>>> int maxDepth = 30;
>>>>> int numTrees = 50;
>>>>> Strategy strategy = new Strategy(Algo.Regression(), Variance.instance(),
>>>>>         maxDepth, NUM_CLASSES, MAX_BINS, QuantileStrategy.Sort(),
>>>>>         new scala.collection.immutable.HashMap<>(), NODE_SIZE,
>>>>>         MIN_INFO_GAIN, MAX_MEMORY_IN_MB, SUBSAMPLING_RATE,
>>>>>         USE_NODEID_CACHE, CHECKPOINT_INTERVAL);
>>>>> RandomForestModel model = RandomForest.trainRegressor(
>>>>>         labeledPoints.rdd(), strategy, numTrees, "auto", RANDOM_SEED);
>>>>>
>>>>> Any advice would be highly appreciated.
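A back-of-envelope check of the stack-size concern raised above: `-XX:ThreadStackSize` applies to every thread in the JVM, so the reserved stack memory scales with thread count. The thread counts and stack sizes below are illustrative assumptions, not measurements from this cluster:

```java
public class StackBudget {
    // Total stack memory reserved for `threads` threads of `stackKiB` KiB
    // each, in MiB. This is address space reserved per thread, independent
    // of how deep any thread actually recurses.
    static long reservedMiB(int threads, int stackKiB) {
        return (long) threads * stackKiB / 1024;
    }

    public static void main(String[] args) {
        System.out.println(reservedMiB(200, 1024)); // 1 MiB stacks ->  200 MiB
        System.out.println(reservedMiB(200, 8192)); // 8 MiB stacks -> 1600 MiB
    }
}
```

This is why raising the stack size is a blunt workaround: a limit big enough for the one deep deserialization call chain is paid by every thread in the executor.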
>>>>>
>>>>> The exception (~3000 lines long):
>>>>>
>>>>> java.lang.StackOverflowError
>>>>>     at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2320)
>>>>>     at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2333)
>>>>>     at java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:2828)
>>>>>     at java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1453)
>>>>>     at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1512)
>>>>>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>>>>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>     at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>>>>>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>>>>>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>>     at scala.collection.immutable.$colon$colon.readObject(List.scala:366)
>>>>>     at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>     at java.lang.reflect.Method.invoke(Method.java:497)
>>>>>     at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>>>>>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>>>>>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>     at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>>>>>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>>>>>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>     at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>>>>>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>>>>>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>>     at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
>>>>>     at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>     at java.lang.reflect.Method.invoke(Method.java:497)
>>>>>     at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>>>>>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>>>>>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>     at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>>>>>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>>>>>     ...
>>>>>
>>>>> --
>>>>> Be well!
>>>>> Jean Morozov
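The trace above is dominated by ObjectInputStream recursing through scala.collection.immutable.List cons cells (`$colon$colon.readObject`): default Java serialization of a linked structure consumes several stack frames per element, so a long enough chain overflows any fixed stack. A minimal standalone reproduction of that failure mode (plain JDK, no Spark; the `Node` class and the chain lengths are made up for illustration):

```java
import java.io.*;

public class DeepSerialization {
    // Toy linked node: default serialization walks the `next` chain
    // recursively, one group of frames per node -- the same shape of
    // recursion the List deserialization hits in the trace above.
    static class Node implements Serializable {
        final int value;
        final Node next;
        Node(int value, Node next) { this.value = value; this.next = next; }
    }

    /** Round-trips a chain of n nodes; reports where the stack blew, if it did. */
    static String deepRoundTrip(int n) {
        Node head = null;
        for (int i = 0; i < n; i++) head = new Node(i, head); // built iteratively
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(head);
        } catch (StackOverflowError e) {
            return "overflow during write";
        } catch (IOException e) {
            return "io error";
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            in.readObject();
            return "ok";
        } catch (StackOverflowError e) {
            return "overflow during read";
        } catch (IOException | ClassNotFoundException e) {
            return "io error";
        }
    }

    public static void main(String[] args) {
        System.out.println(deepRoundTrip(100));       // shallow chain round-trips fine
        System.out.println(deepRoundTrip(1_000_000)); // deep chain blows the stack
    }
}
```

This suggests why deep trees (maxDepth = 30) make the problem worse: the serialized model and intermediate node structures get deeper, and raising the stack size only postpones the overflow rather than removing the recursion.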