Can you try reducing maxBins?  That reduces communication (at the cost of
coarser discretization of continuous features).
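
For example, a minimal sketch against the Strategy construction quoted
downthread (everything except MAX_BINS is assumed unchanged):

int MAX_BINS = 8;  // was 16; fewer bins = coarser candidate splits, smaller stats to aggregate
Strategy strategy = new Strategy(Algo.Regression(), Variance.instance(),
        maxDepth, NUM_CLASSES, MAX_BINS, QuantileStrategy.Sort(),
        new scala.collection.immutable.HashMap<>(), NODE_SIZE, MIN_INFO_GAIN,
        MAX_MEMORY_IN_MB, SUBSAMPLING_RATE, USE_NODEID_CACHE, CHECKPOINT_INTERVAL);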

On Fri, Apr 1, 2016 at 11:32 AM, Joseph Bradley <jos...@databricks.com>
wrote:

> In my experience, 20K is a lot but often doable; 2K is easy; 200 is
> small.  Communication scales linearly in the number of features.
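>
> As a rough back-of-envelope of why it's linear (all constants here are
> illustrative assumptions, not MLlib's exact layout):
>
> // Approximate bytes aggregated per tree-growing iteration:
> // nodes being split x features x bins x stats per bin x 8 bytes each.
> long nodesPerIteration = 100;  // assumption
> long numFeatures = 20_000;
> long numBins = 16;             // maxBins
> long statsPerBin = 3;          // e.g. count, sum, sum of squares (assumption)
> long bytes = nodesPerIteration * numFeatures * numBins * statsPerBin * Double.BYTES;
> System.out.printf("~%.2f GB per iteration%n", bytes / 1e9);  // ~0.77 GB; 200 features => ~100x less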
>
> On Thu, Mar 31, 2016 at 6:12 AM, Eugene Morozov <
> evgeny.a.moro...@gmail.com> wrote:
>
>> Joseph,
>>
>> Correction: there are 20k features. Is that still a lot?
>> What number of features would be considered normal?
>>
>> --
>> Be well!
>> Jean Morozov
>>
>> On Tue, Mar 29, 2016 at 10:09 PM, Joseph Bradley <jos...@databricks.com>
>> wrote:
>>
>>> First thought: 70K features is *a lot* for the MLlib implementation (and
>>> any PLANET-like implementation).
>>>
>>> Using fewer partitions is a good idea.
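>>>
>>> For example (a sketch: labeledPoints is the training RDD from downthread,
>>> and 20 is just an assumed target matching the ~20 available cores):
>>>
>>> // coalesce() lowers the partition count without a full shuffle
>>> JavaRDD<LabeledPoint> coalesced = labeledPoints.coalesce(20);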
>>>
>>> Which Spark version was this on?
>>>
>>> On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov <
>>> evgeny.a.moro...@gmail.com> wrote:
>>>
>>>> The questions I have in mind:
>>>>
>>>> Is it something one might expect? From the stack trace itself it's not
>>>> clear where it comes from.
>>>> Is it an already known bug? I haven't found anything like that.
>>>> Is it possible to configure something to work around / avoid this?
>>>>
>>>> I'm not sure it's the right thing to do, but I've:
>>>>     increased the thread stack size 10 times (to 80 MB)
>>>>     reduced the default parallelism 10 times (only 20 cores are available)
>>>>
>>>> Thank you in advance.
>>>>
>>>> --
>>>> Be well!
>>>> Jean Morozov
>>>>
>>>> On Tue, Mar 29, 2016 at 1:12 PM, Eugene Morozov <
>>>> evgeny.a.moro...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a web service that provides a REST API to train the random
>>>>> forest algorithm.
>>>>> I train the random forest on a 5-node Spark cluster with enough memory;
>>>>> everything is cached (~22 GB).
>>>>> On small datasets of up to 100k samples everything is fine, but with
>>>>> the biggest one (400k samples and ~70k features) I'm stuck with a
>>>>> StackOverflowError.
>>>>>
>>>>> Additional options for my web service:
>>>>>     spark.executor.extraJavaOptions="-XX:ThreadStackSize=8192"
>>>>>     spark.default.parallelism=200
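>>>>>
>>>>> (For reference, a minimal sketch of setting these programmatically; the
>>>>> app name is a hypothetical placeholder:)
>>>>>
>>>>> SparkConf conf = new SparkConf()
>>>>>         .setAppName("rf-training-service")  // hypothetical
>>>>>         .set("spark.executor.extraJavaOptions", "-XX:ThreadStackSize=8192")  // in KB, so ~8 MB stacks
>>>>>         .set("spark.default.parallelism", "200");
>>>>> JavaSparkContext sc = new JavaSparkContext(conf);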
>>>>>
>>>>> On the 400k-sample dataset:
>>>>> - with the default thread stack size, it took 4 hours of training to
>>>>> hit the error;
>>>>> - with the increased stack size, it took 60 hours to hit it.
>>>>> I can increase it further, but it's hard to say how much memory it
>>>>> needs, and since it applies to all of the threads it might waste a lot
>>>>> of memory.
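>>>>>
>>>>> (For a sense of scale: 8192 KB is ~8 MB per thread stack, so the 10x
>>>>> increase is ~80 MB each; with, say, 200 live threads in one executor
>>>>> JVM that's ~16 GB reserved for stacks alone. The thread count here is
>>>>> only an assumption for illustration.)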
>>>>>
>>>>> I'm looking at different stages in the event timeline now and see that
>>>>> task deserialization time gradually increases. By the end, task
>>>>> deserialization time is roughly the same as executor computing time.
>>>>>
>>>>> The code I use to train the model (note NODE_SIZE is the
>>>>> min-instances-per-node argument):
>>>>>
>>>>> int MAX_BINS = 16;
>>>>> int NUM_CLASSES = 0;              // ignored for regression
>>>>> double MIN_INFO_GAIN = 0.0;
>>>>> int MAX_MEMORY_IN_MB = 256;
>>>>> double SUBSAMPLING_RATE = 1.0;
>>>>> boolean USE_NODEID_CACHE = true;
>>>>> int CHECKPOINT_INTERVAL = 10;
>>>>> int RANDOM_SEED = 12345;
>>>>>
>>>>> int NODE_SIZE = 5;                // min instances per node
>>>>> int maxDepth = 30;
>>>>> int numTrees = 50;
>>>>>
>>>>> Strategy strategy = new Strategy(Algo.Regression(), Variance.instance(),
>>>>>         maxDepth, NUM_CLASSES, MAX_BINS, QuantileStrategy.Sort(),
>>>>>         new scala.collection.immutable.HashMap<>(),  // no categorical features
>>>>>         NODE_SIZE, MIN_INFO_GAIN, MAX_MEMORY_IN_MB, SUBSAMPLING_RATE,
>>>>>         USE_NODEID_CACHE, CHECKPOINT_INTERVAL);
>>>>> RandomForestModel model = RandomForest.trainRegressor(
>>>>>         labeledPoints.rdd(), strategy, numTrees, "auto", RANDOM_SEED);
>>>>>
>>>>>
>>>>> Any advice would be highly appreciated.
>>>>>
>>>>> The exception (~3000 lines long):
>>>>> java.lang.StackOverflowError
>>>>>         at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2320)
>>>>>         at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2333)
>>>>>         at java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:2828)
>>>>>         at java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1453)
>>>>>         at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1512)
>>>>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>>>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>>>>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>>>>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>>         at scala.collection.immutable.$colon$colon.readObject(List.scala:366)
>>>>>         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>         at java.lang.reflect.Method.invoke(Method.java:497)
>>>>>         at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>>>>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>>>>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>>>>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>>>>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>>>>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>>>>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>>         at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
>>>>>         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>         at java.lang.reflect.Method.invoke(Method.java:497)
>>>>>         at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>>>>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>>>>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>>>>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>>>>>
>>>>> --
>>>>> Be well!
>>>>> Jean Morozov
>>>>>
>>>>
>>>>
>>>
>>
>
