Hi Suraj,

I don't see any logs from MLlib. You might need to explicitly set the logging level to DEBUG for MLlib. Adding this line to log4j.properties might fix the problem:

log4j.logger.org.apache.spark.mllib.tree=DEBUG
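For context, a conf/log4j.properties based on the template shipped with Spark 1.x might look like the following sketch; the appender lines come from the default template, and only the last line is the addition suggested above:

```properties
# conf/log4j.properties (based on Spark's default template; adjust to your setup)
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Enable DEBUG output for the MLlib decision tree implementation
log4j.logger.org.apache.spark.mllib.tree=DEBUG
```

The file needs to be on the classpath of both the driver and the executors for the tree's DEBUG output to show up in all logs.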
Also, please let me know if you encounter similar problems with the Spark master.

-Manish

On Sat, Jun 14, 2014 at 3:19 AM, SURAJ SHETH <shet...@gmail.com> wrote:

> Hi Manish,
> Thanks for your reply.
>
> I am attaching the logs here (regression, 5 levels). It contains the last
> few hundred lines. I am also attaching a screenshot of the Spark UI. The
> first 4 levels complete in less than 6 seconds, while the 5th level
> doesn't complete even after several hours.
> Since this is somebody else's data, I can't share it.
>
> Can you check the code snippet attached in my first email and see if it
> needs anything to make it work for large data and >= 5 levels? It works
> for 3 levels on the same dataset, but not for 5 levels.
>
> In the meantime, I will try to run it on the latest master and let you
> know the results. If it runs fine there, it may be related to the 128 MB
> limit issue that you mentioned.
>
> Thanks and Regards,
> Suraj Sheth
>
>
> On Sat, Jun 14, 2014 at 12:05 AM, Manish Amde <manish...@gmail.com> wrote:
>
>> Hi Suraj,
>>
>> I can't answer 1) without knowing the data. However, the results for 2)
>> are surprising indeed. We have tested with a billion samples for
>> regression tasks, so I am perplexed by this behavior.
>>
>> Could you try the latest Spark master to see whether this problem goes
>> away? It has code that limits memory consumption at the master and
>> worker nodes to 128 MB by default, which ideally should not be needed
>> given the amount of RAM on your cluster.
>>
>> Also, feel free to send the DEBUG logs. They might give me a better idea
>> of where the algorithm is getting stuck.
>>
>> -Manish
>>
>>
>> On Wed, Jun 11, 2014 at 1:20 PM, SURAJ SHETH <shet...@gmail.com> wrote:
>>
>>> Hi Filipus,
>>> The train data is already oversampled.
>>> The number of positives I mentioned above is for the test dataset:
>>> 12,028 (apologies for not making this clear earlier).
>>> The train dataset has 61,264 positives out of 689,763 total rows. The
>>> number of negatives is 628,499.
>>> Oversampling was done for the train dataset to ensure that we have at
>>> least 9-10% positives in the train part.
>>> No oversampling was done for the test dataset.
>>>
>>> So, the only difference that remains is the amount of data used for
>>> building a tree.
>>>
>>> But I have a few more questions:
>>> Have we tested how much data can be used, at most, to build a single
>>> Decision Tree?
>>> Since I have enough RAM to fit all the data into memory (only 1.3 GB of
>>> train data and 30x3 GB of RAM), I would expect it to build a single
>>> Decision Tree with all the data without any issues. But for maxDepth >=
>>> 5, it is not able to. I confirmed that while it keeps running for
>>> hours, more than 70% of memory remains free, so it doesn't seem to be a
>>> memory issue either.
>>>
>>> Thanks and Regards,
>>> Suraj Sheth
>>>
>>>
>>> On Wed, Jun 11, 2014 at 10:19 PM, filipus <floe...@gmail.com> wrote:
>>>
>>>> Well, I guess your problem is quite unbalanced, and with information
>>>> value as the splitting criterion I guess the algo stops after very few
>>>> splits.
>>>>
>>>> A workaround is oversampling:
>>>>
>>>> build many training datasets, like
>>>>
>>>> take randomly 50% of the positives, and from the negatives the same
>>>> amount, or say double that
>>>>
>>>> => 6000 positives and 12000 negatives
>>>>
>>>> build a tree
>>>>
>>>> do this many times => many models (agents)
>>>>
>>>> and then you make an ensemble model,
>>>> meaning all the models vote,
>>>>
>>>> in a way similar to random forests, but built in a completely
>>>> different way.
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Decision-Tree-not-getting-built-for-5-or-more-levels-maxDepth-5-and-the-one-built-for-3-levelsy-tp7401p7405.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
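The train-set class balance Suraj reports in the thread above can be sanity-checked with a little arithmetic (all numbers taken from his message):

```python
# Train-set class balance reported in the thread
positives = 61_264
negatives = 628_499
total = positives + negatives

print(total)  # -> 689763, matching the reported total row count

positive_rate = positives / total
print(f"{positive_rate:.1%}")  # -> 8.9%, consistent with the "at least 9-10%" oversampling target
```

So the oversampled train set sits just under the stated 9-10% positive-rate target, while the test set was left at its natural balance.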
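Filipus's resample-and-vote suggestion can be sketched in plain Python. Everything here is a hypothetical stand-in: `train_stump` replaces a real MLlib DecisionTree, the 1-D toy data replaces the real dataset, and the sample sizes are arbitrary; the point is only to make the balanced-sampling and majority-vote mechanics concrete:

```python
import random
from collections import Counter

def balanced_sample(positives, negatives, n_pos, neg_ratio=2, rng=random):
    """Draw one balanced training set: n_pos positives plus neg_ratio * n_pos negatives."""
    return rng.sample(positives, n_pos) + rng.sample(negatives, neg_ratio * n_pos)

def train_stump(sample):
    """Toy stand-in for tree training: split at the mean feature value of the sample."""
    threshold = sum(x for x, _ in sample) / len(sample)
    return lambda x: 1 if x > threshold else 0

def majority_vote(models, x):
    """Ensemble prediction: each model votes, the most common label wins."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

# Toy 1-D data: positives cluster at high feature values, negatives at low ones.
rng = random.Random(42)
positives = [(x, 1) for x in range(60, 100)]
negatives = [(x, 0) for x in range(0, 60)]

# Many resampled balanced datasets => many models (an odd count avoids vote ties).
models = [train_stump(balanced_sample(positives, negatives, n_pos=20, rng=rng))
          for _ in range(11)]

print(majority_vote(models, 90))  # high feature value -> 1 (positive)
print(majority_vote(models, 5))   # low feature value  -> 0 (negative)
```

As filipus notes, this is close in spirit to a random forest (bootstrap-style resampling plus voting), except that each resample is deliberately rebalanced toward the positive class rather than drawn uniformly.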