Re: LogisticRegression models consume all driver memory

2015-09-25 Thread Eugene Zhulenev
The problem turned out to be too high a 'spark.default.parallelism': BinaryClassificationMetrics does a combineByKey, which internally shuffles the training dataset. Lowering the parallelism, plus cutting the training-set RDD history with a save/read round-trip through Parquet, solved the problem. Thanks for the hint!
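
A minimal sketch of this fix in Spark 1.5 Scala, assuming the training set is a DataFrame named trainDF; the parallelism value, app name, and paths are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Keep default parallelism near executors * cores instead of tens of
    // thousands; combineByKey inside BinaryClassificationMetrics shuffles
    // the scored dataset into this many partitions by default.
    val conf = new SparkConf()
      .setAppName("lineage-cut-sketch")
      .set("spark.default.parallelism", "800") // hypothetical: 100 executors * 8 cores
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // trainDF stands in for the big multi-DataFrame join from the original post.
    val trainDF = sqlContext.read.parquet("/data/joined-train") // hypothetical path

    // Round-trip through Parquet to cut the RDD lineage: the re-read
    // DataFrame carries no join history for the driver to keep track of.
    trainDF.write.parquet("/tmp/train-snapshot.parquet")
    val freshTrainDF = sqlContext.read.parquet("/tmp/train-snapshot.parquet")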

LogisticRegression models consume all driver memory

2015-09-23 Thread Eugene Zhulenev
We are running Apache Spark 1.5.0 (latest code from the 1.5 branch). We are running 2-3 LogisticRegression models in parallel (we'd love to run 10-20, actually); they are not really big at all, maybe 1-2 million rows in each model. The cluster itself and all executors look good: enough free memory and no
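
One way to run a handful of models concurrently, as described above, is to submit each fit from its own future on the driver. A sketch, assuming each model has its own DataFrame with "label" and "features" columns; the hyperparameters are hypothetical:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.sql.DataFrame

    // Each fit() submits its own Spark jobs; the futures just let the
    // driver keep several model runs in flight at once.
    def trainAll(trainSets: Seq[DataFrame]) = {
      val futures = trainSets.map { df =>
        Future {
          new LogisticRegression()
            .setMaxIter(100)   // hypothetical hyperparameters
            .setRegParam(0.01)
            .fit(df)
        }
      }
      Await.result(Future.sequence(futures), Duration.Inf)
    }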

Re: LogisticRegression models consume all driver memory

2015-09-23 Thread DB Tsai
You want to reduce the # of partitions to around the # of executors * cores. So many tasks/partitions put a lot of pressure on treeReduce in LoR. Let me know if this helps. Sincerely, DB Tsai
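
A sketch of that advice, assuming trainDF is the cached training DataFrame and using the cluster size Eugene reports later in the thread (100 executors x 8 cores):

    // Target partition count ~= executors * cores per executor.
    val numExecutors = 100
    val coresPerExecutor = 8
    val target = numExecutors * coresPerExecutor // 800

    // coalesce() only merges existing partitions and avoids a full shuffle;
    // use repartition(target) instead if the data needs rebalancing.
    val compacted = trainDF.coalesce(target)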

Re: LogisticRegression models consume all driver memory

2015-09-23 Thread DB Tsai
Could you paste some of your code for diagnosis? Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D

Re: LogisticRegression models consume all driver memory

2015-09-23 Thread DB Tsai
Your code looks correct to me. How many features do you have in this training? How many tasks are running in the job? Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D

Re: LogisticRegression models consume all driver memory

2015-09-23 Thread Eugene Zhulenev
~3000 features, pretty sparse; I think about 200-300 non-zero features in each row. We have 100 executors x 8 cores. The number of tasks is pretty big, 30k-70k; I can't remember the exact number. The training set is the result of a pretty big join across multiple data frames, but it's cached. However, as I understand
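
For context, a row like that would typically be stored as an MLlib SparseVector, which keeps only (index, value) pairs for the non-zero entries; the indices and values below are made up:

    import org.apache.spark.mllib.linalg.Vectors

    // A ~3000-dimensional row with only a few non-zero features.
    val row = Vectors.sparse(3000, Array(4, 172, 2891), Array(1.0, 0.5, 2.0))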