Re: LBGFS optimizer performace

2015-03-06 Thread Gustavo Enrique Salazar Torres
Hi there: Yeah, I came to that same conclusion after tuning the Spark SQL shuffle parameter. I also cut out some classes I was using to parse my dataset and finally created the schema with only the fields needed for my model (before that I was creating it with 63 fields when I needed just 15). So I came wi

Re: LBGFS optimizer performace

2015-03-05 Thread DB Tsai
PS, I would recommend compressing the data when you cache the RDD. There will be some overhead from compression/decompression and serialization/deserialization, but it will help a lot for iterative algorithms because more data can be cached. Sincerely, DB Tsai ---
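The trade-off described above can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark code: `cache_compressed`, `read_cached`, and the sample `rows` are hypothetical names and data made up for illustration. It keeps the records as one serialized, compressed blob and pays the decompression and deserialization cost on every read. In Spark itself the corresponding setting is, as far as I know, `spark.rdd.compress` combined with a serialized storage level.

```python
import pickle
import zlib


def cache_compressed(records):
    """Serialize the records and compress the bytes; the blob is what stays in memory."""
    return zlib.compress(pickle.dumps(records, protocol=pickle.HIGHEST_PROTOCOL))


def read_cached(blob):
    """Decompress and deserialize; this cost is paid on every pass over the data."""
    return pickle.loads(zlib.decompress(blob))


# Hypothetical training rows: (label, features) with 15 numeric fields each.
rows = [(i % 2, [float(i % 7)] * 15) for i in range(10_000)]

raw_size = len(pickle.dumps(rows, protocol=pickle.HIGHEST_PROTOCOL))
blob = cache_compressed(rows)
print(f"serialized: {raw_size} bytes, compressed: {len(blob)} bytes")
```

How much is saved depends entirely on the data; the repetitive sample rows here compress far better than real features are likely to.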

Re: LBGFS optimizer performace

2015-03-04 Thread Gustavo Enrique Salazar Torres
Yeah, without caching it gets really slow. I will try to minimize the number of columns in my tables; that may save a lot of memory and will eventually work. I will let you know. Thanks! Gustavo On Tue, Mar 3, 2015 at 8:58 PM, Joseph Bradley wrote: > I would recommend caching; if you can't

Re: LBGFS optimizer performace

2015-03-03 Thread Joseph Bradley
I would recommend caching; if you can't persist, iterative algorithms will not work well. I don't think calling count on the dataset is problematic; every iteration in LBFGS iterates over the whole dataset and does a lot more computation than count(). It would be helpful to see some error occurri
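To make the caching point concrete, here is a small plain-Python sketch under stated assumptions. `LazyDataset` is a hypothetical stand-in for an uncached RDD, `parse` and `source` are made-up sample data, and the optimizer is plain batch gradient descent rather than L-BFGS. What it shares with L-BFGS is the access pattern: one full pass over the data per iteration, so an uncached dataset is re-derived on every iteration. The example count is accumulated during that same pass instead of in a separate counting step.

```python
import math


class LazyDataset:
    """Stand-in for an uncached RDD: rows are re-derived from the source on every pass."""

    def __init__(self, source, parse):
        self.source = source
        self.parse = parse
        self.parse_calls = 0
        self._cache = None

    def _parse(self, raw):
        self.parse_calls += 1
        return self.parse(raw)

    def persist(self):
        """Materialize the parsed rows once so later passes reuse them."""
        self._cache = [self._parse(raw) for raw in self.source]
        return self

    def __iter__(self):
        if self._cache is not None:
            return iter(self._cache)
        return (self._parse(raw) for raw in self.source)


def train(data, n_features, iterations=10, step=0.1):
    """Batch gradient descent for logistic regression (not L-BFGS)."""
    w = [0.0] * n_features
    for _ in range(iterations):
        grad = [0.0] * n_features
        n = 0
        for label, x in data:  # one full pass over the dataset per iteration
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(x):
                grad[j] += (p - label) * xj
            n += 1  # the example count comes for free during the pass
        w = [wi - step * g / n for wi, g in zip(w, grad)]
    return w


def parse(line):
    """Parse a hypothetical 'label,f1,f2' line into (label, features)."""
    label, *features = line.split(",")
    return int(label), [float(f) for f in features]


source = ["1,1.0,2.0", "0,-1.0,-2.0", "1,2.0,1.0", "0,-2.0,-1.0"]

uncached = LazyDataset(source, parse)
w_uncached = train(uncached, n_features=2)

cached = LazyDataset(source, parse).persist()
w_cached = train(cached, n_features=2)

print(uncached.parse_calls, cached.parse_calls)
```

With 4 rows and 10 iterations, the uncached run parses 40 rows while the persisted run parses 4, and both arrive at the same weights.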

Re: LBGFS optimizer performace

2015-03-03 Thread Gustavo Enrique Salazar Torres
Yeah, I can call count before that and it works. I was also over-caching tables, but I removed those. Now there is no caching, but it gets really slow since it recomputes my table RDD many times. I also hacked the LBFGS code to pass in the number of examples, which I calculated outside in a Spark SQL query

Re: LBGFS optimizer performace

2015-03-03 Thread Joseph Bradley
Is that error actually occurring in LBFGS? It looks like it might be happening before the data even gets to LBFGS. (Perhaps the outer join you're trying to do is making the dataset size explode a bit.) Are you able to call count() (or any RDD action) on the data before you pass it to LBFGS? On

Re: LBGFS optimizer performace

2015-03-03 Thread Gustavo Enrique Salazar Torres
Just did, with the same error. I think the problem is the "data.count()" call in LBFGS, because for huge datasets that's naive to do. I was thinking of writing my own version of LBFGS, but instead of calling data.count() I will pass that parameter, which I will calculate from a Spark SQL query. I will let you

Re: LBGFS optimizer performace

2015-03-02 Thread Akhil Das
Can you try increasing your driver memory, reducing the number of executors, and increasing the executor memory? Thanks Best Regards On Tue, Mar 3, 2015 at 10:09 AM, Gustavo Enrique Salazar Torres < gsala...@ime.usp.br> wrote: > Hi there: > > I'm using LBFGS optimizer to train a logistic regression model.

LBGFS optimizer performace

2015-03-02 Thread Gustavo Enrique Salazar Torres
Hi there: I'm using the LBFGS optimizer to train a logistic regression model. The code I implemented follows the pattern shown in https://spark.apache.org/docs/1.2.0/mllib-linear-methods.html but the training data is obtained from a Spark SQL RDD. The problem I'm having is that LBFGS tries to count the e