Hi there:
Yeah, I came to the same conclusion after tuning the Spark SQL shuffle
parameter. I also cut out some classes I was using to parse my dataset and
finally created a schema with only the fields needed for my model (before
that I was creating it with 63 fields when I only needed 15).
So I came wi
PS: I recommend compressing the data when you cache the RDD.
There will be some overhead from compression/decompression and
serialization/deserialization, but it helps a lot for iterative
algorithms because you can cache more data.
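A minimal sketch of that suggestion, assuming Spark 1.2-era settings (the app name and placeholder RDD are illustrative; the real RDD would be the LabeledPoint data built from the Spark SQL query):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("lbfgs-training")
  .set("spark.rdd.compress", "true")  // compress serialized cached partitions

val sc = new SparkContext(conf)

// Placeholder standing in for the real RDD[LabeledPoint] from Spark SQL.
val trainingData = sc.parallelize(1 to 1000).map(_.toDouble)

// MEMORY_ONLY_SER stores partitions in serialized form, so spark.rdd.compress
// applies: more CPU per access, but far more data fits in memory.
trainingData.persist(StorageLevel.MEMORY_ONLY_SER)
```

The trade-off is exactly the one described above: each access pays a deserialization (and decompression) cost, which is usually worth it when the alternative is recomputing the RDD on every LBFGS iteration.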
Sincerely,
DB Tsai
---
Yeah, without caching it gets really slow. I will try to minimize the
number of columns in my tables; that may save a lot of memory and will
eventually work.
I will let you know.
Thanks!
Gustavo
On Tue, Mar 3, 2015 at 8:58 PM, Joseph Bradley
wrote:
I would recommend caching; if you can't persist, iterative algorithms will
not work well.
I don't think calling count on the dataset is problematic; every iteration
in LBFGS iterates over the whole dataset and does a lot more computation
than count().
It would be helpful to see some error occurri
Yeah, I can call count() before that and it works. I was also over-caching
tables, but I removed those. Now there is no caching, but it gets really
slow since it recalculates my table RDD many times.
I also hacked the LBFGS code to pass in the number of examples, which I
calculated beforehand with a Spark SQL query.
Is that error actually occurring in LBFGS? It looks like it might be
happening before the data even gets to LBFGS. (Perhaps the outer join
you're trying to do is making the dataset size explode a bit.) Are you
able to call count() (or any RDD action) on the data before you pass it to
LBFGS?
Just did, with the same error.
I think the problem is the "data.count()" call in LBFGS, because for huge
datasets that's too expensive to do.
I was thinking of writing my own version of LBFGS that, instead of calling
data.count(), takes that value as a parameter, which I will calculate with
a Spark SQL query.
I will let you know.
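The plan above could be sketched as follows. This is only an illustration, assuming Spark 1.2: the table name "training" is made up, and `MyLBFGS.runLBFGS` is a hypothetical copy of MLlib's `LBFGS.runLBFGS` modified to accept a precomputed count instead of calling `data.count()` internally:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// One aggregate query up front, instead of an RDD-wide count inside the
// optimizer on every run. Assumes a registered table named "training".
val numExamples = sqlContext.sql("SELECT COUNT(*) FROM training")
  .first().getLong(0)

// Hypothetical modified entry point (same signature as LBFGS.runLBFGS,
// plus the precomputed numExamples argument):
// val weights = MyLBFGS.runLBFGS(data, gradient, updater, numCorrections,
//   convergenceTol, maxNumIterations, regParam, initialWeights, numExamples)
```

Whether this helps depends on whether count() is really the bottleneck; as noted elsewhere in the thread, each LBFGS iteration already does far more work than a count.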
Can you try increasing your driver memory, reducing the number of executors,
and increasing the executor memory?
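For illustration only, with made-up sizes (the right values depend on the cluster), that tuning might look like:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "8g")    // more memory per executor
  .set("spark.executor.instances", "4")  // fewer, larger executors (YARN)

// Note: spark.driver.memory must be set before the driver JVM starts,
// e.g. via `spark-submit --driver-memory 8g`, not in application code.
```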
Thanks
Best Regards
On Tue, Mar 3, 2015 at 10:09 AM, Gustavo Enrique Salazar Torres <
gsala...@ime.usp.br> wrote:
Hi there:
I'm using the LBFGS optimizer to train a logistic regression model. The code
I implemented follows the pattern shown in
https://spark.apache.org/docs/1.2.0/mllib-linear-methods.html, but the
training data is obtained from a Spark SQL RDD.
The problem I'm having is that LBFGS tries to count the examples in my
dataset.
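For context, the pattern from that guide with Spark SQL as the data source might look like this sketch (assumption: the table and column names are illustrative, and `sc` is an existing SparkContext):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Select only the columns the model needs. Table/column names are made up.
val rows = sqlContext.sql("SELECT label, f1, f2 FROM training")

// Map each Row to a LabeledPoint for MLlib.
val data = rows.map { r =>
  LabeledPoint(r.getDouble(0), Vectors.dense(r.getDouble(1), r.getDouble(2)))
}.cache()  // LBFGS makes many passes over the data

val model = new LogisticRegressionWithLBFGS().run(data)
```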