I am running my own PySpark application (solving matrix factorization using
Gemulla's DSGD algorithm). The program seemed to work fine on the smaller
MovieLens dataset but failed on the larger Netflix data. It took about 14
hours to complete two iterations and lost an executor (I used 8 executors in
total).
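For context, the core of Gemulla's DSGD is an ordinary SGD update applied to blocks of ratings whose user and item ranges do not overlap, so blocks in the same stratum can be trained in parallel without conflicts. A minimal single-block sketch in NumPy (the block partitioning and stratum scheduling that the PySpark driver would handle are omitted; all names here are illustrative, not from the original program):

```python
import numpy as np

def sgd_block_update(ratings, W, H, lr=0.01, reg=0.1):
    """One SGD pass over a block of (user, item, rating) triples.

    W: user factors (n_users x k); H: item factors (n_items x k).
    This is the local update each DSGD worker applies to its block.
    Blocks in the same stratum share no rows of W and no rows of H,
    so concurrent workers never touch the same factor vectors.
    """
    for u, i, r in ratings:
        err = r - W[u] @ H[i]          # prediction error on this rating
        w_old = W[u].copy()            # use pre-update W[u] for H's gradient
        W[u] += lr * (err * H[i] - reg * W[u])
        H[i] += lr * (err * w_old - reg * H[i])
    return W, H
```

Repeating this pass over successive strata until all blocks are visited completes one DSGD iteration over the full rating matrix.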
I was able to run collaborative filtering with low rank values, like 20-160,
on the Netflix dataset, but it fails with the following error when I set the
rank to 1000:
14/10/03 03:27:36 WARN TaskSetManager: Loss was due to
java.lang.IllegalArgumentException
java.lang.IllegalArgumentException:
Thanks, Xiangrui.
I haven't checked the test error yet. I agree that rank 1000 might overfit on
this particular dataset. Currently I'm just running some scalability tests -
I'm trying to see how large the model can scale given a fixed amount of
hardware.
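As a rough back-of-envelope check for that scalability test (assuming the standard Netflix Prize dimensions of about 480,189 users and 17,770 movies, which are not stated in this thread), the dense factor matrices alone at rank 1000 are already several gigabytes of double-precision data per full copy:

```python
# Approximate dense-factor memory for Netflix-scale matrix factorization.
# User/movie counts are the usual Netflix Prize figures (an assumption here).
n_users, n_movies, rank = 480_189, 17_770, 1000
bytes_per_double = 8

user_factors_gb = n_users * rank * bytes_per_double / 1e9
movie_factors_gb = n_movies * rank * bytes_per_double / 1e9

print(f"user factors:  {user_factors_gb:.2f} GB")   # ~3.84 GB
print(f"movie factors: {movie_factors_gb:.2f} GB")  # ~0.14 GB
```

Anything that materializes or shuffles a full copy of the user-factor matrix per executor would therefore be pressing against typical executor memory at rank 1000, which is consistent with runs succeeding at ranks 20-160 and failing at 1000.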