Yes. But you need to store the RDD as *serialized* Java objects. See the
section on storage levels:
http://spark.apache.org/docs/latest/programming-guide.html
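A minimal sketch of what that looks like (assuming a spark-shell style SparkContext `sc` and a made-up path):

import org.apache.spark.storage.StorageLevel

// Cache as serialized Java objects instead of deserialized objects;
// trades some CPU on access for a much smaller memory footprint.
val data = sc.textFile("hdfs:///path/to/data") // hypothetical path
data.persist(StorageLevel.MEMORY_ONLY_SER)
data.count() // materialize the cache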
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
To save memory, I recommend compressing the cached RDD; it will be a
couple of times smaller than the original data set.
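A sketch of the relevant setting (note spark.rdd.compress only applies to serialized storage levels such as MEMORY_ONLY_SER):

import org.apache.spark.{SparkConf, SparkContext}

// Compress serialized cached partitions; costs extra CPU when they are read back.
val conf = new SparkConf()
  .setAppName("compressed-cache") // hypothetical app name
  .set("spark.rdd.compress", "true")
val sc = new SparkContext(conf)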
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
On Wed, Sep 3,
Hi Xiangrui,
A side question about MLlib.
It looks like the current LBFGS in MLlib (version 1.0.2 and even v1.1) only
supports L2 regularization; the doc explains it: The L1 regularization by using
L1Updater
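For context, a sketch of the low-level API in question (assuming a prepared trainingData: RDD[(Double, Vector)] and an initialWeights vector):

import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

// MLlib 1.x LBFGS pairs with the smooth SquaredL2Updater;
// L1Updater's soft-thresholding step is designed for SGD, not LBFGS.
val (weights, lossHistory) = LBFGS.runLBFGS(
  trainingData,
  new LogisticGradient(),
  new SquaredL2Updater(), // L2 regularization
  10,    // numCorrections
  1e-4,  // convergenceTol
  100,   // maxNumIterations
  0.1,   // regParam
  initialWeights)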
+DB, David. (They implemented OWLQN on Spark today.)
On Sep 3, 2014 7:18 PM, Jiusheng Chen chenjiush...@gmail.com wrote:
Hi Xiangrui,
A side question about MLlib.
It looks like the current LBFGS in MLlib (version 1.0.2 and even v1.1) only
supports L2 regularization; the doc explains it: The L1
With David's help today, we were able to implement elastic net GLM in
Spark. It's surprisingly easy: with just some modification to Breeze's
OWLQN code, it just worked without further investigation.
We did a benchmark, and the coefficients are within 0.5% difference compared
with R's glmnet
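The rough idea, as a single-machine Breeze sketch (lambda, alpha, and the toy data are illustrative assumptions, not the actual code): fold the smooth L2 piece of the elastic-net penalty into the differentiable objective, and hand only the L1 piece to OWLQN.

import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, OWLQN}

val lambda = 0.1 // overall regularization strength (assumed)
val alpha = 0.5  // elastic-net mixing: 1.0 = lasso, 0.0 = ridge (assumed)
val data = Seq((DenseVector(1.0, 2.0), 3.0), (DenseVector(2.0, 1.0), 3.0))

// Smooth part: squared loss plus the differentiable L2 piece of the penalty.
val smooth = new DiffFunction[DenseVector[Double]] {
  def calculate(w: DenseVector[Double]) = {
    var loss = 0.0
    val grad = DenseVector.zeros[Double](w.length)
    for ((x, y) <- data) {
      val err = (w dot x) - y
      loss += 0.5 * err * err
      grad += x * err
    }
    val l2 = lambda * (1.0 - alpha)
    (loss + 0.5 * l2 * (w dot w), grad + w * l2)
  }
}

// OWLQN handles the non-smooth L1 piece via its l1reg parameter.
val owlqn = new OWLQN[Int, DenseVector[Double]](100, 10, lambda * alpha)
val wOpt = owlqn.minimize(smooth, DenseVector.zeros[Double](2))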
Thanks DB and Xiangrui. Glad to know you guys are actively working on it.
Another thing: did we evaluate the accuracy loss of using Float to store values?
Currently it is Double. Using fewer bits has the benefit of reducing the memory
footprint. According to Google, they even use 16 bits (a special encoding
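Back-of-envelope arithmetic for the idea (sizes assumed: 8 bytes per Double, 4 per Float):

// Memory for one dense weight vector at the scale in this thread.
val numFeatures = 100000000L      // 100M features
val doubleBytes = numFeatures * 8 // ~800 MB per vector as Double
val floatBytes = numFeatures * 4  // ~400 MB as Float (~7 significant digits)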
Update.
I just found a magic parameter *balanceSlack* in *CoalescedRDD*, which sounds
like it could control the locality. The default value is 0.1 (a smaller value
means lower locality). I changed it to 1.0 (full locality) and used the #3
approach, then found a lot of improvement (20%~40%). Although the Web UI still
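For anyone wanting to reproduce this: CoalescedRDD is private[spark], so the sketch below assumes it is compiled under the org.apache.spark.rdd package (or a patched Spark build); the path and partition count are made up.

package org.apache.spark.rdd

import org.apache.spark.{SparkConf, SparkContext}

object BalanceSlackTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("balanceSlack-test"))
    val input = sc.textFile("hdfs:///path/to/data") // hypothetical path
    // Default balanceSlack is 0.1; 1.0 favors locality over balanced partition sizes.
    val coalesced = new CoalescedRDD(input, maxPartitions = 1000, balanceSlack = 1.0)
    println(coalesced.partitions.length)
    sc.stop()
  }
}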
Assuming that your data is very sparse, I would recommend
RDD.repartition. But if it is not the case and you don't want to
shuffle the data, you can try a CombineInputFormat and then parse the
lines into labeled points. Coalesce may cause locality problems if you
didn't use the right number of
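A sketch of the CombineInputFormat route (assumptions: a SparkContext `sc`, the new Hadoop API's CombineTextInputFormat, a space-separated "label index:value ..." line format, a made-up path, and a 512 MB target split size):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val hconf = sc.hadoopConfiguration
// Target split size: CombineTextInputFormat packs many blocks into one split.
hconf.set("mapreduce.input.fileinputformat.split.maxsize", (512L << 20).toString)

val lines = sc.newAPIHadoopFile(
  "hdfs:///path/to/data", // hypothetical path
  classOf[CombineTextInputFormat], classOf[LongWritable], classOf[Text],
  hconf
).map(_._2.toString)

// Parse "label i1:v1 i2:v2 ..." lines into sparse labeled points.
val numFeatures = 100000000
val points = lines.map { line =>
  val parts = line.split(' ')
  val (indices, values) = parts.tail.map { kv =>
    val Array(i, v) = kv.split(':')
    (i.toInt, v.toDouble)
  }.unzip
  LabeledPoint(parts.head.toDouble, Vectors.sparse(numFeatures, indices, values))
}.cache()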
Hi Xiangrui,
Thanks for your reply!
Yes, our data is very sparse, but RDD.repartition invokes
RDD.coalesce(numPartitions, shuffle = true) internally, so I think it has
the same effect as #2, right?
For CombineInputFormat, although I haven't tried it, it sounds like it
will combine multiple
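For reference, repartition really is just a shuffling coalesce; its definition in Spark's RDD.scala is essentially:

// Simplified from org.apache.spark.rdd.RDD (the real signature also
// threads an implicit Ordering[T] used by the shuffle):
def repartition(numPartitions: Int): RDD[T] =
  coalesce(numPartitions, shuffle = true)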
Sorry, I missed #2. My suggestion is the same as #2. You need to set a
bigger numPartitions to avoid hitting the integer bound (the 2G partition
limitation), at the cost of increased shuffle size per iteration. If you use a
CombineInputFormat and then cache, it will try to give you roughly the
same size per
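Rough sizing arithmetic for picking numPartitions (the cached-footprint figure is an assumption for illustration):

// Keep each cached partition well under the 2 GB limit.
val totalCachedBytes = 1.5e12    // e.g. ~1.5 TB cached footprint (assumed)
val targetPartitionBytes = 256e6 // ~256 MB per partition leaves headroom
val numPartitions = math.ceil(totalCachedBytes / targetPartitionBytes).toInt // ~6000
val balanced = points.repartition(numPartitions).cache() // points: the input RDD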
Hi all,
We are trying to use Spark MLlib to train on super large data (100M features
and 5B rows). The input data in HDFS has ~26K partitions. By default, MLlib
will create a task for every partition at each iteration. But because our
dimensionality is also very high, such a large number of tasks will
How about increasing the HDFS block size? E.g., the current value is 128M; we
could make it 512M or bigger.
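A variant that avoids rewriting the files: raise the minimum input split size so each task reads several blocks (FileInputFormat picks max(minSize, min(goalSize, blockSize))). The 512 MB figure is an assumption, and note that a split spanning multiple blocks can hurt locality:

// Ask for ~512 MB splits over the existing 128 MB blocks.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.minsize",
  (512L << 20).toString)
val fewerTasks = sc.textFile("hdfs:///path/to/data") // hypothetical path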
On Tue, Aug 12, 2014 at 11:46 AM, ZHENG, Xu-dong dong...@gmail.com wrote:
Hi all,
We are trying to use Spark MLlib to train on super large data (100M features
and 5B rows). The input data in HDFS
I think this has the same effect and issue as #1, right?
On Tue, Aug 12, 2014 at 1:08 PM, Jiusheng Chen chenjiush...@gmail.com
wrote:
How about increasing the HDFS block size? E.g., the current value is 128M; we
could make it 512M or bigger.
On Tue, Aug 12, 2014 at 11:46 AM, ZHENG, Xu-dong