Re: Can't cache RDD of collaborative filtering on MLlib

Y. Sakamoto Thu, 12 Mar 2015 00:41:44 -0700

Hello.

After I called `count()`, `userJavaRDD` and `productJavaRDD` were cached,
and processing became faster.


Thank you.


On 2015/03/10 4:05, Xiangrui Meng wrote:

cache() is lazy. The data is stored into memory after the first time
it gets materialized. So the first time you call `predict` after you
load the model back from HDFS, it still takes time to load the actual
data. The second time will be much faster. Or you can call
`userJavaRDD.count()` and `productJavaRDD.count()` explicitly to load
both into memory before you create the model. -Xiangrui
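
The lazy behaviour described above can be illustrated with a small plain-Java
sketch. It does not use Spark: `LazyCachedData`, its loader, and `loadCount()`
are hypothetical names invented for this illustration, and the in-memory list
only stands in for data read from HDFS. The toy `cache()` merely records the
request; the first action (`count()` here) is what actually loads and keeps the
data, which is roughly why calling `count()` before the first `predict` moves
the loading cost up front.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class LazyCacheDemo {

    /**
     * Toy stand-in for a lazily cached dataset. NOT Spark's API: it only
     * mimics "cache() marks, the first action materializes".
     */
    static final class LazyCachedData<T> {
        private final Supplier<List<T>> loader; // hypothetical "read from HDFS" step
        private boolean cacheRequested = false;
        private List<T> cached = null;
        private int loadCount = 0;              // how many times the loader ran

        LazyCachedData(Supplier<List<T>> loader) {
            this.loader = loader;
        }

        /** Records the intent to cache; nothing is loaded here. */
        LazyCachedData<T> cache() {
            cacheRequested = true;
            return this;
        }

        /** An "action": forces the data to be materialized. */
        long count() {
            return materialize().size();
        }

        /** Another "action", standing in for a lookup made during predict. */
        T get(int index) {
            return materialize().get(index);
        }

        int loadCount() {
            return loadCount;
        }

        private List<T> materialize() {
            if (cached != null) {
                return cached;                  // served from memory
            }
            loadCount++;                        // the expensive load happens here
            List<T> data = loader.get();
            if (cacheRequested) {
                cached = data;                  // keep it only if cache() was called
            }
            return data;
        }
    }

    public static void main(String[] args) {
        LazyCachedData<double[]> userFeatures = new LazyCachedData<>(() -> {
            List<double[]> rows = new ArrayList<>();
            rows.add(new double[] {0.1, 0.2});
            rows.add(new double[] {0.3, 0.4});
            return rows;
        });

        userFeatures.cache();
        System.out.println("loads after cache(): " + userFeatures.loadCount()); // 0

        userFeatures.count();  // first action: loads and stores the data
        System.out.println("loads after count(): " + userFeatures.loadCount()); // 1

        userFeatures.get(0);   // later lookups reuse the stored copy
        System.out.println("loads after lookup:  " + userFeatures.loadCount()); // 1
    }
}
```

In this toy model, skipping `cache()` makes every action run the loader again,
which corresponds to the "read from HDFS every time" slowness in the original
question.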

On Sun, Mar 8, 2015 at 9:43 AM, Yuichiro Sakamoto
<ks...@muc.biglobe.ne.jp> wrote:

Hello.

I am writing a collaborative filtering program using Spark,
but I am having trouble with its computation speed.

I want to implement a recommendation program using ALS (MLlib)
that runs in a separate process from the Spark job.
But access to the MatrixFactorizationModel object on HDFS is slow,
so I want to cache it, but I can't.

There are 2 processes:

process A:

   1. Create a MatrixFactorizationModel with ALS

   2. Save the following objects to HDFS
     - MatrixFactorizationModel (on RDD)
     - MatrixFactorizationModel#userFeatures(RDD)
     - MatrixFactorizationModel#productFeatures(RDD)

process B:

   1. Load the model information saved by process A.
      # In process B, the SparkContext master is set to "local"
     ==========
     // Read Model
     JavaRDD<MatrixFactorizationModel> modelRDD =
         sparkContext.objectFile("<HDFS path>");
     MatrixFactorizationModel preModel = modelRDD.first();
     // Read Model's RDD
     JavaRDD<Tuple2<Object, double[]>> productJavaRDD =
         sparkContext.objectFile("<HDFS path>");
     JavaRDD<Tuple2<Object, double[]>> userJavaRDD =
         sparkContext.objectFile("<HDFS path>");
     // Create Model
     MatrixFactorizationModel model = new MatrixFactorizationModel(
         preModel.rank(),
         JavaRDD.toRDD(userJavaRDD), JavaRDD.toRDD(productJavaRDD));
     ==========

   2. Call the "predict" method of the above MatrixFactorizationModel object.


Step 2 of process B is slow because the objects are read from
HDFS every time.
# I confirmed that the recommendation results are correct.

So I tried to cache "productJavaRDD" and "userJavaRDD" as follows,
but the "predict" method did not respond.
==========
// Read Model
JavaRDD<MatrixFactorizationModel> modelRDD =
    sparkContext.objectFile("<HDFS path>");
MatrixFactorizationModel preModel = modelRDD.first();
// Read Model's RDD
JavaRDD<Tuple2<Object, double[]>> productJavaRDD =
    sparkContext.objectFile("<HDFS path>");
JavaRDD<Tuple2<Object, double[]>> userJavaRDD =
    sparkContext.objectFile("<HDFS path>");
// Cache
productJavaRDD.cache();
userJavaRDD.cache();
// Create Model
MatrixFactorizationModel model = new MatrixFactorizationModel(
    preModel.rank(),
    JavaRDD.toRDD(userJavaRDD), JavaRDD.toRDD(productJavaRDD));
==========

I do not understand why the "predict" method froze.
Could you please tell me how to cache these objects?

Thank you.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-cache-RDD-of-collaborative-filtering-on-MLlib-tp21962.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



--
*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*
    Yuichiro SAKAMOTO
        - ks...@muc.biglobe.ne.jp
        - phonypian...@gmail.com
        - http://www2u.biglobe.ne.jp/~yuichi/
*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*

