Hello.
I called `count()` on `userJavaRDD` and `productJavaRDD`, so both were cached,
and the processing became faster.
Thank you.
On 2015/03/10 4:05, Xiangrui Meng wrote:
cache() is lazy. The data is stored into memory after the first time
it gets materialized. So the first time you call `predict` after you
load the model back from HDFS, it still takes time to load the actual
data. The second time will be much faster. Or you can call
`userJavaRDD.count()` and `productJavaRDD.count()` explicitly to load
both into memory before you create the model.
-Xiangrui
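
The lazy behaviour described above can be illustrated with a small plain-Java analogy. This is only a sketch and does not use Spark's API: the `LazyCached` class, its `loader`, and the `loadCount` counter are hypothetical names introduced for illustration. The first `get()` plays the role of the first action (such as `count()`) on a cached RDD, and later calls reuse the stored value.

```java
import java.util.function.Supplier;

// Toy stand-in for a lazily cached dataset. NOT Spark's implementation.
public class LazyCacheSketch {

    /** Wraps an expensive load and remembers the result after the first use. */
    static final class LazyCached<T> {
        private final Supplier<T> loader; // hypothetical expensive read (e.g. from HDFS)
        private T value;
        private boolean materialized = false;
        private int loadCount = 0;

        LazyCached(Supplier<T> loader) {
            this.loader = loader;
        }

        // Comparable to an action on a cached RDD:
        // loads on the first call, reuses the stored value afterwards.
        T get() {
            if (!materialized) {
                value = loader.get();
                materialized = true;
                loadCount++;
            }
            return value;
        }

        boolean isMaterialized() {
            return materialized;
        }

        int loadCount() {
            return loadCount;
        }
    }

    public static void main(String[] args) {
        // Creating the wrapper is comparable to calling cache(): nothing is loaded yet.
        LazyCached<double[]> features =
            new LazyCached<>(() -> new double[] {0.1, 0.2, 0.3});
        System.out.println(features.isMaterialized()); // prints "false"

        features.get(); // first access pays the load cost
        features.get(); // second access reuses the stored value
        System.out.println(features.loadCount());      // prints "1"
    }
}
```

In the same way, calling `count()` on each RDD right after `cache()` moves the one-time load cost to a point of your choosing, instead of the first `predict` call.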
On Sun, Mar 8, 2015 at 9:43 AM, Yuichiro Sakamoto
<ks...@muc.biglobe.ne.jp> wrote:
Hello.
I wrote a collaborative filtering program using Spark,
but I am having trouble with its calculation speed.
I want to implement a recommendation program using ALS (MLlib)
that runs as a separate process from the one that builds the model.
However, accessing the MatrixFactorizationModel object on HDFS is slow,
so I want to cache it, but I can't.
There are 2 processes:
process A:
1. Create MatrixFactorizationModel by ALS
2. Save the following objects to HDFS
- MatrixFactorizationModel (in an RDD)
- MatrixFactorizationModel#userFeatures(RDD)
- MatrixFactorizationModel#productFeatures(RDD)
process B:
1. Load model information saved by process A.
# In process B, Master of SparkContext is set to "local"
==========
// Read Model
JavaRDD<MatrixFactorizationModel> modelRDD =
    sparkContext.objectFile("<HDFS path>");
MatrixFactorizationModel preModel = modelRDD.first();
// Read Model's RDD
JavaRDD<Tuple2<Object, double[]>> productJavaRDD =
    sparkContext.objectFile("<HDFS path>");
JavaRDD<Tuple2<Object, double[]>> userJavaRDD =
    sparkContext.objectFile("<HDFS path>");
// Create Model
MatrixFactorizationModel model = new MatrixFactorizationModel(
    preModel.rank(),
    JavaRDD.toRDD(userJavaRDD), JavaRDD.toRDD(productJavaRDD));
==========
2. Call the "predict" method of the MatrixFactorizationModel object above.
Step 2 of process B is slow because the objects are read from
HDFS every time.
# I confirmed that the recommendation results are correct.
So I tried to cache "productJavaRDD" and "userJavaRDD" as follows,
but the "predict" method did not respond.
==========
// Read Model
JavaRDD<MatrixFactorizationModel> modelRDD =
    sparkContext.objectFile("<HDFS path>");
MatrixFactorizationModel preModel = modelRDD.first();
// Read Model's RDD
JavaRDD<Tuple2<Object, double[]>> productJavaRDD =
    sparkContext.objectFile("<HDFS path>");
JavaRDD<Tuple2<Object, double[]>> userJavaRDD =
    sparkContext.objectFile("<HDFS path>");
// Cache
productJavaRDD.cache();
userJavaRDD.cache();
// Create Model
MatrixFactorizationModel model = new MatrixFactorizationModel(
    preModel.rank(),
    JavaRDD.toRDD(userJavaRDD), JavaRDD.toRDD(productJavaRDD));
==========
I could not understand why the "predict" method froze.
Could you please tell me how to cache these objects?
Thank you.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-cache-RDD-of-collaborative-filtering-on-MLlib-tp21962.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
--
*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*
Yuichiro SAKAMOTO
- ks...@muc.biglobe.ne.jp
- phonypian...@gmail.com
- http://www2u.biglobe.ne.jp/~yuichi/
*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*