How to avoid being killed by the YARN node manager?

2015-03-24 Thread Yuichiro Sakamoto
Hello.

We use ALS (collaborative filtering) from Spark MLlib on YARN.
The Spark version is 1.2.0, as included in CDH 5.3.1.

1,000,000,000 records (5,000,000 users and 5,000,000 items) are used for
machine learning with ALS.
This large quantity of data increases virtual memory usage, and the YARN
node manager kills the Spark executor process.
Even though Spark launches the process again after it is killed, the new
executor is killed as well.
As a result, the whole Spark application is terminated.

# When the Spark executor process is killed, it seems that the virtual
# memory usage, increased by shuffle or disk writes, exceeds the YARN threshold.

To avoid this, we set 'yarn.nodemanager.vmem-check-enabled' to false, and
the job then finishes successfully.
However, this does not seem to be an appropriate solution.
If you know a better way to tune Spark for this case, please let me know.

The machine specifications and Spark settings are as follows (a sketch of
how the settings are applied follows the list).
1) Six machines, each with 32GB of physical memory.
2) Spark settings:
- spark.executor.memory=16g
- spark.closure.serializer=org.apache.spark.serializer.KryoSerializer
- spark.rdd.compress=true
- spark.shuffle.memoryFraction=0.4
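
For reference, a minimal sketch of applying these settings programmatically
through SparkConf (the application name is a placeholder). Note that
'yarn.nodemanager.vmem-check-enabled' mentioned above is a YARN NodeManager
property configured in yarn-site.xml, not a Spark property, so it does not
appear here.
==
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Mirrors only the settings listed above; no other values are implied.
SparkConf conf = new SparkConf()
    .setAppName("als-recommendation")  // placeholder name
    .set("spark.executor.memory", "16g")
    .set("spark.closure.serializer",
         "org.apache.spark.serializer.KryoSerializer")
    .set("spark.rdd.compress", "true")
    .set("spark.shuffle.memoryFraction", "0.4");
JavaSparkContext sc = new JavaSparkContext(conf);
==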

Thanks,
Yuichiro Sakamoto






Re: Can't cache RDD of collaborative filtering on MLlib

2015-03-12 Thread Yuichiro Sakamoto
I got an answer from a mail posted to the ML.

--- Summary ---
cache() is lazy, so call `RDD.count()` explicitly to load the RDD into
memory.
---

I tried this: the two RDDs were cached, and processing became faster.
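
For reference, a minimal sketch of that approach (the RDD names are taken
from the code in my original post; everything else is assumed):
==
// cache() only marks the RDDs for caching; it is lazy.
productJavaRDD.cache();
userJavaRDD.cache();
// count() is an action, so it forces the RDDs to actually be materialized in memory.
productJavaRDD.count();
userJavaRDD.count();
==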

Thank you.






Re: Can't cache RDD of collaborative filtering on MLlib

2015-03-10 Thread Yuichiro Sakamoto
Thank you for your reply.


 1. Which version of Spark do you use now?

  I use Spark 1.2.0 (CDH 5.3.1).


 2. Why don't you check with the Web UI whether `productJavaRDD` and
 `userJavaRDD` are cached or not?

  I checked the Spark web UI.
  The job was stuck at 1/2 (Succeeded/Total tasks).

  Here is the stack trace:

org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:840)
org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendUsers(MatrixFactorizationModel.scala:126)
...


 3. You can check whether an RDD is cached or not with the `getStorageLevel()`
 method, such as `model.userFeatures.getStorageLevel()`.

  I printed the return value of getStorageLevel() for both userFeatures and
  productFeatures; both were Memory Deserialized 1x Replicated.
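
  For completeness, a minimal sketch of that check (the variable names are
  assumed to match my earlier code):
==
// Prints e.g. "Memory Deserialized 1x Replicated" once the RDD is marked for caching.
System.out.println(model.userFeatures().getStorageLevel().description());
System.out.println(model.productFeatures().getStorageLevel().description());
==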

  I think the two RDDs were configured to be cached,
  but were not actually cached at that point. (delayed?)


Thanks,
Yuichiro Sakamoto






Can't cache RDD of collaborative filtering on MLlib

2015-03-08 Thread Yuichiro Sakamoto
Hello.

I created a collaborative filtering program using Spark,
but I have trouble with its processing speed.

I want to implement a recommendation program using ALS (MLlib) that runs
in a separate process from the one that builds the model.
However, accessing the MatrixFactorizationModel objects on HDFS is slow,
so I want to cache them, but I can't.

There are two processes:

process A:

  1. Create a MatrixFactorizationModel with ALS.

  2. Save the following objects to HDFS (a sketch follows this list):
 - MatrixFactorizationModel (wrapped in an RDD)
 - MatrixFactorizationModel#userFeatures (RDD)
 - MatrixFactorizationModel#productFeatures (RDD)
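
For reference, a minimal sketch of how process A might save these objects
(the trained `model` and `sparkContext` are assumed to already exist, and
the HDFS paths are placeholders):
==
// Wrap the model itself in a one-element RDD so it can be saved with objectFile.
sparkContext.parallelize(java.util.Collections.singletonList(model))
    .saveAsObjectFile("<HDFS path for model>");
// Save the two feature RDDs separately.
model.userFeatures().toJavaRDD().saveAsObjectFile("<HDFS path for userFeatures>");
model.productFeatures().toJavaRDD().saveAsObjectFile("<HDFS path for productFeatures>");
==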

process B:

  1. Load the model information saved by process A.
 # In process B, the master of the SparkContext is set to local.
==
// Read the model
JavaRDD<MatrixFactorizationModel> modelRDD =
    sparkContext.objectFile(<HDFS path>);
MatrixFactorizationModel preModel = modelRDD.first();
// Read the model's feature RDDs
JavaRDD<Tuple2<Object, double[]>> productJavaRDD =
    sparkContext.objectFile(<HDFS path>);
JavaRDD<Tuple2<Object, double[]>> userJavaRDD =
    sparkContext.objectFile(<HDFS path>);
// Rebuild the model
MatrixFactorizationModel model = new MatrixFactorizationModel(
    preModel.rank(),
    JavaRDD.toRDD(userJavaRDD), JavaRDD.toRDD(productJavaRDD));
==

  2. Call the predict method of the MatrixFactorizationModel object above
(a sketch follows).
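
For illustration, a minimal sketch of such a call (the user and product IDs
are placeholders):
==
// Predict the rating of one user for one product.
double rating = model.predict(userId, productId);
==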


In step 2 of process B, performance is poor because the objects are read
from HDFS every time.
# I confirmed that the recommendation results themselves are correct.

So I tried to cache productJavaRDD and userJavaRDD as follows,
but the predict method never returned.
==
// Read the model
JavaRDD<MatrixFactorizationModel> modelRDD =
    sparkContext.objectFile(<HDFS path>);
MatrixFactorizationModel preModel = modelRDD.first();
// Read the model's feature RDDs
JavaRDD<Tuple2<Object, double[]>> productJavaRDD =
    sparkContext.objectFile(<HDFS path>);
JavaRDD<Tuple2<Object, double[]>> userJavaRDD =
    sparkContext.objectFile(<HDFS path>);
// Cache the feature RDDs
productJavaRDD.cache();
userJavaRDD.cache();
// Rebuild the model
MatrixFactorizationModel model = new MatrixFactorizationModel(
    preModel.rank(),
    JavaRDD.toRDD(userJavaRDD), JavaRDD.toRDD(productJavaRDD));
==

I could not understand why the predict method froze.
Could you please tell me how to cache these objects?

Thank you.


