How to avoid being killed by the YARN node manager?
Hello. We use ALS (collaborative filtering) from Spark MLlib on YARN. The Spark version is 1.2.0, as included in CDH 5.3.1. 1,000,000,000 records (5,000,000 users and 5,000,000 items) are used for training with ALS.

These large quantities of data increase virtual memory usage, and the YARN node manager kills the Spark worker process. Even though Spark retries after the process is killed, the worker process is killed again, and as a result the whole Spark job is terminated.

# When the Spark worker process is killed, it seems that virtual memory usage,
# increased by shuffling or disk writes, exceeds the YARN threshold.

To avoid this, we currently set 'yarn.nodemanager.vmem-check-enabled' to false, and the job then finishes successfully. But this does not seem to be an appropriate solution. If you know a better way to tune Spark, please let me know.

The machine specifications and Spark settings are as follows.

1) Six machines, each with 32GB of physical memory.
2) Spark settings:
- spark.executor.memory=16g
- spark.closure.serializer=org.apache.spark.serializer.KryoSerializer
- spark.rdd.compress=true
- spark.shuffle.memoryFraction=0.4

Thanks,
Yuichiro Sakamoto

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-avoid-being-killed-by-YARN-node-manager-tp22199.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
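Rather than disabling the virtual-memory check entirely, two knobs usually keep it from firing. One is the NodeManager's allowed ratio of virtual to physical memory; a sketch, with an illustrative value (4 is a starting point for experimentation, not a recommendation; the default is 2.1):

```xml
<!-- yarn-site.xml: raise the allowed ratio of virtual to physical memory
     per container (default is 2.1) -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>
```

The other is the per-executor container size: since the container limit is the executor heap plus an off-heap overhead, it can also help to lower `spark.executor.memory` slightly and raise `spark.yarn.executor.memoryOverhead` (in MB, available in Spark 1.2), e.g. `--conf spark.yarn.executor.memoryOverhead=2048` on spark-submit, so that shuffle and off-heap usage fit inside the container.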
Re: Can't cache RDD of collaborative filtering on MLlib
I got an answer from a mail posted to the ML.

--- Summary ---
cache() is lazy, so call `RDD.count()` explicitly to load the RDD into memory.
---

I tried this: the two RDDs were cached and the speed improved. Thank you.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-cache-RDD-of-collaborative-filtering-on-MLlib-tp21962p22000.html
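For readers who hit the same thing: the laziness described here is not Spark-specific. A stdlib-only Java sketch (no Spark involved, all names hypothetical) of the same contract — declaring a pipeline does no work until a terminal, action-like operation runs, which is why `cache()` alone never loads anything:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyDemo {
    /** Returns {evaluations before the terminal op, evaluations after}. */
    static int[] run() {
        AtomicInteger evaluated = new AtomicInteger();

        // Declaring the pipeline is like cache(): it only marks work to be done.
        Stream<Integer> pipeline = List.of(1, 2, 3).stream()
                .peek(x -> evaluated.incrementAndGet());

        int before = evaluated.get();  // still 0: nothing has run yet

        // A terminal operation plays the role of RDD.count(): it forces evaluation.
        pipeline.collect(Collectors.toList());

        return new int[] { before, evaluated.get() };
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println("before=" + r[0] + " after=" + r[1]);  // before=0 after=3
    }
}
```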
Re: Can't cache RDD of collaborative filtering on MLlib
Thank you for your reply.

1. Which version of Spark do you use now?

I use Spark 1.2.0 (CDH 5.3.1).

2. Why don't you check with the web UI whether `productJavaRDD` and `userJavaRDD` are cached or not?

I checked the Spark UI. The task was stopped at 1/2 (Succeeded/Total tasks). Here is the stack trace:

org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:840)
org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendUsers(MatrixFactorizationModel.scala:126)
...

3. You can check whether an RDD is cached or not with the `getStorageLevel()` method, such as `model.userFeatures.getStorageLevel()`.

I printed the return value of getStorageLevel() for userFeatures and productFeatures; both were "Memory Deserialized 1x Replicated". I think the two variables were configured to be cached, but were not actually cached at that time. (delayed?)

Thanks,
Yuichiro Sakamoto

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-cache-RDD-of-collaborative-filtering-on-MLlib-tp21962p21990.html
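One detail worth noting here: `getStorageLevel()` reports the storage level that was *requested* via cache()/persist(), not whether any partition has actually been computed and stored yet. So seeing "Memory Deserialized 1x Replicated" only confirms the configuration. A sketch, assuming the `userJavaRDD` from the original post:

```java
// getStorageLevel() reflects the requested level as soon as cache() is called,
// even before any partition has been computed.
userJavaRDD.cache();
System.out.println(userJavaRDD.getStorageLevel());  // reports the memory level, though nothing is stored yet

// Only an action actually computes and stores the partitions; afterwards the
// Storage tab of the web UI shows the cached fraction.
userJavaRDD.count();
```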
Can't cache RDD of collaborative filtering on MLlib
Hello. I wrote a collaborative filtering program using Spark, but I have trouble with its speed.

I want to implement a recommendation program using ALS (MLlib), running as a separate process from the training job. But access to the MatrixFactorizationModel objects on HDFS is slow, so I want to cache them, but I can't.

There are 2 processes:

process A:
1. Create a MatrixFactorizationModel with ALS.
2. Save the following objects to HDFS:
- MatrixFactorizationModel (as an RDD)
- MatrixFactorizationModel#userFeatures (RDD)
- MatrixFactorizationModel#productFeatures (RDD)

process B:
1. Load the model information saved by process A.
# In process B, the master of the SparkContext is set to local.

==
// Read model
JavaRDD<MatrixFactorizationModel> modelRDD = sparkContext.objectFile(HDFS path);
MatrixFactorizationModel preModel = modelRDD.first();

// Read model's RDDs
JavaRDD<Tuple2<Object, double[]>> productJavaRDD = sparkContext.objectFile(HDFS path);
JavaRDD<Tuple2<Object, double[]>> userJavaRDD = sparkContext.objectFile(HDFS path);

// Create model
MatrixFactorizationModel model = new MatrixFactorizationModel(preModel.rank(),
        JavaRDD.toRDD(userJavaRDD), JavaRDD.toRDD(productJavaRDD));
==

2. Call the predict method of the MatrixFactorizationModel object above.

Step 2 of process B is slow because the objects are read from HDFS every time.
# I confirmed that the recommendation results are correct.

So I tried to cache productJavaRDD and userJavaRDD as follows, but there was no response from the predict method.
==
// Read model
JavaRDD<MatrixFactorizationModel> modelRDD = sparkContext.objectFile(HDFS path);
MatrixFactorizationModel preModel = modelRDD.first();

// Read model's RDDs
JavaRDD<Tuple2<Object, double[]>> productJavaRDD = sparkContext.objectFile(HDFS path);
JavaRDD<Tuple2<Object, double[]>> userJavaRDD = sparkContext.objectFile(HDFS path);

// Cache
productJavaRDD.cache();
userJavaRDD.cache();

// Create model
MatrixFactorizationModel model = new MatrixFactorizationModel(preModel.rank(),
        JavaRDD.toRDD(userJavaRDD), JavaRDD.toRDD(productJavaRDD));
==

I could not understand why the predict method froze. Could you please help me with how to cache these objects? Thank you.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-cache-RDD-of-collaborative-filtering-on-MLlib-tp21962.html
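For completeness, the fix reported elsewhere in this thread was to force materialization with an action after cache(), since cache() by itself is lazy. Applied to the snippet above, as a sketch:

```java
// cache() is lazy; an action such as count() forces the RDDs into memory
// before the model is constructed and predict() is called.
productJavaRDD.cache();
userJavaRDD.cache();
productJavaRDD.count();
userJavaRDD.count();

MatrixFactorizationModel model = new MatrixFactorizationModel(preModel.rank(),
        JavaRDD.toRDD(userJavaRDD), JavaRDD.toRDD(productJavaRDD));
```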