Clustering users according to their shopping traits
Sorry for the off-topic post, I have not found a specific MLlib forum. Please advise a good overview of using clustering algorithms to group users according to their purchase and browsing history on a web site. Thanks!
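A minimal sketch of one common approach, k-means in MLlib over per-user behaviour features; the file path, schema, and parameter values below are illustrative assumptions, not from the original post:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical input: one CSV line per user, e.g. "userId,visits,purchases,avgBasketValue".
// Features on very different scales usually need normalization before clustering.
def clusterUsers(sc: SparkContext, path: String): Unit = {
  val features = sc.textFile(path).map { line =>
    val fields = line.split(",")
    // Drop the userId and keep the numeric behaviour columns as a feature vector.
    Vectors.dense(fields.drop(1).map(_.toDouble))
  }.cache()

  // k (number of user segments) and maxIterations are arbitrary here; in practice
  // they are tuned, e.g. by comparing the within-cluster cost for several values of k.
  val model = KMeans.train(features, 5, 20)
  println(s"Within-cluster sum of squared errors: ${model.computeCost(features)}")
}
```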
Re: Spark 1.3.0: Running Pi example on YARN fails
rImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

On Fri, Apr 10, 2015 at 8:50 PM, Zhan Zhang wrote:
> Hi Zork,
>
> There is a script change in Spark 1.3 when starting Spark. You can try putting a java-opts file in your conf/ with the following contents:
>
>     -Dhdp.version=2.2.0.0-2041
>
> Please let me know whether it works or not.
>
> Thanks.
>
> Zhan Zhang
>
> On Apr 10, 2015, at 7:21 AM, Zork Sail wrote:
>
> Many thanks. Yet even setting:
>
>     spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
>     spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
>
> in SPARK_HOME/conf/spark-defaults.conf does not help; I still get exactly the same error log as before.
>
> On Fri, Apr 10, 2015 at 5:44 PM, Ted Yu wrote:
>
>> Zork:
>> See http://search-hadoop.com/m/JW1q5iQhwz1
>>
>> On Apr 10, 2015, at 5:08 AM, Zork Sail wrote:
>>
>> I have built Spark with the command:
>>
>>     mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests package
>>
>> What is missing in this command to build it for YARN?
>>
>> I have also tried the latest pre-built version with Hadoop support. In both cases I get the same errors described above. What else can be wrong? Maybe Spark 1.3.0 does not support Hadoop 2.6?
>>
>> On Fri, Apr 10, 2015 at 3:29 PM, Sean Owen wrote:
>>
>>> I see at least two possible problems: maybe you did not build Spark for YARN, and it looks like a variable hdp.version is expected in your environment but not set (this isn't specific to Spark).
>>>
>>> On Fri, Apr 10, 2015 at 6:34 AM, Zork Sail wrote:
>>> >
>>> > Please help! Completely stuck trying to run Spark 1.3.0 on YARN!
>>> > I have `Hadoop 2.6.0.2.2.0.0-2041` with `Hive 0.14.0.2.2.0.0-2041`.
>>> > After building Spark with the command:
>>> >
>>> >     mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests package
>>> >
>>> > I try to run the Pi example on YARN with the following command:
>>> >
>>> >     export HADOOP_CONF_DIR=/etc/hadoop/conf
>>> >     /var/home2/test/spark/bin/spark-submit \
>>> >       --class org.apache.spark.examples.SparkPi \
>>> >       --master yarn-cluster \
>>> >       --executor-memory 3G \
>>> >       --num-executors 50 \
>>> >       hdfs:///user/test/jars/spark-examples-1.3.0-hadoop2.4.0.jar \
>>> >       1000
>>> >
>>> > I get the exception `application_1427875242006_0029 failed 2 times due to AM Container for appattempt_1427875242006_0029_02 exited with exitCode: 1`, which in fact is `Diagnostics: Exception from container-launch.` (please see the log below).
>>> >
>>> > The application tracking URL reveals the following messages:
>>> >
>>> >     java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all
>>> >
>>> > and also:
>>> >
>>> >     Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
>>> >
>>> > I have Hadoop working fine on 4 nodes and am completely at a loss how to make Spark work on YARN. Please advise where to look; any ideas would be of great help, thank you!
>>> >
>>> >     Spark assembly has been built with Hive, including Datanucleus jars on classpath
>>> >     15/04/06 10:53:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> >     15/04/06 10:53:42 INFO impl.TimelineClientImpl: Timeline service address: http://etl-hdp-yarn.foo.bar.com:8188/ws/v1/timeline/
>>> >     15/04/06 10:53:42 INFO client.RMProxy: Connecting
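Pulling the suggestions in this thread together: the workaround is to make the `hdp.version` substitution variable visible both to the launcher and to the YARN application master. A consolidated sketch of the two files mentioned above (the version string 2.2.0.0-2041 is the one from this thread and must match your own HDP build):

```
# $SPARK_HOME/conf/java-opts
-Dhdp.version=2.2.0.0-2041

# $SPARK_HOME/conf/spark-defaults.conf
spark.driver.extraJavaOptions   -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions  -Dhdp.version=2.2.0.0-2041
```

As the thread notes, the spark-defaults.conf settings alone did not resolve the error for the original poster, so the java-opts file is worth trying as well.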
Spark 1.3.0: Running Pi example on YARN fails
I have `Hadoop 2.6.0.2.2.0.0-2041` with `Hive 0.14.0.2.2.0.0-2041`. After building Spark with the command:

    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests package

I try to run the Pi example on YARN with the following command:

    export HADOOP_CONF_DIR=/etc/hadoop/conf
    /var/home2/test/spark/bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master yarn-cluster \
      --executor-memory 3G \
      --num-executors 50 \
      hdfs:///user/test/jars/spark-examples-1.3.0-hadoop2.4.0.jar \
      1000

I get the exception `application_1427875242006_0029 failed 2 times due to AM Container for appattempt_1427875242006_0029_02 exited with exitCode: 1`, which in fact is `Diagnostics: Exception from container-launch.` (please see the log below).

The application tracking URL reveals the following messages:

    java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all

and also:

    Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster

I have Hadoop working fine on 4 nodes and am completely at a loss how to make Spark work on YARN. Please advise where to look; any ideas would be of great help, thank you!

    Spark assembly has been built with Hive, including Datanucleus jars on classpath
    15/04/06 10:53:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    15/04/06 10:53:42 INFO impl.TimelineClientImpl: Timeline service address: http://etl-hdp-yarn.foo.bar.com:8188/ws/v1/timeline/
    15/04/06 10:53:42 INFO client.RMProxy: Connecting to ResourceManager at etl-hdp-yarn.foo.bar.com/192.168.0.16:8050
    15/04/06 10:53:42 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
    15/04/06 10:53:42 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (4096 MB per container)
    15/04/06 10:53:42 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
    15/04/06 10:53:42 INFO yarn.Client: Setting up container launch context for our AM
    15/04/06 10:53:42 INFO yarn.Client: Preparing resources for our AM container
    15/04/06 10:53:43 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
    15/04/06 10:53:43 INFO yarn.Client: Uploading resource file:/var/home2/test/spark-1.3.0/assembly/target/scala-2.10/spark-assembly-1.3.0-hadoop2.6.0.jar -> hdfs://etl-hdp-nn1.foo.bar.com:8020/user/test/.sparkStaging/application_1427875242006_0029/spark-assembly-1.3.0-hadoop2.6.0.jar
    15/04/06 10:53:44 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs:/user/test/jars/spark-examples-1.3.0-hadoop2.4.0.jar
    15/04/06 10:53:44 INFO yarn.Client: Setting up the launch environment for our AM container
    15/04/06 10:53:44 INFO spark.SecurityManager: Changing view acls to: test
    15/04/06 10:53:44 INFO spark.SecurityManager: Changing modify acls to: test
    15/04/06 10:53:44 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test); users with modify permissions: Set(test)
    15/04/06 10:53:44 INFO yarn.Client: Submitting application 29 to ResourceManager
    15/04/06 10:53:44 INFO impl.YarnClientImpl: Submitted application application_1427875242006_0029
    15/04/06 10:53:45 INFO yarn.Client: Application report for application_1427875242006_0029 (state: ACCEPTED)
    15/04/06 10:53:45 INFO yarn.Client: client token: N/A
         diagnostics: N/A
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1428317623905
         final status: UNDEFINED
         tracking URL: http://etl-hdp-yarn.foo.bar.com:8088/proxy/application_1427875242006_0029/
         user: test
    15/04/06 10:53:46 INFO yarn.Client: Application report for application_1427875242006_0029 (state: ACCEPTED)
    15/04/06 10:53:47 INFO yarn.Client: Application report for application_1427875242006_0029 (state: ACCEPTED)
    15/04/06 10:53:48 INFO yarn.Client: Application report for application_1427875242006_0029 (state: ACCEPTED)
    15/04/06 10:53:49 INFO yarn.Client: Application report for application_1427875242006_0029 (state: FAILED)
    15/04/06 10:53:49 INFO yarn.Client: client token: N/A
         diagnostics: Application application_1427875242006_0029 failed 2 times due to AM Container for appattempt_1427875242006_0029_02 exited with exitCode: 1
    For more detailed output, check application tracking page: http://etl-hdp-yarn.foo.bar.com:8088/proxy/application_1427875242006_0029/ Then, click on links to logs of each attempt.
    Diagnostics: Exception from container-launch.
    Container id: container_1427875242006_0029_02_01
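The reply thread above attributes this to the unresolved `hdp.version` variable on HDP and sets `spark.driver.extraJavaOptions` and `spark.yarn.am.extraJavaOptions` in spark-defaults.conf. The same two properties can also be passed on the submit command itself with `--conf`, for example (version string taken from the thread; adjust to your HDP build):

```bash
export HADOOP_CONF_DIR=/etc/hadoop/conf
/var/home2/test/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --conf spark.driver.extraJavaOptions=-Dhdp.version=2.2.0.0-2041 \
  --conf spark.yarn.am.extraJavaOptions=-Dhdp.version=2.2.0.0-2041 \
  --executor-memory 3G \
  --num-executors 50 \
  hdfs:///user/test/jars/spark-examples-1.3.0-hadoop2.4.0.jar \
  1000
```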
MLlib: How to set preferences for ALS implicit feedback in Collaborative Filtering?
I am trying to use Spark MLlib ALS with implicit feedback for collaborative filtering. The input data has only two fields, `userId` and `productId`. I have **no product ratings**, just information on which products users have bought, that's all. So to train ALS I use:

    def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int): MatrixFactorizationModel

(http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$)

This API requires a `Rating` object:

    Rating(user: Int, product: Int, rating: Double)

On the other hand, the documentation on `trainImplicit` says: *Train a matrix factorization model given an RDD of 'implicit preferences' ratings given by users to some products, in the form of (userID, productID, **preference**) pairs.*

When I set the rating / preference to `1`, as in:

    val ratings = sc.textFile(new File(dir, file).toString).map { line =>
      val fields = line.split(",")
      // format: (randomNumber, Rating(userId, productId, rating))
      (rnd.nextInt(100), Rating(fields(0).toInt, fields(1).toInt, 1.0))
    }

    val training = ratings.filter(x => x._1 < 60)
      .values
      .repartition(numPartitions)
      .cache()
    val validation = ratings.filter(x => x._1 >= 60 && x._1 < 80)
      .values
      .repartition(numPartitions)
      .cache()
    val test = ratings.filter(x => x._1 >= 80).values.cache()

and then train ALS:

    val model = ALS.trainImplicit(ratings, rank, numIter)

I get an RMSE of 0.9, which is a large error when preferences take only the values 0 or 1:

    val validationRmse = computeRmse(model, validation, numValidation)

    /** Compute RMSE (Root Mean Squared Error). */
    def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
      val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
      val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
        .join(data.map(x => ((x.user, x.product), x.rating)))
        .values
      math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
    }

So my question is: to what value should I set `rating` in

    Rating(user: Int, product: Int, rating: Double)

for implicit training (in the `ALS.trainImplicit` method)?

**Update**

With:

    val alpha = 40
    val lambda = 0.01

I get:

    Got 1895593 ratings from 17471 users on 462685 products.
    Training: 1136079, validation: 380495, test: 379019
    RMSE (validation) = 0.7537217888106758 for the model trained with rank = 8 and numIter = 10.
    RMSE (validation) = 0.7489005441881798 for the model trained with rank = 8 and numIter = 20.
    RMSE (validation) = 0.7387672873747732 for the model trained with rank = 12 and numIter = 10.
    RMSE (validation) = 0.7310003522283959 for the model trained with rank = 12 and numIter = 20.
    The best model was trained with rank = 12 and numIter = 20, and its RMSE on the test set is 0.7302343904091481.
    baselineRmse: 0.0
    testRmse: 0.7302343904091481
    The best model improves the baseline by -Infinity%.

which is still a large error, I guess. I also get a strange baseline improvement, where the baseline model is simply the mean (1).
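For reference, a minimal sketch of the `trainImplicit` overload that exposes `lambda` and `alpha` directly, using the values from the update above; the purchase-count preference and the input file layout are illustrative assumptions, not a definitive answer to the question:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD functions (reduceByKey) on Spark 1.x
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

// Hypothetical input: one "userId,productId" line per purchase event.
def trainPurchaseModel(sc: SparkContext, path: String): MatrixFactorizationModel = {
  val implicitRatings = sc.textFile(path)
    .map { line =>
      val Array(userId, productId) = line.split(",") // assumes exactly two fields per line
      ((userId.toInt, productId.toInt), 1.0)
    }
    .reduceByKey(_ + _) // preference = how many times this user bought this product
    .map { case ((user, product), count) => Rating(user, product, count) }

  // Overload with explicit regularization (lambda) and confidence weight (alpha):
  // trainImplicit(ratings, rank, iterations, lambda, alpha)
  ALS.trainImplicit(implicitRatings, 12, 20, 0.01, 40.0)
}
```

On the puzzling baseline figure: when every held-out rating is fixed at 1.0, a baseline that always predicts the mean (1.0) has an RMSE of exactly 0, so any model's "improvement over the baseline" comes out as -Infinity%; RMSE against raw 0/1 preferences is arguably not a very informative way to compare implicit-feedback models.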