Clustering users according to their shopping traits

2015-04-14 Thread Zork Sail
Sorry for the off-topic post; I have not found a dedicated MLlib forum.
Please recommend a good overview of using clustering algorithms to group
users according to their purchase and browsing history on a web site.


Re: Spark 1.3.0: Running Pi example on YARN fails

2015-04-13 Thread Zork Sail
at java.lang.reflect.Method.invoke(
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

On Fri, Apr 10, 2015 at 8:50 PM, Zhan Zhang wrote:

  Hi Zork,

  There were some script changes in spark-1.3 that affect how Spark is started.
 You can try putting a java-opts file in your conf/ directory with the following contents.


  Please let me know whether it works or not.


  Zhan Zhang

  On Apr 10, 2015, at 7:21 AM, Zork Sail wrote:

   Many thanks.

 Yet setting:

 spark.driver.extraJavaOptions -Dhdp.version=–2041 -Dhdp.version=–2041

  in SPARK_HOME/conf/spark-defaults.conf

  does not help; I still get exactly the same error log as before ((

 On Fri, Apr 10, 2015 at 5:44 PM, Ted Yu wrote:


 On Apr 10, 2015, at 5:08 AM, Zork Sail wrote:

I have built Spark with the command:

 mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver
 -DskipTests package

  What is missing in this command to build it for YARN?

  I have also tried the latest pre-built version with Hadoop support.
  In both cases I get the same errors described above.
  What else can be wrong? Maybe Spark 1.3.0 does not support Hadoop 2.6?

 On Fri, Apr 10, 2015 at 3:29 PM, Sean Owen wrote:

 I see at least two possible problems: maybe you did not build Spark
 for YARN, and it looks like a variable hdp.version is expected in your
 environment but is not set (this isn't specific to Spark).

 On Fri, Apr 10, 2015 at 6:34 AM, Zork Sail wrote:
  Please help! Completely stuck trying to run Spark 1.3.0 on YARN!
  I have `Hadoop` with `Hive`
  After building Spark with the command:
  mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive
  -Phive-thriftserver -DskipTests package
  I try to run the Pi example on YARN with the following command:
  export HADOOP_CONF_DIR=/etc/hadoop/conf
  /var/home2/test/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --executor-memory 3G \
  --num-executors 50 \
  hdfs:///user/test/jars/spark-examples-1.3.0-hadoop2.4.0.jar \
  I get exceptions: `application_1427875242006_0029 failed 2 times due to AM
  Container for appattempt_1427875242006_0029_02 exited with exitCode: 1`
  Which in fact is `Diagnostics: Exception from container-launch.` (please see
  log below).
  The application tracking URL reveals the following messages:
  java.lang.Exception: Unknown container. Container either has not started
  or has already completed or doesn't belong to this node at all
  and also:
  Error: Could not find or load main class
  I have Hadoop working fine on 4 nodes and am completely at a loss how to make
  Spark work on YARN. Please advise where to look; any ideas would be of
  great help, thank you!
  Spark assembly has been built with Hive, including Datanucleus jars on
  15/04/06 10:53:40 WARN util.NativeCodeLoader: Unable to load
  native-hadoop library for your platform... using builtin-java classes where
  15/04/06 10:53:42 INFO impl.TimelineClientImpl: Timeline service
  15/04/06 10:53:42 INFO client.RMProxy: Connecting to ResourceManager at
  15/04/06 10:53:42 INFO yarn.Client: Requesting a new application from
  cluster with 4 NodeManagers
  15/04/06 10:53:42 INFO yarn.Client: Verifying our application has not
  requested more than the maximum memory capability of the cluster (4096 MB
  per container)
  15/04/06 10:53:42 INFO yarn.Client: Will allocate AM container, with 896
  MB memory including 384 MB overhead
  15/04/06 10:53:42 INFO yarn.Client: Setting up container launch context
  for our AM
  15/04/06 10:53:42 INFO yarn.Client: Preparing resources for our AM
  15/04/06 10:53:43 WARN shortcircuit.DomainSocketFactory: The
  short-circuit local reads feature cannot be used because libhadoop cannot be
  loaded.
  15/04/06 10:53:43 INFO yarn.Client: Uploading resource

Spark 1.3.0: Running Pi example on YARN fails

2015-04-06 Thread Zork Sail
I have `Hadoop` with `Hive`
After building Spark with the command:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive
-Phive-thriftserver -DskipTests package

I try to run the Pi example on YARN with the following command:

export HADOOP_CONF_DIR=/etc/hadoop/conf
/var/home2/test/spark/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--executor-memory 3G \
--num-executors 50 \
hdfs:///user/test/jars/spark-examples-1.3.0-hadoop2.4.0.jar \

I get exceptions: `application_1427875242006_0029 failed 2 times due to AM
Container for appattempt_1427875242006_0029_02 exited with exitCode: 1`,
which in fact is `Diagnostics: Exception from container-launch.` (please
see log below).

The application tracking URL reveals the following messages:

java.lang.Exception: Unknown container. Container either has not
started or has already completed or doesn't belong to this node at all

and also:

Error: Could not find or load main class

I have Hadoop working fine on 4 nodes and am completely at a loss how to make
Spark work on YARN. Please advise where to look; any ideas would be of
great help, thank you!

Spark assembly has been built with Hive, including Datanucleus jars on
15/04/06 10:53:40 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java classes where
15/04/06 10:53:42 INFO impl.TimelineClientImpl: Timeline service
15/04/06 10:53:42 INFO client.RMProxy: Connecting to ResourceManager at
15/04/06 10:53:42 INFO yarn.Client: Requesting a new application from
cluster with 4 NodeManagers
15/04/06 10:53:42 INFO yarn.Client: Verifying our application has not
requested more than the maximum memory capability of the cluster (4096 MB
per container)
15/04/06 10:53:42 INFO yarn.Client: Will allocate AM container, with
896 MB memory including 384 MB overhead
15/04/06 10:53:42 INFO yarn.Client: Setting up container launch context
for our AM
15/04/06 10:53:42 INFO yarn.Client: Preparing resources for our AM
15/04/06 10:53:43 WARN shortcircuit.DomainSocketFactory: The
short-circuit local reads feature cannot be used because libhadoop cannot
be loaded.
15/04/06 10:53:43 INFO yarn.Client: Uploading resource
- hdfs://
15/04/06 10:53:44 INFO yarn.Client: Source and destination file systems
are the same. Not copying
15/04/06 10:53:44 INFO yarn.Client: Setting up the launch environment
for our AM container
15/04/06 10:53:44 INFO spark.SecurityManager: Changing view acls to:
15/04/06 10:53:44 INFO spark.SecurityManager: Changing modify acls to:
15/04/06 10:53:44 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(test); users with modify permissions: Set(test)
15/04/06 10:53:44 INFO yarn.Client: Submitting application 29 to
15/04/06 10:53:44 INFO impl.YarnClientImpl: Submitted application
15/04/06 10:53:45 INFO yarn.Client: Application report for
application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:45 INFO yarn.Client:
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: default
 start time: 1428317623905
 final status: UNDEFINED
 tracking URL:
 user: test
15/04/06 10:53:46 INFO yarn.Client: Application report for
application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:47 INFO yarn.Client: Application report for
application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:48 INFO yarn.Client: Application report for
application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:49 INFO yarn.Client: Application report for
application_1427875242006_0029 (state: FAILED)
15/04/06 10:53:49 INFO yarn.Client:
 client token: N/A
 diagnostics: Application application_1427875242006_0029 failed 2
times due to AM Container for appattempt_1427875242006_0029_02 exited
with  exitCode: 1
For more detailed output, check application tracking page:,
click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1427875242006_0029_02_01

MLlib: How to set preferences for ALS implicit feedback in Collaborative Filtering?

2015-01-16 Thread Zork Sail
I am trying to use Spark MLlib ALS with implicit feedback for collaborative
filtering. The input data has only two fields, `userId` and `productId`. I have
**no product ratings**, just information on which products users have bought,
that's all. So to train ALS I use:

def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int): MatrixFactorizationModel


This API requires a `Rating` object:

Rating(user: Int, product: Int, rating: Double)

On the other hand, the documentation for `trainImplicit` says: *Train a matrix
factorization model given an RDD of 'implicit preferences' ratings given by
users to some products, in the form of (userID, productID, **preference**)
pairs.*

When I set rating / preferences to `1` as in:

val ratings = sc.textFile(new File(dir, file).toString).map { line =>
  val fields = line.split(",")
  // format: (randomNumber, Rating(userId, productId, rating))
  (rnd.nextInt(100), Rating(fields(0).toInt, fields(1).toInt, 1.0))
}

// Split on the random key: [0, 60) training, [60, 80) validation, [80, 100) test
val training = ratings.filter(x => x._1 < 60).values.cache()
val validation = ratings.filter(x => x._1 >= 60 && x._1 < 80).values.cache()
val test = ratings.filter(x => x._1 >= 80).values.cache()

And then train ALS:

val model = ALS.trainImplicit(training, rank, numIter)

I get an RMSE of 0.9, which is a large error when preferences take the value 0 or 1:

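val validationRmse = computeRmse(model, validation, numValidation)

/** Compute RMSE (Root Mean Squared Error). */
def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
  // Predict a rating for every (user, product) pair in the data
  val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
  // Join predicted and actual ratings on (user, product)
  val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
    .join(data.map(x => ((x.user, x.product), x.rating)))
    .values
  math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
}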
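
The same RMSE computation can be sketched on plain Scala collections, without Spark, to see what the number means when every held-out rating is 1.0. This is only an illustration: the object name `LocalRmse`, the sample predictions, and the choice to divide by the number of joined pairs (rather than a separately passed `n`) are assumptions, not part of the original code.

```scala
object LocalRmse {
  type Key = (Int, Int) // (user, product)

  /** RMSE over the (user, product) pairs present in both maps. */
  def rmse(predictions: Map[Key, Double], actual: Map[Key, Double]): Double = {
    // Join predicted and actual values on the (user, product) key
    val joined = actual.toSeq.flatMap { case (key, rating) =>
      predictions.get(key).map(p => (p, rating))
    }
    require(joined.nonEmpty, "no overlapping (user, product) pairs")
    val squaredErrors = joined.map { case (p, r) => (p - r) * (p - r) }
    math.sqrt(squaredErrors.sum / joined.size)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical data: every held-out purchase is encoded as 1.0
    val actual = Map((1, 10) -> 1.0, (2, 20) -> 1.0)
    // Hypothetical model scores for the same pairs
    val predictions = Map((1, 10) -> 0.25, (2, 20) -> 0.25)
    println(rmse(predictions, actual))
  }
}
```

With all targets fixed at 1.0, the RMSE only measures how far the predicted scores fall below 1 on pairs the model has not seen.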

So my question is: to what value should I set `rating` in:

Rating(user: Int, product: Int, rating: Double)

for implicit training (in the `ALS.trainImplicit` method)?
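
One possible reading of the `rating` field, assuming `trainImplicit` follows the confidence-weighted formulation of Hu, Koren and Volinsky that the MLlib collaborative filtering documentation cites: the value is not a target to reproduce but the strength of an observation (for example a purchase count). It is turned into a binary preference plus a confidence weight that grows with `alpha`. The sketch below only illustrates that arithmetic; the object and function names are hypothetical and it is not MLlib's implementation.

```scala
object ImplicitConfidence {
  /** Binary preference: 1.0 if any interaction was observed, otherwise 0.0. */
  def preference(r: Double): Double = if (r > 0) 1.0 else 0.0

  /** Confidence in that preference, growing linearly with the observation strength r. */
  def confidence(r: Double, alpha: Double): Double = 1.0 + alpha * r

  def main(args: Array[String]): Unit = {
    val alpha = 40.0
    // r = 0: no purchase observed; r = 1: bought once; r = 3: bought three times
    for (r <- Seq(0.0, 1.0, 3.0))
      println(s"r = $r -> preference = ${preference(r)}, confidence = ${confidence(r, alpha)}")
  }
}
```

Under this reading, 1.0 is a reasonable value for a single purchase, and a purchase count could be used instead if repeat purchases should weigh more.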



With:

  val alpha = 40
  val lambda = 0.01

I get:

Got 1895593 ratings from 17471 users on 462685 products.
Training: 1136079, validation: 380495, test: 379019
RMSE (validation) = 0.7537217888106758 for the model trained with rank
= 8 and numIter = 10.
RMSE (validation) = 0.7489005441881798 for the model trained with rank
= 8 and numIter = 20.
RMSE (validation) = 0.7387672873747732 for the model trained with rank
= 12 and numIter = 10.
RMSE (validation) = 0.7310003522283959 for the model trained with rank
= 12 and numIter = 20.
The best model was trained with rank = 12, and numIter = 20, and its
RMSE on the test set is 0.7302343904091481.
baselineRmse: 0.0 testRmse: 0.7302343904091481
The best model improves the baseline by -Infinity%.

This is still a large error, I guess. I also get a strange baseline
improvement, where the baseline model is simply the mean (1).
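
The `-Infinity%` figure can be reproduced with plain arithmetic. The sketch below assumes the improvement is computed as `(baselineRmse - testRmse) / baselineRmse * 100`; that formula and the names used are assumptions, since the code producing the output is not shown in this message. When every rating is 1.0, the mean is 1.0, the baseline RMSE is exactly 0.0, and the division by zero yields negative infinity in `Double` arithmetic.

```scala
object BaselineImprovement {
  /** Root mean squared error between predicted and actual values. */
  def rmse(predicted: Seq[Double], actual: Seq[Double]): Double = {
    require(predicted.length == actual.length && actual.nonEmpty)
    val squaredErrors = predicted.zip(actual).map { case (p, a) => (p - a) * (p - a) }
    math.sqrt(squaredErrors.sum / actual.length)
  }

  /** Relative improvement over the baseline, in percent (assumed formula). */
  def improvementPercent(baselineRmse: Double, testRmse: Double): Double =
    (baselineRmse - testRmse) / baselineRmse * 100

  def main(args: Array[String]): Unit = {
    val actual = Seq.fill(5)(1.0)          // every observed preference is 1.0
    val mean = actual.sum / actual.length  // baseline prediction: the mean, 1.0
    val baselineRmse = rmse(Seq.fill(actual.length)(mean), actual)
    val testRmse = 0.7302343904091481      // test RMSE reported above
    println(s"baselineRmse: $baselineRmse testRmse: $testRmse")
    println(s"improvement: ${improvementPercent(baselineRmse, testRmse)}%")
  }
}
```

A mean baseline is therefore uninformative when the held-out set contains only positive examples with the same value.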
