Clustering users according to their shopping traits

2015-04-14 Thread Zork Sail
Sorry for the off-topic post, I have not found a specific MLlib forum.
Please advise a good overview of using clustering algorithms to group
users according to their purchase and browsing history on a web site.
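
For context, here is a minimal sketch of the kind of thing I have in mind,
using MLlib KMeans on per-user purchase-count vectors (the file name, k = 10,
and the dense count encoding are all just placeholders, and an existing
SparkContext sc is assumed):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical input: one "userId,productId" purchase event per line.
val events = sc.textFile("purchases.csv").map { line =>
  val f = line.split(",")
  (f(0).toInt, f(1).toInt)
}

// Build one dense count vector per user, one slot per product id
// (assumes product ids form a small, dense index space).
val numProducts = events.map(_._2).max() + 1
val userVectors = events.groupByKey().map { case (user, products) =>
  val counts = new Array[Double](numProducts)
  products.foreach(p => counts(p) += 1.0)
  (user, Vectors.dense(counts))
}

// Cluster the users into k = 10 groups with 20 iterations.
val model = KMeans.train(userVectors.values, 10, 20)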

Thanks!


Re: Spark 1.3.0: Running Pi example on YARN fails

2015-04-13 Thread Zork Sail
:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



On Fri, Apr 10, 2015 at 8:50 PM, Zhan Zhang zzh...@hortonworks.com wrote:

  Hi Zork,

  There was a script change in spark-1.3 for startup. You
 can try putting a java-opts file in your conf/ directory with the following contents.

 -Dhdp.version=2.2.0.0-2041
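
 For example, on the machine from which you run spark-submit (assuming
 SPARK_HOME points at your Spark install), something like:

 echo '-Dhdp.version=2.2.0.0-2041' > $SPARK_HOME/conf/java-opts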


  Please let me know whether it works or not.

  Thanks.

  Zhan Zhang


  On Apr 10, 2015, at 7:21 AM, Zork Sail zorks...@gmail.com wrote:

   Many thanks.

 Yet even after setting:

 spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
 spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041

  in SPARK_HOME/conf/spark-defaults.conf

  it does not help; I still get exactly the same error log as before ((
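
 For completeness, I believe the same options can also be passed directly on
 the spark-submit command line via --conf (untested sketch, same values as
 above):

 /var/home2/test/spark/bin/spark-submit \
   --conf spark.driver.extraJavaOptions=-Dhdp.version=2.2.0.0-2041 \
   --conf spark.yarn.am.extraJavaOptions=-Dhdp.version=2.2.0.0-2041 \
   ...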

 On Fri, Apr 10, 2015 at 5:44 PM, Ted Yu yuzhih...@gmail.com wrote:

  Zork:
 See http://search-hadoop.com/m/JW1q5iQhwz1



 On Apr 10, 2015, at 5:08 AM, Zork Sail zorks...@gmail.com wrote:

I have built Spark with the command:

 mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver
 -DskipTests package

  What is missing in this command to build it for YARN?

  I have also tried the latest pre-built version with Hadoop support.
  In both cases I get the same errors described above.
  What else can be wrong? Maybe Spark 1.3.0 does not support Hadoop 2.6?

 On Fri, Apr 10, 2015 at 3:29 PM, Sean Owen so...@cloudera.com wrote:

 I see at least two possible problems: maybe you did not build Spark
 for YARN, and it looks like a variable hdp.version is expected in your
 environment but not set (this isn't specific to Spark).

 On Fri, Apr 10, 2015 at 6:34 AM, Zork Sail zorks...@gmail.com wrote:
 
  Please help! Completely stuck trying to run Spark 1.3.0 on YARN!
  I have `Hadoop 2.6.0.2.2.0.0-2041` with `Hive 0.14.0.2.2.0.0-2041`
  After building Spark with the command:
 
  mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive
  -Phive-thriftserver -DskipTests package
 
  I try to run the Pi example on YARN with the following command:
 
  export HADOOP_CONF_DIR=/etc/hadoop/conf
  /var/home2/test/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --executor-memory 3G \
  --num-executors 50 \
  hdfs:///user/test/jars/spark-examples-1.3.0-hadoop2.4.0.jar \
  1000
 
  I get exceptions: `application_1427875242006_0029 failed 2 times due to AM
  Container for appattempt_1427875242006_0029_02 exited with exitCode: 1`,
  which in fact is `Diagnostics: Exception from container-launch.` (please
  see log below).
 
  The application tracking URL reveals the following messages:
 
  java.lang.Exception: Unknown container. Container either has not
 started
  or has already completed or doesn't belong to this node at all
 
  and also:
 
  Error: Could not find or load main class
  org.apache.spark.deploy.yarn.ApplicationMaster
 
  I have Hadoop working fine on 4 nodes and am completely at a loss as to
  how to make Spark work on YARN. Please advise where to look; any ideas
  would be of great help, thank you!
 
  [log snipped; the full log appears in the original post below]

Spark 1.3.0: Running Pi example on YARN fails

2015-04-06 Thread Zork Sail
I have `Hadoop 2.6.0.2.2.0.0-2041` with `Hive 0.14.0.2.2.0.0-2041`
After building Spark with the command:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive
-Phive-thriftserver -DskipTests package

I try to run the Pi example on YARN with the following command:

export HADOOP_CONF_DIR=/etc/hadoop/conf
/var/home2/test/spark/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--executor-memory 3G \
--num-executors 50 \
hdfs:///user/test/jars/spark-examples-1.3.0-hadoop2.4.0.jar \
1000

I get exceptions: `application_1427875242006_0029 failed 2 times due to AM
Container for appattempt_1427875242006_0029_02 exited with exitCode: 1`,
which in fact is `Diagnostics: Exception from container-launch.` (please
see log below).

The application tracking URL reveals the following messages:

java.lang.Exception: Unknown container. Container either has not
started or has already completed or doesn't belong to this node at all

and also:

Error: Could not find or load main class
org.apache.spark.deploy.yarn.ApplicationMaster

I have Hadoop working fine on 4 nodes and am completely at a loss as to how
to make Spark work on YARN. Please advise where to look; any ideas would be
of great help, thank you!

Spark assembly has been built with Hive, including Datanucleus jars on
classpath
15/04/06 10:53:40 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
15/04/06 10:53:42 INFO impl.TimelineClientImpl: Timeline service
address: http://etl-hdp-yarn.foo.bar.com:8188/ws/v1/timeline/
15/04/06 10:53:42 INFO client.RMProxy: Connecting to ResourceManager at
etl-hdp-yarn.foo.bar.com/192.168.0.16:8050
15/04/06 10:53:42 INFO yarn.Client: Requesting a new application from
cluster with 4 NodeManagers
15/04/06 10:53:42 INFO yarn.Client: Verifying our application has not
requested more than the maximum memory capability of the cluster (4096 MB
per container)
15/04/06 10:53:42 INFO yarn.Client: Will allocate AM container, with
896 MB memory including 384 MB overhead
15/04/06 10:53:42 INFO yarn.Client: Setting up container launch context
for our AM
15/04/06 10:53:42 INFO yarn.Client: Preparing resources for our AM
container
15/04/06 10:53:43 WARN shortcircuit.DomainSocketFactory: The
short-circuit local reads feature cannot be used because libhadoop cannot
be loaded.
15/04/06 10:53:43 INFO yarn.Client: Uploading resource
file:/var/home2/test/spark-1.3.0/assembly/target/scala-2.10/spark-assembly-1.3.0-hadoop2.6.0.jar
-> hdfs://etl-hdp-nn1.foo.bar.com:8020/user/test/.sparkStaging/application_1427875242006_0029/spark-assembly-1.3.0-hadoop2.6.0.jar
15/04/06 10:53:44 INFO yarn.Client: Source and destination file systems
are the same. Not copying
hdfs:/user/test/jars/spark-examples-1.3.0-hadoop2.4.0.jar
15/04/06 10:53:44 INFO yarn.Client: Setting up the launch environment
for our AM container
15/04/06 10:53:44 INFO spark.SecurityManager: Changing view acls to:
test
15/04/06 10:53:44 INFO spark.SecurityManager: Changing modify acls to:
test
15/04/06 10:53:44 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(test); users with modify permissions: Set(test)
15/04/06 10:53:44 INFO yarn.Client: Submitting application 29 to
ResourceManager
15/04/06 10:53:44 INFO impl.YarnClientImpl: Submitted application
application_1427875242006_0029
15/04/06 10:53:45 INFO yarn.Client: Application report for
application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:45 INFO yarn.Client:
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: default
 start time: 1428317623905
 final status: UNDEFINED
 tracking URL:
http://etl-hdp-yarn.foo.bar.com:8088/proxy/application_1427875242006_0029/
 user: test
15/04/06 10:53:46 INFO yarn.Client: Application report for
application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:47 INFO yarn.Client: Application report for
application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:48 INFO yarn.Client: Application report for
application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:49 INFO yarn.Client: Application report for
application_1427875242006_0029 (state: FAILED)
15/04/06 10:53:49 INFO yarn.Client:
 client token: N/A
 diagnostics: Application application_1427875242006_0029 failed 2
times due to AM Container for appattempt_1427875242006_0029_02 exited
with  exitCode: 1
For more detailed output, check the application tracking page:
http://etl-hdp-yarn.foo.bar.com:8088/proxy/application_1427875242006_0029/
Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1427875242006_0029_02_01

MLlib: How to set preferences for ALS implicit feedback in Collaborative Filtering?

2015-01-16 Thread Zork Sail
I am trying to use Spark MLlib ALS with implicit feedback for collaborative
filtering. The input data has only two fields, `userId` and `productId`. I have
**no product ratings**, just info on which products users have bought,
that's all. So to train ALS I use:

def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int):
MatrixFactorizationModel

(
http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$
)

This API requires a `Rating` object:

Rating(user: Int, product: Int, rating: Double)

On the other hand, the documentation for `trainImplicit` says: *Train a matrix
factorization model given an RDD of 'implicit preferences' ratings given by
users to some products, in the form of (userID, productID, **preference**)
pairs.*

When I set the rating/preference to `1`, as in:

val ratings = sc.textFile(new File(dir, file).toString).map { line =>
  val fields = line.split(",")
  // format: (randomNumber, Rating(userId, productId, rating))
  (rnd.nextInt(100), Rating(fields(0).toInt, fields(1).toInt, 1.0))
}

val training = ratings.filter(x => x._1 < 60)
  .values
  .repartition(numPartitions)
  .cache()
val validation = ratings.filter(x => x._1 >= 60 && x._1 < 80)
  .values
  .repartition(numPartitions)
  .cache()
val test = ratings.filter(x => x._1 >= 80).values.cache()


And then train ALS:

 val model = ALS.trainImplicit(ratings, rank, numIter)

I get an RMSE of 0.9, which is a big error given that preferences take values
0 or 1:

val validationRmse = computeRmse(model, validation, numValidation)

/** Compute RMSE (Root Mean Squared Error). */
def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
  val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
  val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
    .join(data.map(x => ((x.user, x.product), x.rating)))
    .values
  math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
}

So my question is: to what value should I set `rating` in:

Rating(user: Int, product: Int, rating: Double)

for implicit training (in the `ALS.trainImplicit` method)?
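
One option I am considering is to use the interaction count as the preference
strength, which I believe is the usual encoding for implicit feedback (a
sketch, reusing the field layout above; `dir`, `file`, `rank`, and `numIter`
are as before):

import java.io.File
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val purchaseCounts = sc.textFile(new File(dir, file).toString).map { line =>
  val fields = line.split(",")
  ((fields(0).toInt, fields(1).toInt), 1.0)
}.reduceByKey(_ + _) // count repeat purchases per (user, product) pair
  .map { case ((user, product), count) => Rating(user, product, count) }

val model = ALS.trainImplicit(purchaseCounts, rank, numIter)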

**Update**

With:

  val alpha = 40
  val lambda = 0.01

I get:

Got 1895593 ratings from 17471 users on 462685 products.
Training: 1136079, validation: 380495, test: 379019
RMSE (validation) = 0.7537217888106758 for the model trained with rank
= 8 and numIter = 10.
RMSE (validation) = 0.7489005441881798 for the model trained with rank
= 8 and numIter = 20.
RMSE (validation) = 0.7387672873747732 for the model trained with rank
= 12 and numIter = 10.
RMSE (validation) = 0.7310003522283959 for the model trained with rank
= 12 and numIter = 20.
The best model was trained with rank = 12, and numIter = 20, and its
RMSE on the test set is 0.7302343904091481.
baselineRmse: 0.0 testRmse: 0.7302343904091481
The best model improves the baseline by -Infinity%.

This is still a big error, I guess. I also get a strange baseline improvement
figure, where the baseline model simply predicts the mean (1).
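
If I understand it, the -Infinity% follows from the baseline RMSE being
exactly zero: since every rating is 1.0, the mean predictor is perfect on the
held-out set. A sketch of that computation (names match the code above):

val meanRating = training.map(_.rating).mean() // 1.0 here, every rating is 1
val baselineRmse = math.sqrt(
  test.map(x => (meanRating - x.rating) * (meanRating - x.rating)).mean())
// baselineRmse == 0.0, so (baselineRmse - testRmse) / baselineRmse * 100
// evaluates to -Infinity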