Clustering users according to their shopping traits

2015-04-14 Thread Zork Sail
Sorry for the off-topic post, I have not found a specific MLlib forum.
Please advise a good overview of using clustering algorithms to group
users according to their purchase and browsing history on a web site.
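
For concreteness, a minimal sketch of the kind of pipeline I have in mind,
using MLlib KMeans on per-user feature vectors (the `userVectors` input and
all parameter values below are purely hypothetical):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical input: userVectors is an RDD[(Int, Array[Double])], one
// fixed-length vector of purchase/browsing counts per user.
val features = userVectors.map { case (_, counts) => Vectors.dense(counts) }.cache()

// Cluster users into k groups; k and maxIterations are placeholders.
val model = KMeans.train(features, 10, 20)

// Assign each user to a cluster.
val userClusters = userVectors.map { case (userId, counts) =>
  (userId, model.predict(Vectors.dense(counts)))
}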

Thanks!


Re: Spark 1.3.0: Running Pi example on YARN fails

2015-04-13 Thread Zork Sail
rImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



On Fri, Apr 10, 2015 at 8:50 PM, Zhan Zhang  wrote:

>  Hi Zork,
>
>  There is a script change in Spark 1.3 in how Spark is started. You
> can try putting a java-opts file in your conf/ directory with the following
> contents.
>
> -Dhdp.version=2.2.0.0-2041
>
>
>  Please let me know whether it works or not.
>
>  Thanks.
>
>  Zhan Zhang
>
>
>  On Apr 10, 2015, at 7:21 AM, Zork Sail  wrote:
>
>   Many thanks.
>
> Yet even after setting:
>
> spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
> spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
>
>  in SPARK_HOME/conf/spark-defaults.conf,
>
>  it does not help; I still get exactly the same error log as before. :(
>
> On Fri, Apr 10, 2015 at 5:44 PM, Ted Yu  wrote:
>
>>  Zork:
>> See http://search-hadoop.com/m/JW1q5iQhwz1
>>
>>
>>
>> On Apr 10, 2015, at 5:08 AM, Zork Sail  wrote:
>>
>> I have built Spark with the command:
>>
>> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver
>> -DskipTests package
>>
>>  What is missing in this command to build it for YARN?
>>
>>  I have also tried the latest pre-built version with Hadoop support.
>>  In both cases I get the same errors described above.
>>  What else could be wrong? Maybe Spark 1.3.0 does not support Hadoop 2.6?
>>
>> On Fri, Apr 10, 2015 at 3:29 PM, Sean Owen  wrote:
>>
>>> I see at least two possible problems: maybe you did not build Spark
>>> for YARN, and it looks like a variable hdp.version is expected in your
>>> environment but not set (this isn't specific to Spark).
>>>
>>> On Fri, Apr 10, 2015 at 6:34 AM, Zork Sail  wrote:
>>> >
>>> > Please help! Completely stuck trying to run Spark 1.3.0 on YARN!
>>> > I have `Hadoop 2.6.0.2.2.0.0-2041` with `Hive 0.14.0.2.2.0.0-2041`
>>> > After building Spark with command:
>>> >
>>> > mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive
>>> > -Phive-thriftserver -DskipTests package
>>> >
>>> > I try to run Pi example on YARN with the following command:
>>> >
>>> > export HADOOP_CONF_DIR=/etc/hadoop/conf
>>> > /var/home2/test/spark/bin/spark-submit \
>>> > --class org.apache.spark.examples.SparkPi \
>>> > --master yarn-cluster \
>>> > --executor-memory 3G \
>>> > --num-executors 50 \
>>> > hdfs:///user/test/jars/spark-examples-1.3.0-hadoop2.4.0.jar \
>>> > 1000
>>> >
>>> > I get exceptions: `application_1427875242006_0029 failed 2 times due
>>> > to AM Container for appattempt_1427875242006_0029_02 exited with
>>> > exitCode: 1`, which in fact is `Diagnostics: Exception from
>>> > container-launch.` (please see log below).
>>> >
>>> > The application tracking URL reveals the following messages:
>>> >
>>> > java.lang.Exception: Unknown container. Container either has not
>>> > started or has already completed or doesn't belong to this node at all
>>> >
>>> > and also:
>>> >
>>> > Error: Could not find or load main class
>>> > org.apache.spark.deploy.yarn.ApplicationMaster
>>> >
>>> > I have Hadoop working fine on 4 nodes and am completely at a loss how
>>> > to make Spark work on YARN. Please advise where to look; any ideas
>>> > would be of great help, thank you!
>>> >
>>> > Spark assembly has been built with Hive, including Datanucleus
>>> > jars on classpath
>>> > 15/04/06 10:53:40 WARN util.NativeCodeLoader: Unable to load
>>> > native-hadoop library for your platform... using builtin-java classes
>>> > where applicable
>>> > 15/04/06 10:53:42 INFO impl.TimelineClientImpl: Timeline service
>>> > address: http://etl-hdp-yarn.foo.bar.com:8188/ws/v1/timeline/
>>> > 15/04/06 10:53:42 INFO client.RMProxy: Connecting 

Spark 1.3.0: Running Pi example on YARN fails

2015-04-06 Thread Zork Sail
I have `Hadoop 2.6.0.2.2.0.0-2041` with `Hive 0.14.0.2.2.0.0-2041`
After building Spark with command:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive
-Phive-thriftserver -DskipTests package

I try to run Pi example on YARN with the following command:

export HADOOP_CONF_DIR=/etc/hadoop/conf
/var/home2/test/spark/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--executor-memory 3G \
--num-executors 50 \
hdfs:///user/test/jars/spark-examples-1.3.0-hadoop2.4.0.jar \
1000

I get exceptions: `application_1427875242006_0029 failed 2 times due to AM
Container for appattempt_1427875242006_0029_02 exited with exitCode: 1`,
which in fact is `Diagnostics: Exception from container-launch.` (please
see log below).

The application tracking URL reveals the following messages:

java.lang.Exception: Unknown container. Container either has not
started or has already completed or doesn't belong to this node at all

and also:

Error: Could not find or load main class
org.apache.spark.deploy.yarn.ApplicationMaster

I have Hadoop working fine on 4 nodes and am completely at a loss how to make
Spark work on YARN. Please advise where to look; any ideas would be of
great help, thank you!

Spark assembly has been built with Hive, including Datanucleus jars on
classpath
15/04/06 10:53:40 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
15/04/06 10:53:42 INFO impl.TimelineClientImpl: Timeline service
address: http://etl-hdp-yarn.foo.bar.com:8188/ws/v1/timeline/
15/04/06 10:53:42 INFO client.RMProxy: Connecting to ResourceManager at
etl-hdp-yarn.foo.bar.com/192.168.0.16:8050
15/04/06 10:53:42 INFO yarn.Client: Requesting a new application from
cluster with 4 NodeManagers
15/04/06 10:53:42 INFO yarn.Client: Verifying our application has not
requested more than the maximum memory capability of the cluster (4096 MB
per container)
15/04/06 10:53:42 INFO yarn.Client: Will allocate AM container, with
896 MB memory including 384 MB overhead
15/04/06 10:53:42 INFO yarn.Client: Setting up container launch context
for our AM
15/04/06 10:53:42 INFO yarn.Client: Preparing resources for our AM
container
15/04/06 10:53:43 WARN shortcircuit.DomainSocketFactory: The
short-circuit local reads feature cannot be used because libhadoop cannot
be loaded.
15/04/06 10:53:43 INFO yarn.Client: Uploading resource
file:/var/home2/test/spark-1.3.0/assembly/target/scala-2.10/spark-assembly-1.3.0-hadoop2.6.0.jar
-> hdfs://
etl-hdp-nn1.foo.bar.com:8020/user/test/.sparkStaging/application_1427875242006_0029/spark-assembly-1.3.0-hadoop2.6.0.jar
15/04/06 10:53:44 INFO yarn.Client: Source and destination file systems
are the same. Not copying
hdfs:/user/test/jars/spark-examples-1.3.0-hadoop2.4.0.jar
15/04/06 10:53:44 INFO yarn.Client: Setting up the launch environment
for our AM container
15/04/06 10:53:44 INFO spark.SecurityManager: Changing view acls to:
test
15/04/06 10:53:44 INFO spark.SecurityManager: Changing modify acls to:
test
15/04/06 10:53:44 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(test); users with modify permissions: Set(test)
15/04/06 10:53:44 INFO yarn.Client: Submitting application 29 to
ResourceManager
15/04/06 10:53:44 INFO impl.YarnClientImpl: Submitted application
application_1427875242006_0029
15/04/06 10:53:45 INFO yarn.Client: Application report for
application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:45 INFO yarn.Client:
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: default
 start time: 1428317623905
 final status: UNDEFINED
 tracking URL:
http://etl-hdp-yarn.foo.bar.com:8088/proxy/application_1427875242006_0029/
 user: test
15/04/06 10:53:46 INFO yarn.Client: Application report for
application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:47 INFO yarn.Client: Application report for
application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:48 INFO yarn.Client: Application report for
application_1427875242006_0029 (state: ACCEPTED)
15/04/06 10:53:49 INFO yarn.Client: Application report for
application_1427875242006_0029 (state: FAILED)
15/04/06 10:53:49 INFO yarn.Client:
 client token: N/A
 diagnostics: Application application_1427875242006_0029 failed 2
times due to AM Container for appattempt_1427875242006_0029_02 exited
with  exitCode: 1
For more detailed output, check application tracking page:
http://etl-hdp-yarn.foo.bar.com:8088/proxy/application_1427875242006_0029/
Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1427875242006_0029_02_01
   

MLlib: How to set preferences for ALS implicit feedback in Collaborative Filtering?

2015-01-16 Thread Zork Sail
I am trying to use Spark MLlib ALS with implicit feedback for collaborative
filtering. The input data has only two fields, `userId` and `productId`. I have
**no product ratings**, just info on what products users have bought,
that's all. So to train ALS I use:

def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int):
MatrixFactorizationModel

(
http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$
)

This API requires a `Rating` object:

Rating(user: Int, product: Int, rating: Double)

On the other hand, the documentation on `trainImplicit` says: *Train a matrix
factorization model given an RDD of 'implicit preferences' ratings given by
users to some products, in the form of (userID, productID, **preference**)
pairs.*

When I set the rating / preference to `1`, as in:

val ratings = sc.textFile(new File(dir, file).toString).map { line =>
  val fields = line.split(",")
  // format: (randomNumber, Rating(userId, productId, rating))
  (rnd.nextInt(100), Rating(fields(0).toInt, fields(1).toInt, 1.0))
}

 val training = ratings.filter(x => x._1 < 60)
  .values
  .repartition(numPartitions)
  .cache()
val validation = ratings.filter(x => x._1 >= 60 && x._1 < 80)
  .values
  .repartition(numPartitions)
  .cache()
val test = ratings.filter(x => x._1 >= 80).values.cache()


And then I train ALS:

val model = ALS.trainImplicit(training, rank, numIter)

I get an RMSE of 0.9, which is a big error given that preferences take values
of 0 or 1:

val validationRmse = computeRmse(model, validation, numValidation)

/** Compute RMSE (Root Mean Squared Error). */
def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
  val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
  val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
    .join(data.map(x => ((x.user, x.product), x.rating)))
    .values
  math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
}

So my question is: to what value should I set `rating` in:

Rating(user: Int, product: Int, rating: Double)

for implicit training (in `ALS.trainImplicit` method) ?
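
(For reference, there is also a `trainImplicit` overload that takes `lambda`
and `alpha` explicitly, which is what the update below tunes; a minimal sketch
of such a call, with illustrative values only:)

// Sketch only: rank = 8, iterations = 10, lambda = 0.01, alpha = 40.0 are
// illustrative values, not recommendations; `training` is the RDD[Rating] above.
val model = ALS.trainImplicit(training, 8, 10, 0.01, 40.0)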

**Update**

With:

  val alpha = 40
  val lambda = 0.01

I get:

Got 1895593 ratings from 17471 users on 462685 products.
Training: 1136079, validation: 380495, test: 379019
RMSE (validation) = 0.7537217888106758 for the model trained with rank
= 8 and numIter = 10.
RMSE (validation) = 0.7489005441881798 for the model trained with rank
= 8 and numIter = 20.
RMSE (validation) = 0.7387672873747732 for the model trained with rank
= 12 and numIter = 10.
RMSE (validation) = 0.7310003522283959 for the model trained with rank
= 12 and numIter = 20.
The best model was trained with rank = 12, and numIter = 20, and its
RMSE on the test set is 0.7302343904091481.
baselineRmse: 0.0 testRmse: 0.7302343904091481
The best model improves the baseline by -Infinity%.

This is still a big error, I guess. I also get a strange baseline improvement
figure, where the baseline model is simply the mean rating (1).
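
If the improvement is computed the usual way, (baselineRmse - testRmse) /
baselineRmse * 100 (as in Spark's MovielensALS-style examples, which I am
assuming here), then with every rating equal to 1.0 the mean-rating baseline
predicts the test set exactly, baselineRmse is 0.0, and the division blows up.
A minimal sketch of the arithmetic:

val baselineRmse = 0.0                 // mean rating 1.0 vs. all-ones test ratings
val testRmse     = 0.7302343904091481
val improvement  = (baselineRmse - testRmse) / baselineRmse * 100
// (0.0 - 0.73...) / 0.0 = Double.NegativeInfinity, hence "-Infinity%".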