tensor factorization FR

2016-06-20 Thread Roberto Pagliari
There are a number of research papers about tensor factorization and its use in 
machine learning.

Is tensor factorization on the roadmap?


RBM in mllib

2016-06-14 Thread Roberto Pagliari
Is RBM being developed?

This one is marked as resolved, but it is not:

https://issues.apache.org/jira/browse/SPARK-4251




access to nonnegative flag with ALS trainImplicit

2016-04-28 Thread Roberto Pagliari
I'm using ALS with mllib 1.5.2 in Scala.

I do not have access to the nonnegative flag in trainImplicit.

Which API is it available from?


Re: ALS setIntermediateRDDStorageLevel

2016-03-22 Thread Roberto Pagliari
I have, and it's under class ALS private.

On 22/03/2016 10:58, "Sean Owen"  wrote:

>No, it's been there since 1.1 and still is there:
>setIntermediateRDDStorageLevel. Double-check your code.
>
>On Mon, Mar 21, 2016 at 10:09 PM, Roberto Pagliari
> wrote:
>> According to this thread
>>
>> 
>>http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-ALS-question-td15420.html
>>
>> There should be a function to set intermediate storage level in ALS.
>> However, I'm getting method not found with Spark 1.6. Is it still
>>available?
>> If so, can I get to see a minimal example?
>>
>> Thank you,
>>





Re: cluster randomly re-starting jobs

2016-03-21 Thread Roberto Pagliari
Yes, you are right. The job failed and was being re-attempted.

Thank you,


From: Daniel Siegmann <daniel.siegm...@teamaol.com>
Date: Monday, 21 March 2016 21:33
To: Ted Yu <yuzhih...@gmail.com>
Cc: Roberto Pagliari <roberto.pagli...@asos.com>, "user@spark.apache.org"
Subject: Re: cluster randomly re-starting jobs

Never used Ambari and I don't know if this is your problem, but I have seen 
similar behavior. In my case, my application failed and Hadoop kicked off a 
second attempt. I didn't realize this, but when I refreshed the Spark UI, 
suddenly everything seemed reset! This is because the application ID is part of 
the URL, but not the attempt ID, so when the context for the second attempt 
starts it will be at the same URL as the context for the first job.

To verify if this is the problem you could look at the application in the 
Hadoop console (or whatever the equivalent is on Ambari) and see if there are 
multiple attempts. You can also see it in the Spark history server (under 
incomplete applications, if the second attempt is still running).

~Daniel Siegmann

On Mon, Mar 21, 2016 at 9:58 AM, Ted Yu <yuzhih...@gmail.com> wrote:
Can you provide a bit more information?

Release of Spark and YARN?

Have you checked the Spark UI / YARN job log to see if there is some clue?

Cheers

On Mon, Mar 21, 2016 at 6:21 AM, Roberto Pagliari <roberto.pagli...@asos.com> wrote:
I noticed that sometimes the spark cluster seems to restart the job completely.

In the Ambari UI (where I can check jobs/stages) everything that was done up to 
a certain point is removed, and the job is restarted.

Does anyone know what the issue could be?

Thank you,





ALS setIntermediateRDDStorageLevel

2016-03-21 Thread Roberto Pagliari
According to this thread

http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-ALS-question-td15420.html

There should be a function to set intermediate storage level in ALS. However, 
I'm getting method not found with Spark 1.6. Is it still available? If so, can 
I get to see a minimal example?

Thank you,



cluster randomly re-starting jobs

2016-03-21 Thread Roberto Pagliari
I noticed that sometimes the spark cluster seems to restart the job completely.

In the Ambari UI (where I can check jobs/stages) everything that was done up to 
a certain point is removed, and the job is restarted.

Does anyone know what the issue could be?

Thank you,



ALS update without re-computing everything

2016-03-11 Thread Roberto Pagliari
In the current implementation of ALS with implicit feedback, when new data comes 
in, it is not possible to update the user/product matrices without re-computing 
everything.

Is this feature planned, or is there any known workaround?

Thank you,



ALS trainImplicit performance

2016-02-25 Thread Roberto Pagliari
Does anyone know the maximum number of ratings ALS has been tested with 
successfully?

For example, is 1 billion ratings (nonzero entries) too much for it to work 
properly?


Thank you,


caching ratings with ALS implicit

2016-02-15 Thread Roberto Pagliari
Something not clear from the documentation is whether the ratings RDD needs to 
be cached before calling ALS trainImplicit. Would there be any performance gain?


Re: recommendations with duplicate ratings

2016-02-15 Thread Roberto Pagliari
Hi Sean,
I'm not sure what you mean by aggregate. The input of trainImplicit is an
RDD of Ratings.

I find it odd that duplicate ratings would mess with ALS in the implicit
case. It'd be nice if it didn't.


Thank you, 

On 15/02/2016 20:49, "Sean Owen"  wrote:

>I believe you need to aggregate inputs per user-item in your call. I
>am actually not sure what happens if you don't. I think it would
>compute the factors twice and one would win, so yes I think it would
>effectively be ignored.  For implicit, that wouldn't work correctly,
>so you do need to aggregate.
>
>On Mon, Feb 15, 2016 at 8:30 PM, Roberto Pagliari
> wrote:
>> What happens when duplicate user/ratings are fed into ALS (the implicit
>> version, specifically)? Are duplicates ignored?
>>
>> I'm asking because that would save me a distinct.
>>
>>
>>
>> Thank you,
>>
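A minimal sketch of the pre-aggregation Sean describes, assuming duplicate implicit 
ratings should be summed per (user, product) pair (the input RDD name and the choice 
of summing are illustrative, not part of the original exchange):

from pyspark.mllib.recommendation import ALS, Rating

# raw_ratings is assumed to be an RDD of Rating that may contain duplicate
# (user, product) pairs; summing the implicit counts is one reasonable choice.
aggregated = (raw_ratings
              .map(lambda r: ((r.user, r.product), r.rating))
              .reduceByKey(lambda a, b: a + b)
              .map(lambda kv: Rating(kv[0][0], kv[0][1], kv[1])))

model = ALS.trainImplicit(aggregated, rank=10, iterations=10)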





recommendations with duplicate ratings

2016-02-15 Thread Roberto Pagliari
What happens when duplicate user/ratings are fed into ALS (the implicit 
version, specifically)? Are duplicates ignored?

I'm asking because that would save me a distinct.



Thank you,



Re: ALS rating caching

2016-02-09 Thread Roberto Pagliari
Hi Nick,
From which version does that apply? I'm using 1.5.2.

Thank you,


From: Nick Pentreath <nick.pentre...@gmail.com>
Date: Tuesday, 9 February 2016 07:02
To: "user@spark.apache.org"
Subject: Re: ALS rating caching

In the "new" ALS intermediate RDDs (including the ratings input RDD after 
transforming to block-partitioned ratings) is cached using 
intermediateRDDStorageLevel, and you can select the final RDD storage level 
(for user and item factors) using finalRDDStorageLevel.

The old MLlib API now calls the new ALS, so the same semantics apply.

So it should not be necessary to cache the raw input RDD.

On Tue, 9 Feb 2016 at 01:48, Roberto Pagliari <roberto.pagli...@asos.com> wrote:
When using ALS from mllib, would it be better/recommended to cache the ratings 
RDD?

I'm asking because when predicting products for users (for example) it is 
recommended to cache product/user matrices.

Thank you,



ALS rating caching

2016-02-08 Thread Roberto Pagliari
When using ALS from mllib, would it be better/recommended to cache the ratings 
RDD?

I'm asking because when predicting products for users (for example) it is 
recommended to cache product/user matrices.

Thank you,



recommendProductsForUser for a subset of users

2016-02-02 Thread Roberto Pagliari
When using ALS, is it possible to use recommendProductsForUser for a subset of 
users?

Currently, productFeatures and userFeatures are val. Is there a workaround for 
it? Using recommendForUser repeatedly would not work in my case, since it would 
be too slow with many users.


Thank you,
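One possible workaround, sketched under the assumption that the product-factor matrix 
fits in driver memory (the subset, topN, and variable names are illustrative): filter 
the user factors down to the users of interest and score them against a broadcast copy 
of the product factors.

import numpy as np

wanted_users = {1, 7, 42}   # illustrative subset of user ids
topN = 10

pf = model.productFeatures().collect()        # assumes this fits on the driver
pf_ids = [pid for pid, _ in pf]
pf_mat = np.array([list(f) for _, f in pf])
pf_bc = sc.broadcast((pf_ids, pf_mat))

def top_n(user_factor):
    user, factor = user_factor
    ids, mat = pf_bc.value
    scores = mat.dot(np.array(list(factor)))  # score this user against all products
    best = np.argsort(scores)[::-1][:topN]
    return user, [(ids[i], float(scores[i])) for i in best]

subset_recs = (model.userFeatures()
               .filter(lambda uf: uf[0] in wanted_users)
               .map(top_n)
               .collect())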



is recommendProductsForUsers available in ALS?

2016-01-18 Thread Roberto Pagliari
With Spark 1.5, the following code:

from pyspark import SparkContext, SparkConf
from pyspark.mllib.recommendation import ALS, Rating
r1 = (1, 1, 1.0)
r2 = (1, 2, 2.0)
r3 = (2, 1, 2.0)
ratings = sc.parallelize([r1, r2, r3])
model = ALS.trainImplicit(ratings, 1, seed=10)

res = model.recommendProductsForUsers(2)

raises the error

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
      7 model = ALS.trainImplicit(ratings, 1, seed=10)
      8
----> 9 res = model.recommendProductsForUsers(2)

AttributeError: 'MatrixFactorizationModel' object has no attribute 'recommendProductsForUsers'

If the method is not available, is there a workaround with a large number of 
users and products?
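If the method really is missing in this PySpark version, one conceivable workaround is 
to score the two factor RDDs against each other directly. This is only a sketch (the 
topN value and variable names are illustrative) and it is expensive, since it 
materializes every user/product pair:

import numpy as np

topN = 10

# ((user, user_factor), (product, product_factor)) for every combination;
# this shuffles |users| x |products| records, so partitioning matters.
pairs = model.userFeatures().cartesian(model.productFeatures())
scored = pairs.map(lambda x: (x[0][0], (x[1][0], float(np.dot(x[0][1], x[1][1])))))
recs = (scored.groupByKey()
              .mapValues(lambda ps: sorted(ps, key=lambda t: -t[1])[:topN]))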


Re: frequent itemsets

2016-01-02 Thread Roberto Pagliari
Hi Lin,
From 1e-5 and below it crashes for me. I also developed my own program in C++ 
(single machine, no Spark) and I was able to compute all itemsets, that is, 
support = 0.

The stack overflow definitely occurs when computing the frequent itemsets, before 
association rule mining even starts. If you want, I can try to generate an artificial 
dataset to share. Did you ever try it with hundreds of millions of frequent 
itemsets?

With small datasets it works, but it looks like there might be issues when the 
number of combinations grows.

Thanks,
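For reference, the tutorial code being exercised here presumably follows the standard 
FPGrowth pattern below (a minimal PySpark sketch; the file path, minSupport, and 
numPartitions values are illustrative, not the actual settings used in this thread):

from pyspark.mllib.fpm import FPGrowth

# Each input line is one transaction: space-separated item ids.
transactions = sc.textFile("transactions.txt").map(lambda line: line.strip().split(' '))

model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
for fi in model.freqItemsets().collect():
    print fi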

From: LinChen <m2linc...@outlook.com>
Date: Saturday, 2 January 2016 14:48
To: Roberto Pagliari <roberto.pagli...@asos.com>
Cc: "user@spark.apache.org"
Subject: RE: frequent itemsets

Hi Roberto,
What is the minimum support threshold you set?
Could you check at which stage you ran into the StackOverflow exception?

Thanks.



From: roberto.pagli...@asos.com
To: yblia...@gmail.com
CC: user@spark.apache.org
Subject: Re: frequent itemsets
Date: Sat, 2 Jan 2016 12:01:31 +

Hi Yanbo,
Unfortunately, I cannot share the data. I am using the code in the tutorial

https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html

Did you ever try running it when there are hundreds of millions of co-purchases of 
at least two products?
I suspect AR does not handle that very well.

Thank you,



From: Yanbo Liang <yblia...@gmail.com>
Date: Saturday, 2 January 2016 09:03
To: Roberto Pagliari <roberto.pagli...@asos.com>
Cc: "user@spark.apache.org"
Subject: Re: frequent itemsets

Hi Roberto,

Could you share your code snippet so that others can help diagnose your 
problem?



2016-01-02 7:51 GMT+08:00 Roberto Pagliari <roberto.pagli...@asos.com>:
When using the frequent itemsets APIs, I'm running into a StackOverflow exception 
whenever there are too many combinations to deal with, too many transactions, 
and/or too many items.


Does anyone know how many transactions/items these APIs can deal with?


Thank you,




Re: frequent itemsets

2016-01-02 Thread Roberto Pagliari
Hi Yanbo,
Unfortunately, I cannot share the data. I am using the code in the tutorial

https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html

Did you ever try running it when there are hundreds of millions of co-purchases of 
at least two products?
I suspect AR does not handle that very well.

Thank you,



From: Yanbo Liang <yblia...@gmail.com>
Date: Saturday, 2 January 2016 09:03
To: Roberto Pagliari <roberto.pagli...@asos.com>
Cc: "user@spark.apache.org"
Subject: Re: frequent itemsets

Hi Roberto,

Could you share your code snippet so that others can help diagnose your 
problem?



2016-01-02 7:51 GMT+08:00 Roberto Pagliari <roberto.pagli...@asos.com>:
When using the frequent itemsets APIs, I'm running into a StackOverflow exception 
whenever there are too many combinations to deal with, too many transactions, 
and/or too many items.


Does anyone know how many transactions/items these APIs can deal with?


Thank you,




frequent itemsets

2016-01-01 Thread Roberto Pagliari
When using the frequent itemsets APIs, I'm running into a StackOverflow exception 
whenever there are too many combinations to deal with, too many transactions, 
and/or too many items.


Does anyone know how many transactions/items these APIs can deal with?


Thank you,



argparse with pyspark

2015-12-21 Thread Roberto Pagliari
Is argparse compatible with pyspark? If so, how do I provide parameters from the 
command line? It does not seem to work the usual way.


Thank you,
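argparse itself generally works in a PySpark driver script; the usual stumbling block 
is how the arguments reach the script. With spark-submit, everything after the script 
name is passed to the application unchanged, while Spark's own options (e.g. --master) 
go before it. A minimal sketch (the script and argument names are illustrative):

import argparse
from pyspark import SparkContext

parser = argparse.ArgumentParser()
parser.add_argument("--input", required=True)
parser.add_argument("--rank", type=int, default=10)
args = parser.parse_args()

sc = SparkContext(appName="als-job")
print sc.textFile(args.input).count()

Invoked, for example, as: spark-submit my_script.py --input hdfs:///path/data.txt --rank 20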



ALS predictAll does not generate all the user/item ratings

2015-12-18 Thread Roberto Pagliari
I created the following data file, data.file:

1 1
1 2
1 3
2 4
3 5
4 6
5 7
6 1
7 2
8 8

The following code:

from pyspark.mllib.recommendation import ALS, Rating

def parse_line(line):
    tokens = line.split(' ')
    return (int(tokens[0]), int(tokens[1])), 1.0

lines = sc.textFile('./data.file')
linesTest = sc.textFile('./data.file')

trainingRDD = lines.map(parse_line)\
                   .map(lambda l: Rating(l[0][0], l[0][1], l[1]))

testRDD = linesTest.map(parse_line)\
                   .map(lambda x: (x[0][0], x[0][1]))

rank = 5
numIterations = 5
model = ALS.trainImplicit(ratings=trainingRDD,
                          rank=rank,
                          iterations=numIterations)

res = model.predictAll(testRDD).collect()

for item in res: print item

produces the following output:

Rating(user=4, product=6, rating=0.6767983278562415)
Rating(user=6, product=1, rating=0.620394043421327)
Rating(user=8, product=8, rating=0.43915435032205224)
Rating(user=2, product=4, rating=0.6712931344760976)
Rating(user=1, product=2, rating=1.058575470286403)
Rating(user=1, product=1, rating=1.0710334376535875)
Rating(user=1, product=3, rating=0.7958297361341067)
Rating(user=7, product=2, rating=0.6183187594872994)
Rating(user=3, product=5, rating=0.862203908436539)
Rating(user=5, product=7, rating=0.8487787055836538)

By changing this line

res = model.predictAll(testRDD).collect()

to this:

res = model.recommendProducts(1, 10)

The output is

Rating(user=1, product=2, rating=1.0664127057236918)
Rating(user=1, product=1, rating=1.054581213757793)
Rating(user=1, product=3, rating=0.7844128375421406)
Rating(user=1, product=6, rating=0.021054889001335786)
Rating(user=1, product=7, rating=0.0190815148087915)
Rating(user=1, product=8, rating=0.016932852980070745)
Rating(user=1, product=5, rating=0.005659639719215903)
Rating(user=1, product=4, rating=-0.007570583694108901)

Why is it that most of these ratings do not show up when using predictAll?
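One point worth noting: predictAll only scores the (user, product) pairs it is handed, 
and testRDD above contains just the ten observed pairs, whereas recommendProducts(1, 10) 
ranks user 1 against every product. A sketch of scoring every combination instead 
(reusing the RDDs defined above):

# Build the full cross product of observed users and products and score all of it.
users = trainingRDD.map(lambda r: r.user).distinct()
products = trainingRDD.map(lambda r: r.product).distinct()
all_pairs = users.cartesian(products)
all_preds = model.predictAll(all_pairs).collect()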



number of blocks in ALS/recommendation API

2015-12-17 Thread Roberto Pagliari
What is the meaning of the 'blocks' input argument in the mllib ALS implementation, 
and how does it relate to the number of executors and/or the size of the input 
data?


Thank you,



ALS mllib.recommendation vs ml.recommendation

2015-12-14 Thread Roberto Pagliari
Currently, there are two implementations of ALS available: ml.recommendation.ALS 
and mllib.recommendation.ALS.

  1.  How do they differ in terms of performance?
  2.  Am I correct to assume ml.recommendation.ALS (unlike mllib) does not 
support key-value RDDs? If so, what is the reason?


Thank you,



ALS with repeated entries

2015-12-09 Thread Roberto Pagliari
What happens with ALS when the same pair of user/item appears more than once 
with either the same ratings or different ratings?


Re: Python API Documentation Mismatch

2015-12-04 Thread Roberto Pagliari
Hi Yanbo,
You mean pyspark.mllib.recommendation, right? That is the one used in the 
official tutorial.

Thank you,

From: Yanbo Liang <yblia...@gmail.com>
Date: Friday, 4 December 2015 03:17
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: Roberto Pagliari <roberto.pagli...@asos.com>, "user@spark.apache.org"
Subject: Re: Python API Documentation Mismatch

Hi Roberto,

There are two ALS implementations available:
ml.recommendation.ALS (http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.recommendation)
and mllib.recommendation.ALS (http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.recommendation).
They have different usage and methods. I know it's confusing that Spark provides 
two versions of the same algorithm. I strongly recommend using the ALS 
algorithm in the ML package.

Yanbo

2015-12-04 1:31 GMT+08:00 Felix Cheung <felixcheun...@hotmail.com>:
Please open an issue in JIRA, thanks!





On Thu, Dec 3, 2015 at 3:03 AM -0800, "Roberto Pagliari" <roberto.pagli...@asos.com> wrote:

Hello,
I believe there is a mismatch between the API documentation (1.5.2) and the 
software currently available.

Not all functions mentioned here
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.recommendation

are, in fact, available. For example, the code below from the tutorial works:

# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations)

While the alternative shown in the API documentation will not (it complains 
that ALS takes no arguments; also, by inspecting the module with Python 
utilities I could not find several methods mentioned in the API docs):

>>> df = sqlContext.createDataFrame(
...     [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
...     ["user", "item", "rating"])
>>> als = ALS(rank=10, maxIter=5)
>>> model = als.fit(df)


Thank you,




Python API Documentation Mismatch

2015-12-03 Thread Roberto Pagliari
Hello,
I believe there is a mismatch between the API documentation (1.5.2) and the 
software currently available.

Not all functions mentioned here
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.recommendation

are, in fact, available. For example, the code below from the tutorial works:

# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations)

While the alternative shown in the API documentation will not (it complains 
that ALS takes no arguments; also, by inspecting the module with Python 
utilities I could not find several methods mentioned in the API docs):

>>> df = sqlContext.createDataFrame(
...     [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
...     ["user", "item", "rating"])
>>> als = ALS(rank=10, maxIter=5)
>>> model = als.fit(df)


Thank you,
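A hedged guess at the mismatch, for anyone hitting the same thing: the DataFrame-based 
constructor shown in the API docs lives in pyspark.ml.recommendation, whereas the 
tutorial's ALS.train comes from pyspark.mllib.recommendation, so the documented snippet 
only works with the former imported. A sketch under that assumption (run in a pyspark 
shell where sqlContext already exists):

from pyspark.ml.recommendation import ALS   # DataFrame-based API

df = sqlContext.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
    ["user", "item", "rating"])
als = ALS(rank=10, maxIter=5)
model = als.fit(df)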



Jupyter configuration

2015-12-02 Thread Roberto Pagliari
Does anyone have a pointer to Jupyter configuration with pyspark? The current 
material on the IPython notebook is out of date, and Jupyter ignores IPython 
profiles.

Thank you,