MatrixFactorizationModel predict(Int, Int) API

2014-11-03 Thread Debasish Das
Hi,

I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but
the code fails on userFeatures.lookup(user).head

In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been
called and in all the test-cases that API has been used...

I can perhaps refactor my code to do the same but I was wondering whether
people test the lookup(user) version of the code..

Do I need to cache the model to make it work ? I think right now default is
STORAGE_AND_DISK...

Thanks.
Deb


Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-03 Thread Xiangrui Meng
Was "user" presented in training? We can put a check there and return
NaN if the user is not included in the model. -Xiangrui

On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das  wrote:
> Hi,
>
> I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but
> the code fails on userFeatures.lookup(user).head
>
> In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been
> called and in all the test-cases that API has been used...
>
> I can perhaps refactor my code to do the same but I was wondering whether
> people test the lookup(user) version of the code..
>
> Do I need to cache the model to make it work ? I think right now default is
> STORAGE_AND_DISK...
>
> Thanks.
> Deb

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
I reproduced the problem in mllib tests ALSSuite.scala using the following
functions:

val arrayPredict = userProductsRDD.map{case(user,product) =>

 val recommendedProducts = model.recommendProducts(user, products)

 val productScore = recommendedProducts.find{x=>x.product == product
}

  require(productScore != None)

  productScore.get

}.collect

arrayPredict.foreach { elem =>

  if (allRatings.get(elem.user, elem.product) != elem.rating)

  fail("Prediction APIs don't match")

}

If the usage of model.recommendProducts is correct, the test fails with the
same error I sent before...

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage
316.0 (TID 79, localhost): scala.MatchError: null

org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825)
 
org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81)

It is a blocker for me and I am debugging it. I will open up a JIRA if this
is indeed a bug...

Do I have to cache the models to make userFeatures.lookup(user).head to
work ?

On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng  wrote:

> Was "user" presented in training? We can put a check there and return
> NaN if the user is not included in the model. -Xiangrui
>
> On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das 
> wrote:
> > Hi,
> >
> > I am testing MatrixFactorizationModel.predict(user: Int, product: Int)
> but
> > the code fails on userFeatures.lookup(user).head
> >
> > In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been
> > called and in all the test-cases that API has been used...
> >
> > I can perhaps refactor my code to do the same but I was wondering whether
> > people test the lookup(user) version of the code..
> >
> > Do I need to cache the model to make it work ? I think right now default
> is
> > STORAGE_AND_DISK...
> >
> > Thanks.
> > Deb
>


Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Xiangrui Meng
ALS model contains RDDs. So you cannot put `model.recommendProducts`
inside a RDD closure `userProductsRDD.map`. -Xiangrui

On Thu, Nov 6, 2014 at 4:39 PM, Debasish Das  wrote:
> I reproduced the problem in mllib tests ALSSuite.scala using the following
> functions:
>
> val arrayPredict = userProductsRDD.map{case(user,product) =>
>
>  val recommendedProducts = model.recommendProducts(user, products)
>
>  val productScore = recommendedProducts.find{x=>x.product ==
> product}
>
>   require(productScore != None)
>
>   productScore.get
>
> }.collect
>
> arrayPredict.foreach { elem =>
>
>   if (allRatings.get(elem.user, elem.product) != elem.rating)
>
>   fail("Prediction APIs don't match")
>
> }
>
> If the usage of model.recommendProducts is correct, the test fails with the
> same error I sent before...
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
> stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage
> 316.0 (TID 79, localhost): scala.MatchError: null
>
> org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825)
> org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81)
>
> It is a blocker for me and I am debugging it. I will open up a JIRA if this
> is indeed a bug...
>
> Do I have to cache the models to make userFeatures.lookup(user).head to work
> ?
>
>
> On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng  wrote:
>>
>> Was "user" presented in training? We can put a check there and return
>> NaN if the user is not included in the model. -Xiangrui
>>
>> On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das 
>> wrote:
>> > Hi,
>> >
>> > I am testing MatrixFactorizationModel.predict(user: Int, product: Int)
>> > but
>> > the code fails on userFeatures.lookup(user).head
>> >
>> > In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has
>> > been
>> > called and in all the test-cases that API has been used...
>> >
>> > I can perhaps refactor my code to do the same but I was wondering
>> > whether
>> > people test the lookup(user) version of the code..
>> >
>> > Do I need to cache the model to make it work ? I think right now default
>> > is
>> > STORAGE_AND_DISK...
>> >
>> > Thanks.
>> > Deb
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
model.recommendProducts can only be called from the master then ? I have a
set of 20% users on whom I am performing the test...the 20% users are in a
RDD...if I have to collect them all to master node and then call
model.recommendProducts, that's a issue...

Any idea how to optimize this so that we can calculate MAP statistics on
large samples of data ?


On Thu, Nov 6, 2014 at 4:41 PM, Xiangrui Meng  wrote:

> ALS model contains RDDs. So you cannot put `model.recommendProducts`
> inside a RDD closure `userProductsRDD.map`. -Xiangrui
>
> On Thu, Nov 6, 2014 at 4:39 PM, Debasish Das 
> wrote:
> > I reproduced the problem in mllib tests ALSSuite.scala using the
> following
> > functions:
> >
> > val arrayPredict = userProductsRDD.map{case(user,product) =>
> >
> >  val recommendedProducts = model.recommendProducts(user,
> products)
> >
> >  val productScore = recommendedProducts.find{x=>x.product ==
> > product}
> >
> >   require(productScore != None)
> >
> >   productScore.get
> >
> > }.collect
> >
> > arrayPredict.foreach { elem =>
> >
> >   if (allRatings.get(elem.user, elem.product) != elem.rating)
> >
> >   fail("Prediction APIs don't match")
> >
> > }
> >
> > If the usage of model.recommendProducts is correct, the test fails with
> the
> > same error I sent before...
> >
> > org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 0 in
> > stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage
> > 316.0 (TID 79, localhost): scala.MatchError: null
> >
> > org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825)
> >
> org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81)
> >
> > It is a blocker for me and I am debugging it. I will open up a JIRA if
> this
> > is indeed a bug...
> >
> > Do I have to cache the models to make userFeatures.lookup(user).head to
> work
> > ?
> >
> >
> > On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng  wrote:
> >>
> >> Was "user" presented in training? We can put a check there and return
> >> NaN if the user is not included in the model. -Xiangrui
> >>
> >> On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das 
> >> wrote:
> >> > Hi,
> >> >
> >> > I am testing MatrixFactorizationModel.predict(user: Int, product: Int)
> >> > but
> >> > the code fails on userFeatures.lookup(user).head
> >> >
> >> > In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has
> >> > been
> >> > called and in all the test-cases that API has been used...
> >> >
> >> > I can perhaps refactor my code to do the same but I was wondering
> >> > whether
> >> > people test the lookup(user) version of the code..
> >> >
> >> > Do I need to cache the model to make it work ? I think right now
> default
> >> > is
> >> > STORAGE_AND_DISK...
> >> >
> >> > Thanks.
> >> > Deb
> >
> >
>


Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Xiangrui Meng
There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-3066

The easiest case is when one side is small. If both sides are large,
this is a super-expensive operation. We can do block-wise cross
product and then find top-k for each user.

Best,
Xiangrui

On Thu, Nov 6, 2014 at 4:51 PM, Debasish Das  wrote:
> model.recommendProducts can only be called from the master then ? I have a
> set of 20% users on whom I am performing the test...the 20% users are in a
> RDD...if I have to collect them all to master node and then call
> model.recommendProducts, that's a issue...
>
> Any idea how to optimize this so that we can calculate MAP statistics on
> large samples of data ?
>
>
> On Thu, Nov 6, 2014 at 4:41 PM, Xiangrui Meng  wrote:
>>
>> ALS model contains RDDs. So you cannot put `model.recommendProducts`
>> inside a RDD closure `userProductsRDD.map`. -Xiangrui
>>
>> On Thu, Nov 6, 2014 at 4:39 PM, Debasish Das 
>> wrote:
>> > I reproduced the problem in mllib tests ALSSuite.scala using the
>> > following
>> > functions:
>> >
>> > val arrayPredict = userProductsRDD.map{case(user,product) =>
>> >
>> >  val recommendedProducts = model.recommendProducts(user,
>> > products)
>> >
>> >  val productScore = recommendedProducts.find{x=>x.product ==
>> > product}
>> >
>> >   require(productScore != None)
>> >
>> >   productScore.get
>> >
>> > }.collect
>> >
>> > arrayPredict.foreach { elem =>
>> >
>> >   if (allRatings.get(elem.user, elem.product) != elem.rating)
>> >
>> >   fail("Prediction APIs don't match")
>> >
>> > }
>> >
>> > If the usage of model.recommendProducts is correct, the test fails with
>> > the
>> > same error I sent before...
>> >
>> > org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> > 0 in
>> > stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage
>> > 316.0 (TID 79, localhost): scala.MatchError: null
>> >
>> > org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825)
>> >
>> > org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81)
>> >
>> > It is a blocker for me and I am debugging it. I will open up a JIRA if
>> > this
>> > is indeed a bug...
>> >
>> > Do I have to cache the models to make userFeatures.lookup(user).head to
>> > work
>> > ?
>> >
>> >
>> > On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng  wrote:
>> >>
>> >> Was "user" presented in training? We can put a check there and return
>> >> NaN if the user is not included in the model. -Xiangrui
>> >>
>> >> On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das 
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > I am testing MatrixFactorizationModel.predict(user: Int, product:
>> >> > Int)
>> >> > but
>> >> > the code fails on userFeatures.lookup(user).head
>> >> >
>> >> > In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has
>> >> > been
>> >> > called and in all the test-cases that API has been used...
>> >> >
>> >> > I can perhaps refactor my code to do the same but I was wondering
>> >> > whether
>> >> > people test the lookup(user) version of the code..
>> >> >
>> >> > Do I need to cache the model to make it work ? I think right now
>> >> > default
>> >> > is
>> >> > STORAGE_AND_DISK...
>> >> >
>> >> > Thanks.
>> >> > Deb
>> >
>> >
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-10 Thread Debasish Das
I tested 2 different implementations to generate the predicted ranked
list...The first version uses a cartesian of user and product features and
then generates a predicted value for each (user,product) key...

The second version does a collect on the skinny matrix (most likely
products) and then broadcasts it to every node which computes the predicted
value...

cartesian is slower than the broadcast version...but in the broadcast
version also the shuffle time is significant..Bottleneck is the groupBy on
(user,product) composite key followed by local sort to generate topK...

The third version I thought of was to use topK predict API but this works
only if topK is bounded by a small number...If topK is large (say 100K) it
does not work since then it is bounded by master memory...

The block-wise cross product idea will optimize the groupBy right ? we
break user and feature matrices into blocks (re-use ALS partitioning) and
then in place of using (user,product) as a key use (userBlock,
productBlock) as key...Does this help improve in shuffle size ?


On Thu, Nov 6, 2014 at 5:07 PM, Xiangrui Meng  wrote:

> There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-3066
>
> The easiest case is when one side is small. If both sides are large,
> this is a super-expensive operation. We can do block-wise cross
> product and then find top-k for each user.
>
> Best,
> Xiangrui
>
> On Thu, Nov 6, 2014 at 4:51 PM, Debasish Das 
> wrote:
> > model.recommendProducts can only be called from the master then ? I have
> a
> > set of 20% users on whom I am performing the test...the 20% users are in
> a
> > RDD...if I have to collect them all to master node and then call
> > model.recommendProducts, that's a issue...
> >
> > Any idea how to optimize this so that we can calculate MAP statistics on
> > large samples of data ?
> >
> >
> > On Thu, Nov 6, 2014 at 4:41 PM, Xiangrui Meng  wrote:
> >>
> >> ALS model contains RDDs. So you cannot put `model.recommendProducts`
> >> inside a RDD closure `userProductsRDD.map`. -Xiangrui
> >>
> >> On Thu, Nov 6, 2014 at 4:39 PM, Debasish Das 
> >> wrote:
> >> > I reproduced the problem in mllib tests ALSSuite.scala using the
> >> > following
> >> > functions:
> >> >
> >> > val arrayPredict = userProductsRDD.map{case(user,product) =>
> >> >
> >> >  val recommendedProducts = model.recommendProducts(user,
> >> > products)
> >> >
> >> >  val productScore = recommendedProducts.find{x=>x.product ==
> >> > product}
> >> >
> >> >   require(productScore != None)
> >> >
> >> >   productScore.get
> >> >
> >> > }.collect
> >> >
> >> > arrayPredict.foreach { elem =>
> >> >
> >> >   if (allRatings.get(elem.user, elem.product) != elem.rating)
> >> >
> >> >   fail("Prediction APIs don't match")
> >> >
> >> > }
> >> >
> >> > If the usage of model.recommendProducts is correct, the test fails
> with
> >> > the
> >> > same error I sent before...
> >> >
> >> > org.apache.spark.SparkException: Job aborted due to stage failure:
> Task
> >> > 0 in
> >> > stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in
> stage
> >> > 316.0 (TID 79, localhost): scala.MatchError: null
> >> >
> >> >
> org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825)
> >> >
> >> >
> org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81)
> >> >
> >> > It is a blocker for me and I am debugging it. I will open up a JIRA if
> >> > this
> >> > is indeed a bug...
> >> >
> >> > Do I have to cache the models to make userFeatures.lookup(user).head
> to
> >> > work
> >> > ?
> >> >
> >> >
> >> > On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng 
> wrote:
> >> >>
> >> >> Was "user" presented in training? We can put a check there and return
> >> >> NaN if the user is not included in the model. -Xiangrui
> >> >>
> >> >> On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das <
> debasish.da...@gmail.com>
> >> >> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > I am testing MatrixFactorizationModel.predict(user: Int, product:
> >> >> > Int)
> >> >> > but
> >> >> > the code fails on userFeatures.lookup(user).head
> >> >> >
> >> >> > In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)])
> has
> >> >> > been
> >> >> > called and in all the test-cases that API has been used...
> >> >> >
> >> >> > I can perhaps refactor my code to do the same but I was wondering
> >> >> > whether
> >> >> > people test the lookup(user) version of the code..
> >> >> >
> >> >> > Do I need to cache the model to make it work ? I think right now
> >> >> > default
> >> >> > is
> >> >> > STORAGE_AND_DISK...
> >> >> >
> >> >> > Thanks.
> >> >> > Deb
> >> >
> >> >
> >
> >
>