Re: JobConf and ClassPath

2013-04-11 Thread Cyril Bogus
Hi, I am trying to use the Mahout jar instead of compiling it with my code.


On Tue, Apr 9, 2013 at 6:01 PM, Dominik Hübner cont...@dhuebner.com wrote:

 Try adding this to your pom file

 <build>
   <plugins>
     <plugin>
       <groupId>org.apache.maven.plugins</groupId>
       <artifactId>maven-assembly-plugin</artifactId>
       <executions>
         <execution>
           <id>my-jar-with-dependencies</id>
           <phase>package</phase>
           <goals>
             <goal>single</goal>
           </goals>
           <configuration>
             <descriptorRefs>
               <descriptorRef>jar-with-dependencies</descriptorRef>
             </descriptorRefs>
           </configuration>
         </execution>
       </executions>
     </plugin>
     <plugin>
       <groupId>org.apache.maven.plugins</groupId>
       <artifactId>maven-jar-plugin</artifactId>
     </plugin>
   </plugins>
 </build>
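
 With this in place, running "mvn package" should also produce a
 target/*-jar-with-dependencies.jar that bundles Mahout and your other
 dependencies, which you can then put on the Hadoop classpath.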


 On Apr 9, 2013, at 11:42 PM, Cyril Bogus cyrilbo...@gmail.com wrote:

  To Suneel,
 
  I just ran some code using the Google Collections classes and it is working
  fine, so I know it is included.
 
  To Dominik,
 
  You might be right. That would explain why it works in pseudo-distributed
  mode, but when I try it on the cluster it does not know where to look
  anymore.
 
 
  On Tue, Apr 9, 2013 at 5:30 PM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:




Re: cross recommender

2013-04-11 Thread Pat Ferrel
Getting this running with co-occurrence rather than using a similarity calc on 
user rows finally forced me to understand what is going on in the base 
recommender. And the answer implies further work.

[B'B] is usually not calculated in the usual item based recommender. The matrix 
that comes out of RowSimilarityJob looking at the purchases input matrix (rows 
= user) is used. This can be a co-occurrence matrix but is actually a 
log-likelihood similarity matrix in my case (substitute your favorite 
similarity measure). 

RowSimilarity works if the rows of one matrix are identical to the columns of 
the other. However, when calculating the similarity version of the 
co-occurrence matrix corresponding to [B'A], you need to look at the similarity 
of a row in B with all rows in A. This gives us the analog of the similarity 
matrix in the standard recommender. 

All is clear if I have this right. So a better generalization of the algorithm 
would use the similarity of rows in B to all rows in A. Renaming [B'A] to S_ba 
for clarity, S_ba would be the similarity matrix calculated from cross 
comparisons of rows/users.
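
In symbols (my own sketch of the algebra under discussion; h_b and h_a are a 
user's history vectors over the items of B and A respectively, a notation not 
in the original thread):

  standard recommender:  r = [B'B] h_b   (or S_b  h_b with a similarity matrix)
  cross recommender:     r = [B'A] h_a   (or S_ba h_a, as renamed above)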

This is fundamentally a new Mahout job type AFAIK. It's an important question 
to me because when we looked at similarity measures, log-likelihood gave us 
considerably better scores in the standard recommender. Also looking at the 
values in our [B'A] product I suspect it is not sparsified enough, which would 
be a desired side-effect of using similarity instead of co-occurrence. Also the 
values are not normalized in the same way as the general recommender so they 
can't be linearly combined with it.

Do I have to create a SimilarityJob( matrixB, matrixA, similarityType ) to get 
this or have I missed something already in Mahout?


On Apr 8, 2013, at 2:31 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 So calculating [B'A] seems like TransposeJob and MultiplyJob and does seem
 to work. You lose the ability to substitute different RowSimilarityJob
 measures. I assume this creates something like the co-occurrence similarity
 measure. But oh, well. Maybe I'll look at that later.
 

Yes.  Exactly.



Is Mahout the right tool to recommend cross sales?

2013-04-11 Thread Billy
I am very new to Mahout and have currently just read up to chapter 5 of 'MIA',
but after reading about the various user-centric and item-centric recommenders
they all seem to still need a userId, so I'm still unsure whether Mahout can
help with a fairly common recommendation.

My requirement is to produce 'n' item recommendations based on a chosen
item.

E.g. if I've added item #1 to my order then, based on all the other items in
all the other orders for this site, what are the likely items that I may also
want to add to my order, given the item-to-item relationships in the history
of orders for this site?

Most probably using the most popular relationship between the item I have
chosen and all the items in all the other orders.

My data is not 'user' specific (and I don't think it should be) but rather
order specific, as it's the pattern of items in each order that should
determine the recommendation.

I have no preference values, so merely boolean preferences will be used.

If Mahout can perform these calculations then how must I present the data?

Will I need to shape the data in some way to feed into Mahout (currently
versed in using Hadoop via AWS EMR with Java)?

Thanks for the advice in advance,

Billy


Re: Is Mahout the right tool to recommend cross sales?

2013-04-11 Thread Sean Owen
This sounds like just a most-similar-items problem. That's good news
because that's simpler. The only question is how you want to compute
item-item similarities. That could be based on user-item interactions.
If you're on Hadoop, try the RowSimilarityJob (where you will need
rows to be items, columns the users).

On Thu, Apr 11, 2013 at 6:11 PM, Billy b...@ntlworld.com wrote:
 I am very new to Mahout and have currently just read up to chapter 5 of 'MIA',
 but after reading about the various user-centric and item-centric recommenders
 they all seem to still need a userId, so I'm still unsure whether Mahout can
 help with a fairly common recommendation.

 My requirement is to produce 'n' item recommendations based on a chosen
 item.

 E.g. if I've added item #1 to my order then, based on all the other items in
 all the other orders for this site, what are the likely items that I may also
 want to add to my order, given the item-to-item relationships in the history
 of orders for this site?

 Most probably using the most popular relationship between the item I have
 chosen and all the items in all the other orders.

 My data is not 'user' specific (and I don't think it should be) but rather
 order specific, as it's the pattern of items in each order that should
 determine the recommendation.

 I have no preference values, so merely boolean preferences will be used.

 If Mahout can perform these calculations then how must I present the data?

 Will I need to shape the data in some way to feed into Mahout (currently
 versed in using Hadoop via AWS EMR with Java)?

 Thanks for the advice in advance,

 Billy


Re: Is Mahout the right tool to recommend cross sales?

2013-04-11 Thread Pat Ferrel
Or you may want to look at recording purchases by user ID. Then use the 
standard recommender to train on (userID, itemID, boolean). Then query the 
trained recommender thus: recommender.mostSimilarItems(long itemID, int 
howMany). This does what you want but uses more data than just which items were 
purchased together; it sounds like a shopping-cart recommender.
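
A minimal in-memory sketch of this (the CSV file name and item ID are
hypothetical; the Taste classes are from Mahout 0.7/0.8, so double-check
against your version):

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class MostSimilarItemsSketch {
  public static void main(String[] args) throws Exception {
    // Boolean data: each line of the file is just "userID,itemID".
    DataModel model = new FileDataModel(new File("orders.csv"));
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, new LogLikelihoodSimilarity(model));
    // No user ID needed: ask directly for the 5 items most similar to item 1.
    for (RecommendedItem item : recommender.mostSimilarItems(1L, 5)) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}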

On Apr 11, 2013, at 10:28 AM, Sean Owen sro...@gmail.com wrote:

This sounds like just a most-similar-items problem. That's good news
because that's simpler. The only question is how you want to compute
item-item similarities. That could be based on user-item interactions.
If you're on Hadoop, try the RowSimilarityJob (where you will need
rows to be items, columns the users).

On Thu, Apr 11, 2013 at 6:11 PM, Billy b...@ntlworld.com wrote:
 I am very new to Mahout and have currently just read up to chapter 5 of 'MIA',
 but after reading about the various user-centric and item-centric recommenders
 they all seem to still need a userId, so I'm still unsure whether Mahout can
 help with a fairly common recommendation.

 My requirement is to produce 'n' item recommendations based on a chosen
 item.

 E.g. if I've added item #1 to my order then, based on all the other items in
 all the other orders for this site, what are the likely items that I may also
 want to add to my order, given the item-to-item relationships in the history
 of orders for this site?

 Most probably using the most popular relationship between the item I have
 chosen and all the items in all the other orders.

 My data is not 'user' specific (and I don't think it should be) but rather
 order specific, as it's the pattern of items in each order that should
 determine the recommendation.

 I have no preference values, so merely boolean preferences will be used.

 If Mahout can perform these calculations then how must I present the data?

 Will I need to shape the data in some way to feed into Mahout (currently
 versed in using Hadoop via AWS EMR with Java)?
 
 Thanks for the advice in advance,
 
 Billy



Re: cross recommender

2013-04-11 Thread Sebastian Schelter
 Do I have to create a SimilarityJob( matrixB, matrixA, similarityType
) to get this or have I missed something already in Mahout?

It could be worth investigating whether MatrixMultiplicationJob could
be extended to compute similarities instead of dot products.

Best,
Sebastian


Re: Is Mahout the right tool to recommend cross sales?

2013-04-11 Thread Sebastian Schelter
Use ItemSimilarityJob instead of RowSimilarityJob, it's the easy-to-use
wrapper around that :)
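
For example, a sketch of driving it from Java (the HDFS paths are hypothetical
and the flag names are from memory, so check "mahout itemsimilarity --help"):

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob;

public class ItemSimilaritySketch {
  public static void main(String[] args) throws Exception {
    // Input: text lines of "userID,itemID[,prefValue]".
    // Output: "itemIDA,itemIDB,similarity" per line.
    ToolRunner.run(new ItemSimilarityJob(), new String[] {
        "--input", "/orders",                  // hypothetical HDFS path
        "--output", "/item-similarities",      // hypothetical HDFS path
        "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD",
        "--booleanData", "true"                // no preference values
    });
  }
}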

On 11.04.2013 19:28, Sean Owen wrote:
 This sounds like just a most-similar-items problem. That's good news
 because that's simpler. The only question is how you want to compute
 item-item similarities. That could be based on user-item interactions.
 If you're on Hadoop, try the RowSimilarityJob (where you will need
 rows to be items, columns the users).
 
 On Thu, Apr 11, 2013 at 6:11 PM, Billy b...@ntlworld.com wrote:
 I am very new to Mahout and have currently just read up to chapter 5 of 'MIA',
 but after reading about the various user-centric and item-centric recommenders
 they all seem to still need a userId, so I'm still unsure whether Mahout can
 help with a fairly common recommendation.

 My requirement is to produce 'n' item recommendations based on a chosen
 item.

 E.g. if I've added item #1 to my order then, based on all the other items in
 all the other orders for this site, what are the likely items that I may also
 want to add to my order, given the item-to-item relationships in the history
 of orders for this site?

 Most probably using the most popular relationship between the item I have
 chosen and all the items in all the other orders.

 My data is not 'user' specific (and I don't think it should be) but rather
 order specific, as it's the pattern of items in each order that should
 determine the recommendation.

 I have no preference values, so merely boolean preferences will be used.

 If Mahout can perform these calculations then how must I present the data?

 Will I need to shape the data in some way to feed into Mahout (currently
 versed in using Hadoop via AWS EMR with Java)?

 Thanks for the advice in advance,

 Billy



Re: Is Mahout the right tool to recommend cross sales?

2013-04-11 Thread Sean Owen
You can try treating your orders as the 'users'. Then just compute
item-item similarities per usual.

On Thu, Apr 11, 2013 at 7:59 PM, Billy b...@ntlworld.com wrote:
 Thanks for replying,


 I don't have users (well, I do :-)), but in this case they should not
 influence the recommendations; these need to be based on the relationship
 between items ordered with other items in the 'same order'.

 E.g. if item #1 has been ordered with item #4 [22] times and item #1 has been
 ordered with item #9 [57] times, then if I added item #1 to my order these
 would both be recommended, but item #9 would be recommended above item #4
 purely based on the fact that the relationship between item #1 and item #9 is
 greater than the relationship with item #4.

 What I don't want is: if a user ordered items #A, #B, #C separately 'at some
 point in their order history' then recommend #A and #C to other users who
 order #B ... I still don't want this even if the items are similar and/or the
 users are similar.

 Cheers

 Billy



 On 11 Apr 2013 18:28, Sean Owen sro...@gmail.com wrote:

 This sounds like just a most-similar-items problem. That's good news
 because that's simpler. The only question is how you want to compute
 item-item similarities. That could be based on user-item interactions.
 If you're on Hadoop, try the RowSimilarityJob (where you will need
 rows to be items, columns the users).

 On Thu, Apr 11, 2013 at 6:11 PM, Billy b...@ntlworld.com wrote:
  I am very new to Mahout and have currently just read up to chapter 5 of
  'MIA', but after reading about the various user-centric and item-centric
  recommenders they all seem to still need a userId, so I'm still unsure
  whether Mahout can help with a fairly common recommendation.

  My requirement is to produce 'n' item recommendations based on a chosen
  item.

  E.g. if I've added item #1 to my order then, based on all the other items in
  all the other orders for this site, what are the likely items that I may
  also want to add to my order, given the item-to-item relationships in the
  history of orders for this site?

  Most probably using the most popular relationship between the item I have
  chosen and all the items in all the other orders.

  My data is not 'user' specific (and I don't think it should be) but rather
  order specific, as it's the pattern of items in each order that should
  determine the recommendation.

  I have no preference values, so merely boolean preferences will be used.

  If Mahout can perform these calculations then how must I present the data?

  Will I need to shape the data in some way to feed into Mahout (currently
  versed in using Hadoop via AWS EMR with Java)?
 
  Thanks for the advice in advance,
 
  Billy


Re: Is Mahout the right tool to recommend cross sales?

2013-04-11 Thread Ted Dunning
Actually, making this user based is a really good thing because you get
recommendations from one session to the next.  These may be much more
valuable for cross-sell than things in the same order.


On Thu, Apr 11, 2013 at 12:50 PM, Sean Owen sro...@gmail.com wrote:

 You can try treating your orders as the 'users'. Then just compute
 item-item similarities per usual.

 On Thu, Apr 11, 2013 at 7:59 PM, Billy b...@ntlworld.com wrote:
  Thanks for replying,
 
 
  I don't have users (well, I do :-)), but in this case they should not
  influence the recommendations; these need to be based on the relationship
  between items ordered with other items in the 'same order'.

  E.g. if item #1 has been ordered with item #4 [22] times and item #1 has
  been ordered with item #9 [57] times, then if I added item #1 to my order
  these would both be recommended, but item #9 would be recommended above
  item #4 purely based on the fact that the relationship between item #1 and
  item #9 is greater than the relationship with item #4.

  What I don't want is: if a user ordered items #A, #B, #C separately 'at
  some point in their order history' then recommend #A and #C to other users
  who order #B ... I still don't want this even if the items are similar
  and/or the users are similar.
 
  Cheers
 
  Billy
 
 
 
  On 11 Apr 2013 18:28, Sean Owen sro...@gmail.com wrote:
 
  This sounds like just a most-similar-items problem. That's good news
  because that's simpler. The only question is how you want to compute
  item-item similarities. That could be based on user-item interactions.
  If you're on Hadoop, try the RowSimilarityJob (where you will need
  rows to be items, columns the users).
 
  On Thu, Apr 11, 2013 at 6:11 PM, Billy b...@ntlworld.com wrote:
   I am very new to Mahout and have currently just read up to chapter 5 of
   'MIA', but after reading about the various user-centric and item-centric
   recommenders they all seem to still need a userId, so I'm still unsure
   whether Mahout can help with a fairly common recommendation.

   My requirement is to produce 'n' item recommendations based on a chosen
   item.

   E.g. if I've added item #1 to my order then, based on all the other items
   in all the other orders for this site, what are the likely items that I
   may also want to add to my order, given the item-to-item relationships in
   the history of orders for this site?

   Most probably using the most popular relationship between the item I have
   chosen and all the items in all the other orders.

   My data is not 'user' specific (and I don't think it should be) but rather
   order specific, as it's the pattern of items in each order that should
   determine the recommendation.

   I have no preference values, so merely boolean preferences will be used.

   If Mahout can perform these calculations then how must I present the data?

   Will I need to shape the data in some way to feed into Mahout (currently
   versed in using Hadoop via AWS EMR with Java)?
  
   Thanks for the advice in advance,
  
   Billy



Re: Is Mahout the right tool to recommend cross sales?

2013-04-11 Thread Billy
As in the example data 'intro.csv' in MIA, it has users 1-5, so if I ask
for recommendations for user 1 then this works, but if I ask for
recommendations for user 6 (a new user yet to be added to the data model)
then I get no recommendations ... so if I substitute users for orders then
again I will get no recommendations ... which I sort of understand. So do I
need to inject my 'new' active order, along with its attached item/s, into
the data model first and then ask for the recommendations for the order by
offering up the new orderId? Or is there a way of merely offering up an
'item' and then getting recommendations based merely on that item, using the
data already stored and its relationships with my item?

My assumptions:
#1
I am assuming the data model is a static island of data that has been
processed (flattened) overnight (most probably by a Hadoop process) due to
the size of this data ... rather than a living document that is updated as
soon as new data is available.
#2
I'm also assuming that instead of reading in the data model and
providing recommendations 'on the fly' I will have to run through every item
in my catalogue, find the top 5 recommended items that are ordered
with each item (most probably via a Hadoop process), and then store this
output in DynamoDB or Lucene for quick access.

Sorry for all the questions but it's such an interesting subject.


On 11 April 2013 22:04, Ted Dunning ted.dunn...@gmail.com wrote:

 Actually, making this user based is a really good thing because you get
 recommendations from one session to the next.  These may be much more
 valuable for cross-sell than things in the same order.


 On Thu, Apr 11, 2013 at 12:50 PM, Sean Owen sro...@gmail.com wrote:

 You can try treating your orders as the 'users'. Then just compute
 item-item similarities per usual.

 On Thu, Apr 11, 2013 at 7:59 PM, Billy b...@ntlworld.com wrote:
  Thanks for replying,
 
 
  I don't have users (well, I do :-)), but in this case they should not
  influence the recommendations; these need to be based on the relationship
  between items ordered with other items in the 'same order'.

  E.g. if item #1 has been ordered with item #4 [22] times and item #1 has
  been ordered with item #9 [57] times, then if I added item #1 to my order
  these would both be recommended, but item #9 would be recommended above
  item #4 purely based on the fact that the relationship between item #1 and
  item #9 is greater than the relationship with item #4.

  What I don't want is: if a user ordered items #A, #B, #C separately 'at
  some point in their order history' then recommend #A and #C to other users
  who order #B ... I still don't want this even if the items are similar
  and/or the users are similar.
 
  Cheers
 
  Billy
 
 
 
  On 11 Apr 2013 18:28, Sean Owen sro...@gmail.com wrote:
 
  This sounds like just a most-similar-items problem. That's good news
  because that's simpler. The only question is how you want to compute
  item-item similarities. That could be based on user-item interactions.
  If you're on Hadoop, try the RowSimilarityJob (where you will need
  rows to be items, columns the users).
 
  On Thu, Apr 11, 2013 at 6:11 PM, Billy b...@ntlworld.com wrote:
   I am very new to Mahout and have currently just read up to chapter 5 of
   'MIA', but after reading about the various user-centric and item-centric
   recommenders they all seem to still need a userId, so I'm still unsure
   whether Mahout can help with a fairly common recommendation.

   My requirement is to produce 'n' item recommendations based on a chosen
   item.

   E.g. if I've added item #1 to my order then, based on all the other items
   in all the other orders for this site, what are the likely items that I
   may also want to add to my order, given the item-to-item relationships in
   the history of orders for this site?

   Most probably using the most popular relationship between the item I have
   chosen and all the items in all the other orders.

   My data is not 'user' specific (and I don't think it should be) but rather
   order specific, as it's the pattern of items in each order that should
   determine the recommendation.

   I have no preference values, so merely boolean preferences will be used.

   If Mahout can perform these calculations then how must I present the data?

   Will I need to shape the data in some way to feed into Mahout (currently
   versed in using Hadoop via AWS EMR with Java)?
  
   Thanks for the advice in advance,
  
   Billy





Re: Is Mahout the right tool to recommend cross sales?

2013-04-11 Thread Sean Owen
You can actually create a user #6 for your new order. Or you can use
the anonymous user function of the library, although it's hacky.

We may be mixing up terms here. DataModel is a class that has
nothing to do with Hadoop. Hadoop in turn has no part in real-time
anything, like recommending to a brand-new user. However it could
build an offline model of item-item similarities and you could do
something like a most-similar-items computation for a given new basket
of goods. That is effectively what the anonymous user function is
doing anyway.

You can precompute all recommendations for all items but that's a lot
of work! It's easy to get away with it with a thousand items, but with
a million this may be infeasibly slow.
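
A rough sketch of that anonymous-user route (the file name and basket item are
hypothetical, and the API details are from memory, so treat it as a starting
point):

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.impl.model.PlusAnonymousUserDataModel;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class AnonymousBasketSketch {
  public static void main(String[] args) throws Exception {
    PlusAnonymousUserDataModel model =
        new PlusAnonymousUserDataModel(new FileDataModel(new File("orders.csv")));
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, new LogLikelihoodSimilarity(model));

    // The new order contains only item 1; attach it as the temporary user.
    PreferenceArray basket = new GenericUserPreferenceArray(1);
    basket.setUserID(0, PlusAnonymousUserDataModel.TEMP_USER_ID);
    basket.setItemID(0, 1L);
    model.setTempPrefs(basket);
    try {
      for (RecommendedItem item :
          recommender.recommend(PlusAnonymousUserDataModel.TEMP_USER_ID, 5)) {
        System.out.println(item.getItemID() + " : " + item.getValue());
      }
    } finally {
      model.clearTempPrefs();  // this model serves one anonymous user at a time
    }
  }
}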

On Thu, Apr 11, 2013 at 10:38 PM, Billy b...@ntlworld.com wrote:
 As in the example data 'intro.csv' in MIA, it has users 1-5, so if I ask
 for recommendations for user 1 then this works, but if I ask for
 recommendations for user 6 (a new user yet to be added to the data model)
 then I get no recommendations ... so if I substitute users for orders then
 again I will get no recommendations ... which I sort of understand. So do I
 need to inject my 'new' active order, along with its attached item/s, into
 the data model first and then ask for the recommendations for the order by
 offering up the new orderId? Or is there a way of merely offering up an
 'item' and then getting recommendations based merely on that item, using the
 data already stored and its relationships with my item?

 My assumptions:
 #1
 I am assuming the data model is a static island of data that has been
 processed (flattened) overnight (most probably by a Hadoop process) due to
 the size of this data ... rather than a living document that is updated as
 soon as new data is available.
 #2
 I'm also assuming that instead of reading in the data model and
 providing recommendations 'on the fly' I will have to run through every item
 in my catalogue, find the top 5 recommended items that are ordered
 with each item (most probably via a Hadoop process), and then store this
 output in DynamoDB or Lucene for quick access.

 Sorry for all the questions but it's such an interesting subject.


 On 11 April 2013 22:04, Ted Dunning ted.dunn...@gmail.com wrote:

 Actually, making this user based is a really good thing because you get
 recommendations from one session to the next.  These may be much more
 valuable for cross-sell than things in the same order.


 On Thu, Apr 11, 2013 at 12:50 PM, Sean Owen sro...@gmail.com wrote:

 You can try treating your orders as the 'users'. Then just compute
 item-item similarities per usual.

 On Thu, Apr 11, 2013 at 7:59 PM, Billy b...@ntlworld.com wrote:
  Thanks for replying,
 
 
  I don't have users (well, I do :-)), but in this case they should not
  influence the recommendations; these need to be based on the relationship
  between items ordered with other items in the 'same order'.

  E.g. if item #1 has been ordered with item #4 [22] times and item #1 has
  been ordered with item #9 [57] times, then if I added item #1 to my order
  these would both be recommended, but item #9 would be recommended above
  item #4 purely based on the fact that the relationship between item #1 and
  item #9 is greater than the relationship with item #4.

  What I don't want is: if a user ordered items #A, #B, #C separately 'at
  some point in their order history' then recommend #A and #C to other users
  who order #B ... I still don't want this even if the items are similar
  and/or the users are similar.
 
  Cheers
 
  Billy
 
 
 
  On 11 Apr 2013 18:28, Sean Owen sro...@gmail.com wrote:
 
  This sounds like just a most-similar-items problem. That's good news
  because that's simpler. The only question is how you want to compute
  item-item similarities. That could be based on user-item interactions.
  If you're on Hadoop, try the RowSimilarityJob (where you will need
  rows to be items, columns the users).
 
  On Thu, Apr 11, 2013 at 6:11 PM, Billy b...@ntlworld.com wrote:
   I am very new to Mahout and have currently just read up to chapter 5 of
   'MIA', but after reading about the various user-centric and item-centric
   recommenders they all seem to still need a userId, so I'm still unsure
   whether Mahout can help with a fairly common recommendation.

   My requirement is to produce 'n' item recommendations based on a chosen
   item.

   E.g. if I've added item #1 to my order then, based on all the other items
   in all the other orders for this site, what are the likely items that I
   may also want to add to my order, given the item-to-item relationships in
   the history of orders for this site?

   Most probably using the most popular relationship between the item I have
   chosen and all the items in all the other orders.

   My data is not 'user' specific (and I don't think it should be) but rather
   order specific, as it's the pattern of items in each order that
  

Re: log-likelihood ratio value in item similarity calculation

2013-04-11 Thread Ted Dunning
These numbers don't match what I get.

I get LLR = 117.

This is wildly anomalous so this pair should definitely be connected.  Both
items are quite rare (15/300,000 or 20/300,000 rates) but they occur
together most of the time that they appear.



On Wed, Apr 10, 2013 at 2:15 AM, Phoenix Bai baizh...@gmail.com wrote:

 Hi,

 the counts for two events are:
                      Event A    Everything but A
  Event B             k11 = 7    k12 = 8
  Everything but B    k21 = 13   k22 = 300,000
 according to the code, I will get:

 rowEntropy = entropy(7,8) + entropy(13, 300,000) = 222
 colEntropy = entropy(7,13) + entropy(8, 300,000) = 152
 matrixEntropy = entropy(7, 8, 13, 300,000) = 458

 thus,

 LLR=2.0*(458-222-152) = 168
 similarityScore = 1 - 1/(1+168) = 0.994

 So, my problem is,
 the similarity scores I get for all the items are this high, and that
 makes it hard to identify the really similar ones.

 As you can see, the counts of event A, and B are quite small while the
 total count for k22 is quite high. And this phenomenon is quite common in
 my dataset.

 So, my question is,
 what kind of adjustment could I make to bring the similarity scores into a
 more reasonable range?

 Please shed some light, thanks in advance!



Re: log-likelihood ratio value in item similarity calculation

2013-04-11 Thread Ted Dunning
Counts are critical here.

Suppose that two rare events occur together the first time you ever see
them.  How exciting is this?  Not very in my mind, but not necessarily
trivial.

Now suppose that they occur together 20 times and never occur alone after
you have collected 20 times more data. This is a huge deal.

Without counts, you can't see the difference.




On Wed, Apr 10, 2013 at 3:18 AM, Phoenix Bai baizh...@gmail.com wrote:

 Good point.

 btw, why use counts instead of probabilities? For ease and efficiency of
 implementation?
 Also, do you think the similarity score using counts might differ much from
 the one using probabilities?

 Thank you very much for your prompt reply.


 On Wed, Apr 10, 2013 at 5:50 PM, Sean Owen sro...@gmail.com wrote:

 These events do sound 'similar'. They occur together about half the
 time either one of them occurs. You might have many pairs that end up
 being similar for the same reason, and this is not surprising. They're
 all really similar.

 The mapping here from LLR's range of [0,inf) to [0,1] is pretty
 arbitrary, but it is an increasing function of LLR. So the ordering
 you get is exactly the ordering LLR dictates. Yes you are going to get
 a number of values near 1 at the top, but does it matter?

 LLR = 0 and similarity = 0 when the events appear perfectly
 independent. For example, if A and B occur with probability 10%,
 independently, then you might have k11 = 1, k12 = 9, k21 = 9, k22 =
 81. The matrix (joint probability) has no more info than the marginal
 probabilities, so the matrix entropy == row entropy + col entropy and
 LLR = 0.


 On Wed, Apr 10, 2013 at 10:15 AM, Phoenix Bai baizh...@gmail.com wrote:
  Hi,
 
  the counts for two events are:
                       Event A    Everything but A
   Event B             k11 = 7    k12 = 8
   Everything but B    k21 = 13   k22 = 300,000
  according to the code, I will get:
 
  rowEntropy = entropy(7,8) + entropy(13, 300,000) = 222
  colEntropy = entropy(7,13) + entropy(8, 300,000) = 152
  matrixEntropy = entropy(7, 8, 13, 300,000) = 458
 
  thus,
 
  LLR=2.0*(458-222-152) = 168
  similarityScore = 1 - 1/(1+168) = 0.994
 
  So, my problem is,
  the similarity scores I get for all the items are this high, and that
  makes it hard to identify the really similar ones.
 
  As you can see, the counts of event A, and B are quite small while the
  total count for k22 is quite high. And this phenomenon is quite common
 in
  my dataset.
 
  So, my question is,
  what kind of adjustment could I make to bring the similarity scores into a
  more reasonable range?
 
  Please shed some light, thanks in advance!





Re: log-likelihood ratio value in item similarity calculation

2013-04-11 Thread Sean Owen
Yes I also get (er, Mahout gets) 117 (116.69), FWIW.

I think the second question concerned counts vs relative frequencies
-- normalized, or not. Like whether you divide all the counts by their
sum or not. For a fixed set of observations that does change the LLR
because it is unnormalized, not because the situation has changed.

Obviously you're right that the changing situations you describe do
entail a change in LLR!
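
For reference, Mahout exposes this computation directly, so the numbers above
are easy to check (the wrapper class here is mine):

import org.apache.mahout.math.stats.LogLikelihood;

public class LlrCheck {
  public static void main(String[] args) {
    // k11 = A and B together, k12/k21 = one without the other, k22 = neither.
    double llr = LogLikelihood.logLikelihoodRatio(7, 8, 13, 300000);
    System.out.println("LLR = " + llr);                        // ~116.69
    System.out.println("sim = " + (1.0 - 1.0 / (1.0 + llr)));  // ~0.99
  }
}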

On Thu, Apr 11, 2013 at 10:52 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 These numbers don't match what I get.

 I get LLR = 117.

 This is wildly anomalous so this pair should definitely be connected.  Both
 items are quite rare (15/300,000 or 20/300,000 rates) but they occur
 together most of the time that they appear.



 On Wed, Apr 10, 2013 at 2:15 AM, Phoenix Bai baizh...@gmail.com wrote:

 Hi,

 the counts for two events are:
                      Event A    Everything but A
  Event B             k11 = 7    k12 = 8
  Everything but B    k21 = 13   k22 = 300,000
 according to the code, I will get:

 rowEntropy = entropy(7,8) + entropy(13, 300,000) = 222
 colEntropy = entropy(7,13) + entropy(8, 300,000) = 152
 matrixEntropy = entropy(7, 8, 13, 300,000) = 458

 thus,

 LLR=2.0*(458-222-152) = 168
 similarityScore = 1 - 1/(1+168) = 0.994

 So, my problem is,
 the similarity scores I get for all the items are this high, and that
 makes it hard to identify the really similar ones.

 As you can see, the counts of event A, and B are quite small while the
 total count for k22 is quite high. And this phenomenon is quite common in
 my dataset.

 So, my question is,
 what kind of adjustment could I make to bring the similarity scores into a
 more reasonable range?

 Please shed some light, thanks in advance!



Re: Is Mahout the right tool to recommend cross sales?

2013-04-11 Thread Pat Ferrel
Do you not have a user ID? No matter (though if you do I'd use it) you can use 
the item ID as a surrogate for a user ID in the recommender. And there will be 
no filtering if you ask for recommender.mostSimilarItems(long itemID, int 
howMany), which has no user ID in the call and so will not filter. Since the 
recommender doesn't know you are using item IDs for user IDs this should work 
fine.

This allows you to use the in-memory version of the recommender as it is 
described in MiA. The Row and ItemSimilarityJobs are mapreduce and will produce 
results for all items in a batch. This is fine and will produce much the same 
results but you will have to code up the query part yourself as a 
runtime/live/service component. Using the in-memory recommender gives you a 
query interface to call whenever you are showing a page to the user.

Using the user ID will allow you to make and blend in user based 
recommendations, which are calculated based on individual user history. They 
may not be your primary recommendations, but when you don't have enough item 
similarities, you can fall back or blend in user recommendations.

On Apr 11, 2013, at 2:42 PM, Sean Owen sro...@gmail.com wrote:

You can actually create a user #6 for your new order. Or you can use
the anonymous user function of the library, although it's hacky.

We may be mixing up terms here. DataModel is a class that has
nothing to do with Hadoop. Hadoop in turn has no part in real-time
anything, like recommending to a brand-new user. However it could
build an offline model of item-item similarities and you could do
something like a most-similar-items computation for a given new basket
of goods. That is effectively what the anonymous user function is
doing anyway.

You can precompute all recommendations for all items but that's a lot
of work! It's easy to get away with it with a thousand items, but with
a million this may be infeasibly slow.

On Thu, Apr 11, 2013 at 10:38 PM, Billy b...@ntlworld.com wrote:
 As in the example data 'intro.csv' in MIA, it has users 1-5, so if I ask
 for recommendations for user 1 then this works, but if I ask for
 recommendations for user 6 (a new user yet to be added to the data model)
 then I get no recommendations ... so if I substitute users for orders then
 again I will get no recommendations ... which I sort of understand. So do I
 need to inject my 'new' active order, along with its attached item/s, into
 the data model first and then ask for the recommendations for the order by
 offering up the new orderId? Or is there a way of merely offering up an
 'item' and then getting recommendations based merely on that item, using the
 data already stored and its relationships with my item?

 My assumptions:
 #1
 I am assuming the data model is a static island of data that has been
 processed (flattened) overnight (most probably by a Hadoop process) due to
 the size of this data ... rather than a living document that is updated as
 soon as new data is available.
 #2
 I'm also assuming that instead of reading in the data model and
 providing recommendations 'on the fly' I will have to run through every item
 in my catalogue, find the top 5 recommended items that are ordered
 with each item (most probably via a Hadoop process), and then store this
 output in DynamoDB or Lucene for quick access.

 Sorry for all the questions but it's such an interesting subject.
 
 
 On 11 April 2013 22:04, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Actually, making this user based is a really good thing because you get
 recommendations from one session to the next.  These may be much more
 valuable for cross-sell than things in the same order.
 
 
 On Thu, Apr 11, 2013 at 12:50 PM, Sean Owen sro...@gmail.com wrote:
 
 You can try treating your orders as the 'users'. Then just compute
 item-item similarities per usual.
 
 On Thu, Apr 11, 2013 at 7:59 PM, Billy b...@ntlworld.com wrote:
 Thanks for replying,
 
 
 I don't have users (well, I do :-)), but in this case they should not
 influence the recommendations; these need to be based on the relationship
 between items ordered with other items in the 'same order'.

 E.g. if item #1 has been ordered with item #4 [22] times and item #1 has been
 ordered with item #9 [57] times, then if I added item #1 to my order these
 would both be recommended, but item #9 would be recommended above item #4
 purely based on the fact that the relationship between item #1 and item #9 is
 greater than the relationship with item #4.

 What I don't want is: if a user ordered items #A, #B, #C separately 'at some
 point in their order history' then recommend #A and #C to other users who
 order #B ... I still don't want this even if the items are similar and/or the
 users are similar.
 
 Cheers
 
 Billy
 
 
 
 On 11 Apr 2013 18:28, Sean Owen sro...@gmail.com wrote:
 
 This sounds like just a most-similar-items problem. That's good news
 because that's simpler. The only question is how you want to compute
 

trainclassifier -type cbayes dumps text

2013-04-11 Thread Ryan Compton
I'm trying to train a simple text classifier using cbayes. I've got
formatted <Text,Text> sequence files created with
com.twitter.elephantbird.pig.store.SequenceFileStorage(), eg:

JOY  actually turning decent new year ☺
JOY  best New Years tonight! ready 2013. U+1F609 U+1F38AU+1F389
JOY  playing Dream League Soccer iPad 2 earned 13 coins!
JOY  Great way start new ear
JOY  good sober New Years Eve
ANGER_RAGE   Last night frank hasn't done revision prelims
ANGER_RAGE   hell cut forehead such ball ache! Cheers pleb chucks
glass bottles around!
ANGER_RAGE   shops open today customer services shut apparently
being paid come back tomorrow.

These are stored in a directory as:
/emotion-training-labeled/part-m-*

I pass the labeled data into cbayes:

mahout trainclassifier -i /emotion-training-labeled/ -o emotion-model/
-type cbayes -ng 1 -source hdfs

Both map and reduce get to 100%,  then I see something about Tf-Idf
followed by what looks like a complete dump of my training data printed
to the screen for the next few minutes, and then a stack trace:

rything life teach lesson, willing observe learn.” YUP!GJOYB Halbrecht
DAN CASTAIC CA found local Videographer. Register FREE:JOY Palm Read
Easy Created WorldJOY=1.0, ANGER_RAGE people fisty latelyK=1.0,
ANGER_RAGE ew gon lot em ��=1.0, ANGER_RAGE ain't gonna love =1.0}
13/04/11 15:46:51 INFO common.BayesTfIdfDriver: {dataSource=hdfs,
alpha_i=1.0, minDf=1, gramSize=1}
13/04/11 15:46:51 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the
same.
13/04/11 15:46:57 INFO mapred.FileInputFormat: Total input paths to process : 3
13/04/11 15:46:58 INFO mapred.JobClient: Cleaning up the staging area
hdfs://master/user/rfcompton/.staging/job_201303271312_2786
13/04/11 15:46:58 ERROR security.UserGroupInformation:
PriviledgedActionException as:rfcompton (auth:SIMPLE)
cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException:
java.io.IOException: Exceeded max jobconf size: 10706309 limit:
5242880
at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3766)
at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
Caused by: java.io.IOException: Exceeded max jobconf size: 10706309
limit: 5242880
at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:406)
at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3764)
... 10 more

Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
java.io.IOException: java.io.IOException: Exceeded max jobconf size:
10706309 limit: 5242880
at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3766)
at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
Caused by: java.io.IOException: Exceeded max jobconf size: 10706309
limit: 5242880
at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:406)
at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3764)
... 10 more

at org.apache.hadoop.ipc.Client.call(Client.java:1107)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
at org.apache.hadoop.mapred.$Proxy1.submitJob(Unknown Source)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:904)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at 

Re: trainclassifier -type cbayes dumps text

2013-04-11 Thread Ryan Compton
Also, right before the screen dump I see:

13/04/11 15:46:40 INFO mapred.JobClient: Combine output records=462236
13/04/11 15:46:40 INFO mapred.JobClient: Physical memory (bytes)
snapshot=1618497536
13/04/11 15:46:40 INFO mapred.JobClient: Reduce output records=419058
13/04/11 15:46:40 INFO mapred.JobClient: Virtual memory (bytes)
snapshot=4697526272
13/04/11 15:46:40 INFO mapred.JobClient: Map output records=702535
13/04/11 15:46:40 INFO cbayes.CBayesDriver: Calculating Tf-Idf...
13/04/11 15:46:41 INFO common.BayesTfIdfDriver: Counts of documents in
Each Label
13/04/11 15:46:42 INFO common.BayesTfIdfDriver: {ANGER_RAGE  family's
personal fucking bank.=1.0, ANGER_RAGE give up life...=1.0, ANGER_RAGE
understand peopleS=1.0, ANGER_RAGE many episodes record day?5=1.0,
ANGER_RAGE! need punching bag take out angerC=1.0, ANGER_RAGE right
now�� insults make laugh.A=1.0, ANGER_RAGEunny a

On Thu, Apr 11, 2013 at 3:58 PM, Ryan Compton compton.r...@gmail.com wrote:
 I'm trying to train a simple text classifier using cbayes. I've got
 formatted <Text,Text> sequence files created with
 com.twitter.elephantbird.pig.store.SequenceFileStorage(), eg:

 JOY  actually turning decent new year ☺
 JOY  best New Years tonight! ready 2013. U+1F609 U+1F38AU+1F389
 JOY  playing Dream League Soccer iPad 2 earned 13 coins!
 JOY  Great way start new ear
 JOY  good sober New Years Eve
 ANGER_RAGE   Last night frank hasn't done revision prelims
 ANGER_RAGE   hell cut forehead such ball ache! Cheers pleb chucks
 glass bottles around!
 ANGER_RAGE   shops open today customer services shut apparently
 being paid come back tomorrow.

 These are stored in a directory as:
 /emotion-training-labeled/part-m-*

 I pass the labeled data into cbayes:

 mahout trainclassifier -i /emotion-training-labeled/ -o emotion-model/
 -type cbayes -ng 1 -source hdfs

 Both map and reduce get to 100%,  then I see something about Tf-Idf
 followed by what looks like a complete dump of my training data printed
 to the screen for the next few minutes, and then a stack trace:

 rything life teach lesson, willing observe learn.” YUP!GJOYB Halbrecht
 DAN CASTAIC CA found local Videographer. Register FREE:JOY Palm Read
 Easy Created WorldJOY=1.0, ANGER_RAGE people fisty latelyK=1.0,
 ANGER_RAGE ew gon lot em ��=1.0, ANGER_RAGE ain't gonna love =1.0}
 13/04/11 15:46:51 INFO common.BayesTfIdfDriver: {dataSource=hdfs,
 alpha_i=1.0, minDf=1, gramSize=1}
 13/04/11 15:46:51 WARN mapred.JobClient: Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the
 same.
 13/04/11 15:46:57 INFO mapred.FileInputFormat: Total input paths to process : 
 3
 13/04/11 15:46:58 INFO mapred.JobClient: Cleaning up the staging area
 hdfs://master/user/rfcompton/.staging/job_201303271312_2786
 13/04/11 15:46:58 ERROR security.UserGroupInformation:
 PriviledgedActionException as:rfcompton (auth:SIMPLE)
 cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException:
 java.io.IOException: Exceeded max jobconf size: 10706309 limit:
 5242880
 at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3766)
 at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
 Caused by: java.io.IOException: Exceeded max jobconf size: 10706309
 limit: 5242880
 at 
 org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:406)
 at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3764)
 ... 10 more

 Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
 java.io.IOException: java.io.IOException: Exceeded max jobconf size:
 10706309 limit: 5242880
 at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3766)
 at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
 at java.security.AccessController.doPrivileged(Native Method)
 at 

Re: trainclassifier -type cbayes dumps text

2013-04-11 Thread Ryan Compton
OK, I think I got it.

The problem was that I wasn't naming the files properly. If I'm not
mistaken, I'll need to organize my training data like this:
-bash-3.2$ hadoop dfs -lsr /user/rfcompton/emotion-training-labeled/
-rw-r--r--   3 rfcompton hadoop2896850 2013-04-11 16:23
/user/rfcompton/emotion-training-labeled/ANGER_RAGE
-rw-r--r--   3 rfcompton hadoop3239449 2013-04-11 16:24
/user/rfcompton/emotion-training-labeled/JOY

where the contents of /user/rfcompton/emotion-training-labeled/JOY look like:
JOY  actually turning decent new year ☺
JOY  best New Years tonight! ready 2013. U+1F609 U+1F38AU+1F389
...



On Thu, Apr 11, 2013 at 4:02 PM, Ryan Compton compton.r...@gmail.com wrote:
 Also, right before the screen dump I see:

 13/04/11 15:46:40 INFO mapred.JobClient: Combine output records=462236
 13/04/11 15:46:40 INFO mapred.JobClient: Physical memory (bytes)
 snapshot=1618497536
 13/04/11 15:46:40 INFO mapred.JobClient: Reduce output records=419058
 13/04/11 15:46:40 INFO mapred.JobClient: Virtual memory (bytes)
 snapshot=4697526272
 13/04/11 15:46:40 INFO mapred.JobClient: Map output records=702535
 13/04/11 15:46:40 INFO cbayes.CBayesDriver: Calculating Tf-Idf...
 13/04/11 15:46:41 INFO common.BayesTfIdfDriver: Counts of documents in
 Each Label
 13/04/11 15:46:42 INFO common.BayesTfIdfDriver: {ANGER_RAGE  family's
 personal fucking bank.=1.0, ANGER_RAGE give up life...=1.0, ANGER_RAGE
 understand peopleS=1.0, ANGER_RAGE many episodes record day?5=1.0,
 ANGER_RAGE! need punching bag take out angerC=1.0, ANGER_RAGE right
 now�� insults make laugh.A=1.0, ANGER_RAGEunny a

 On Thu, Apr 11, 2013 at 3:58 PM, Ryan Compton compton.r...@gmail.com wrote:
 I'm trying to train a simple text classifier using cbayes. I've got
 formatted <Text,Text> sequence files created with
 com.twitter.elephantbird.pig.store.SequenceFileStorage(), eg:

 JOY  actually turning decent new year ☺
 JOY  best New Years tonight! ready 2013. U+1F609 U+1F38AU+1F389
 JOY  playing Dream League Soccer iPad 2 earned 13 coins!
 JOY  Great way start new ear
 JOY  good sober New Years Eve
 ANGER_RAGE   Last night frank hasn't done revision prelims
 ANGER_RAGE   hell cut forehead such ball ache! Cheers pleb chucks
 glass bottles around!
 ANGER_RAGE   shops open today customer services shut apparently
 being paid come back tomorrow.

 These are stored in a directory as:
 /emotion-training-labeled/part-m-*

 I pass the labeled data into cbayes:

 mahout trainclassifier -i /emotion-training-labeled/ -o emotion-model/
 -type cbayes -ng 1 -source hdfs

 Both map and reduce get to 100%,  then I see something about Tf-Idf
 followed by what looks like a complete dump of my training data printed
 to the screen for the next few minutes, and then a stack trace:

 rything life teach lesson, willing observe learn.” YUP!GJOYB Halbrecht
 DAN CASTAIC CA found local Videographer. Register FREE:JOY Palm Read
 Easy Created WorldJOY=1.0, ANGER_RAGE people fisty latelyK=1.0,
 ANGER_RAGE ew gon lot em ��=1.0, ANGER_RAGE ain't gonna love =1.0}
 13/04/11 15:46:51 INFO common.BayesTfIdfDriver: {dataSource=hdfs,
 alpha_i=1.0, minDf=1, gramSize=1}
 13/04/11 15:46:51 WARN mapred.JobClient: Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the
 same.
 13/04/11 15:46:57 INFO mapred.FileInputFormat: Total input paths to process 
 : 3
 13/04/11 15:46:58 INFO mapred.JobClient: Cleaning up the staging area
 hdfs://master/user/rfcompton/.staging/job_201303271312_2786
 13/04/11 15:46:58 ERROR security.UserGroupInformation:
 PriviledgedActionException as:rfcompton (auth:SIMPLE)
 cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException:
 java.io.IOException: Exceeded max jobconf size: 10706309 limit:
 5242880
 at 
 org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3766)
 at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
 Caused by: java.io.IOException: Exceeded max jobconf size: 10706309
 limit: 5242880
 at 
 org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:406)
 at 
 org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3764)
 ... 10 more

 Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
 

Re: Is Mahout the right tool to recommend cross sales?

2013-04-11 Thread Sebastian Schelter
You can also use the new MultithreadedBatchItemSimilarities class to
efficiently precompute item similarities on a single machine without
having to go to MapReduce.
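
Roughly like this (a sketch from memory against the 0.8 API; the file names
are hypothetical):

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.precompute.FileSimilarItemsWriter;
import org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.precompute.BatchItemSimilarities;

public class PrecomputeSketch {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("orders.csv"));
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, new LogLikelihoodSimilarity(model));
    // Precompute the 5 most similar items per item, using several threads.
    BatchItemSimilarities batch =
        new MultithreadedBatchItemSimilarities(recommender, 5);
    int written = batch.computeItemSimilarities(
        4, 1, new FileSimilarItemsWriter(new File("similarities.csv")));
    System.out.println(written + " similarities written");
  }
}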

On 12.04.2013 00:54, Pat Ferrel wrote:
 Do you not have a user ID? No matter (though if you do I'd use it) you can 
 use the item ID as a surrogate for a user ID in the recommender. And there 
 will be no filtering if you ask for recommender.mostSimilarItems(long itemID, 
 int howMany), which has no user ID in the call and so will not filter. Since 
 the recommender doesn't know you are using item IDs for user IDs this should 
 work fine.
 
 This allows you to use the in-memory version of the recommender as it is 
 described in MiA. The Row and ItemSimilarityJobs are mapreduce and will 
 produce results for all items in a batch. This is fine and will produce much 
 the same results but you will have to code up the query part yourself as a 
 runtime/live/service component. Using the in-memory recommender gives you a 
 query interface to call whenever you are showing a page to the user.
 
 Using the user ID will allow you to make and blend in user based 
 recommendations, which are calculated based on individual user history. They 
 may not be your primary recommendations, but when you don't have enough item 
 similarities, you can fall back or blend in user recommendations.
 
 On Apr 11, 2013, at 2:42 PM, Sean Owen sro...@gmail.com wrote:
 
 You can actually create a user #6 for your new order. Or you can use
 the anonymous user function of the library, although it's hacky.
 
 We may be mixing up terms here. DataModel is a class that has
 nothing to do with Hadoop. Hadoop in turn has no part in real-time
 anything, like recommending to a brand-new user. However it could
 build an offline model of item-item similarities and you could do
 something like a most-similar-items computation for a given new basket
 of goods. That is effectively what the anonymous user function is
 doing anyway.
 
 You can precompute all recommendations for all items but that's a lot
 of work! It's easy to get away with it with a thousand items, but with
 a million this may be infeasibly slow.
 
 On Thu, Apr 11, 2013 at 10:38 PM, Billy b...@ntlworld.com wrote:
 As in the example data 'intro.csv' in MIA, it has users 1-5, so if I ask
 for recommendations for user 1 then this works, but if I ask for
 recommendations for user 6 (a new user yet to be added to the data model)
 then I get no recommendations ... so if I substitute users for orders then
 again I will get no recommendations ... which I sort of understand. So do I
 need to inject my 'new' active order, along with its attached item/s, into
 the data model first and then ask for the recommendations for the order by
 offering up the new orderId? Or is there a way of merely offering up an
 'item' and then getting recommendations based merely on that item, using the
 data already stored and its relationships with my item?

 My assumptions:
 #1
 I am assuming the data model is a static island of data that has been
 processed (flattened) overnight (most probably by a Hadoop process) due to
 the size of this data ... rather than a living document that is updated as
 soon as new data is available.
 #2
 I'm also assuming that instead of reading in the data model and
 providing recommendations 'on the fly' I will have to run through every item
 in my catalogue, find the top 5 recommended items that are ordered
 with each item (most probably via a Hadoop process), and then store this
 output in DynamoDB or Lucene for quick access.

 Sorry for all the questions but it's such an interesting subject.


 On 11 April 2013 22:04, Ted Dunning ted.dunn...@gmail.com wrote:

 Actually, making this user based is a really good thing because you get
 recommendations from one session to the next.  These may be much more
 valuable for cross-sell than things in the same order.


 On Thu, Apr 11, 2013 at 12:50 PM, Sean Owen sro...@gmail.com wrote:

 You can try treating your orders as the 'users'. Then just compute
 item-item similarities per usual.

 On Thu, Apr 11, 2013 at 7:59 PM, Billy b...@ntlworld.com wrote:
 Thanks for replying,


 I don't have users (well, I do :-)), but in this case they should not
 influence the recommendations; these need to be based on the relationship
 between items ordered with other items in the 'same order'.

 E.g. if item #1 has been ordered with item #4 [22] times and item #1 has been
 ordered with item #9 [57] times, then if I added item #1 to my order these
 would both be recommended, but item #9 would be recommended above item #4
 purely based on the fact that the relationship between item #1 and item #9 is
 greater than the relationship with item #4.

 What I don't want is: if a user ordered items #A, #B, #C separately 'at some
 point in their order history' then recommend #A and #C to other users who
 order #B ... I still don't want this even if the items are
 similar and/or the users