Re: Which database should I use with Mahout

2013-05-23 Thread Ted Dunning
I think the simplest implementation is to just get extra results from the
recommender and rescore after the rough retrieval.  Integrating this into
the actual scoring engine is very hard since it depends on global
characteristics of the final result.

The same applies to result set clustering.



On Wed, May 22, 2013 at 9:51 PM, Johannes Schulte <
johannes.schu...@gmail.com> wrote:

> Yeah, that is what I had in mind as a simple solution. For examining bigger
> result sets I always fear the cost of loading a lot of stored fields;
> that's why I thought including it in the scoring might be cool. It's not
> possible with a plain Collector that maintains a priority queue of docid and
> score, but there should be some smart way to maintain a coordinate of the
> top-ranking items so far. Anyway, that's a different story.
>
> Thanks for the help so far!
>
>
> On Thu, May 23, 2013 at 2:14 AM, Ted Dunning 
> wrote:
>
> > Yes what you are describing with diversification is something that I have
> > called anti-flood. It comes from the fact that we really are optimizing a
> > portfolio of recommendations rather than a batch of independent
> > recommendations. Doing this from first principles is very hard but there
> > are very simple heuristics that do well in many practical situations.  For
> > instance, simply penalizing the score of items based on how many items
> > ranked above them that they are excessively close to does wonders.
> >
> > Sent from my iPhone
> >
> > On May 22, 2013, at 13:04, Johannes Schulte 
> > wrote:
> >
> > > Okay, I got it! I also always have used a basic form of dithering, but we
> > > always called it shuffle since it basically was / is Collections.shuffle
> > > on a bigger list of results and therefore doesn't take the rank or score
> > > into account. Will try that..
> > >
> > > With diversification I really meant something that guarantees that you
> > > maximize the intra-item distance (dithering often does, but not on
> > > purpose). The theory is, if I remember correctly, that if a user goes over
> > > the list from top to bottom and he didn't like the first item (which is
> > > assumed to be true if he looks at the second item, no idea if we really
> > > work that way), it makes sense to take a new shot with something completely
> > > different, and so on for the next items. I think it's also a topic for
> > > search engine results and I remember something like "lipschitz-bandits"
> > > from my googling, but that is way above what I am capable of doing. I
> > > just recognized that both Amazon and Netflix present multiple
> > > recommendation lists grouped by category, so in a way it's similar to
> > > search engine result clustering.
> > >
> > >
> > >
> > >
> > >
> > > On Wed, May 22, 2013 at 8:30 AM, Ted Dunning 
> > wrote:
> > >
> > >> On Tue, May 21, 2013 at 10:34 PM, Johannes Schulte <
> > >> johannes.schu...@gmail.com> wrote:
> > >>
> > >>> Thanks for the list...as a non native speaker I got problems
> > >> understanding
> > >>> the meaning of dithering here.
> > >>
> > >> Sorry about that.  Your English is good enough that I hadn't noticed
> any
> > >> deficit.
> > >>
> > >> Dithering is constructive mixing of the recommendation results.  The
> > idea
> > >> is to reorder the top results only slightly and the deeper results
> more
> > so.
> > >> There are several good effects and one (slightly) bad one.
> > >>
> > >> The good effects are:
> > >>
> > >> a) there is much less of a sharp boundary at the end of the first page
> > of
> > >> results.  This makes the statistics of the recommender work better and
> > also
> > >> helps the recommender not get stuck recommending just the things which
> > >> already appear on the first page.
> > >>
> > >> b) results that are very deep in the results can still be shown
> > >> occasionally.  This means that if the rec engine has even a hint that
> > >> something is good, it has a chance of increasing the ranking by
> > gathering
> > >> more data.  This is a bit different from (a)
> > >>
> > >> c) (bonus benefit) users like seeing novel things.  Even if they have
> > done
> > >> nothing to change their recommendations, they like seeing that you
> have
> > >> changed something so they keep coming back to the recommendation page.
> > >>
> > >> The major bad effect is that you are purposely decreasing relevance in
> > the
> > >> short term in order to get more information that will improve
> relevance
> > in
> > >> the long term.  The improvements dramatically outweigh this small
> > problem.
> > >>
> > >>
> > >>> I got the feeling that somewhere between a) and d) there is also
> > >>> diversification of items in the recommendation list, so increasing
> the
> > >>> distance between the list items according to some metric like tf/idf
> on
> > >>> item information. Never tried that, but with lucene / solr it should
> be
> > >>> possible to use this information during scoring..
> > >>
> > >> Yes.  But no

Re: Which database should I use with Mahout

2013-05-22 Thread Johannes Schulte
Yeah, that is what I had in mind as a simple solution. For examining bigger
result sets I always fear the cost of loading a lot of stored fields;
that's why I thought including it in the scoring might be cool. It's not
possible with a plain Collector that maintains a priority queue of docid and
score, but there should be some smart way to maintain a coordinate of the
top-ranking items so far. Anyway, that's a different story.

Thanks for the help so far!


On Thu, May 23, 2013 at 2:14 AM, Ted Dunning  wrote:

> Yes what you are describing with diversification is something that I have
> called anti-flood. It comes from the fact that we really are optimizing a
> portfolio of recommendations rather than a batch of independent
> recommendations. Doing this from first principles is very hard but there
> are very simple heuristics that do well in many practical situations.  For
> instance, simply penalizing the score of items based on how many items
> ranked above them that they are excessively close to does wonders.
>
> Sent from my iPhone
>
> On May 22, 2013, at 13:04, Johannes Schulte 
> wrote:
>
> > Okay, I got it! I also always have used a basic form of dithering, but we
> > always called it shuffle since it basically was / is Collections.shuffle on
> > a bigger list of results and therefore doesn't take the rank or score into
> > account. Will try that..
> >
> > With diversification I really meant something that guarantees that you
> > maximize the intra-item distance (dithering often does, but not on purpose).
> > The theory is, if I remember correctly, that if a user goes over the list
> > from top to bottom and he didn't like the first item (which is assumed to
> > be true if he looks at the second item, no idea if we really work that
> > way), it makes sense to take a new shot with something completely
> > different, and so on for the next items. I think it's also a topic
> > for search engine results and I remember something like "lipschitz-bandits"
> > from my googling, but that is way above what I am capable of doing. I
> > just recognized that both Amazon and Netflix present multiple
> > recommendation lists grouped by category, so in a way it's similar to
> > search engine result clustering.
> >
> >
> >
> >
> >
> > On Wed, May 22, 2013 at 8:30 AM, Ted Dunning 
> wrote:
> >
> >> On Tue, May 21, 2013 at 10:34 PM, Johannes Schulte <
> >> johannes.schu...@gmail.com> wrote:
> >>
> >>> Thanks for the list...as a non native speaker I got problems
> >> understanding
> >>> the meaning of dithering here.
> >>
> >> Sorry about that.  Your English is good enough that I hadn't noticed any
> >> deficit.
> >>
> >> Dithering is constructive mixing of the recommendation results.  The
> idea
> >> is to reorder the top results only slightly and the deeper results more
> so.
> >> There are several good effects and one (slightly) bad one.
> >>
> >> The good effects are:
> >>
> >> a) there is much less of a sharp boundary at the end of the first page
> of
> >> results.  This makes the statistics of the recommender work better and
> also
> >> helps the recommender not get stuck recommending just the things which
> >> already appear on the first page.
> >>
> >> b) results that are very deep in the results can still be shown
> >> occasionally.  This means that if the rec engine has even a hint that
> >> something is good, it has a chance of increasing the ranking by
> gathering
> >> more data.  This is a bit different from (a)
> >>
> >> c) (bonus benefit) users like seeing novel things.  Even if they have
> done
> >> nothing to change their recommendations, they like seeing that you have
> >> changed something so they keep coming back to the recommendation page.
> >>
> >> The major bad effect is that you are purposely decreasing relevance in
> the
> >> short term in order to get more information that will improve relevance
> in
> >> the long term.  The improvements dramatically outweigh this small
> problem.
> >>
> >>
> >>> I got the feeling that somewhere between a) and d) there is also
> >>> diversification of items in the recommendation list, so increasing the
> >>> distance between the list items according to some metric like tf/idf on
> >>> item information. Never tried that, but with lucene / solr it should be
> >>> possible to use this information during scoring..
> >>
> >> Yes.  But no.
> >>
> >> This can be done at the presentation tier entirely.  I often do it by
> >> defining a score based solely on rank, typically something like log(r).
>  I
> >> add small amounts of noise to this synthetic score, often distributed
> >> exponentially with a small mean.  Then I sort the results according to
> this
> >> sum.
> >>
> >> Here are some simulated results computed using R:
> >>
> >>> (order((log(r) - runif(500, max=2)))[1:20])
> >> [1]  1  2  3  6  5  4 14  9  8 10  7 17 11 15 13 22 28 12 20 39
> >> [1]  1  2  5  3  4  8  6 10  9 16 24 31 20 30 13 18  7 14 36 38
> >> [1]  3  1  5  2 10  4  8  7 14 2

Re: Which database should I use with Mahout

2013-05-22 Thread Ted Dunning
Yes, what you are describing as diversification is something that I have
called anti-flood. It comes from the fact that we really are optimizing a 
portfolio of recommendations rather than a batch of independent 
recommendations. Doing this from first principles is very hard but there are 
very simple heuristics that do well in many practical situations.  For 
instance, simply penalizing the score of items based on how many items ranked 
above them that they are excessively close to does wonders. 
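
A minimal R sketch of that heuristic (R just to match the simulated examples
later in the thread); the cosine similarity, the penalty weight and the
closeness threshold are illustrative assumptions, not anything prescribed here:

# Anti-flood rescoring sketch: over-fetch candidates, then penalize each item
# for every higher-ranked item it is excessively close to.
anti_flood <- function(scores, feat, penalty = 0.2, threshold = 0.8) {
  sim <- function(a, b) sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))
  adjusted <- scores
  for (i in 2:length(scores)) {
    # count the higher-ranked items this item is excessively close to
    n_close <- sum(sapply(seq_len(i - 1),
                          function(j) sim(feat[i, ], feat[j, ]) > threshold))
    adjusted[i] <- scores[i] - penalty * n_close
  }
  order(adjusted, decreasing = TRUE)   # new presentation order
}

set.seed(1)
feat   <- matrix(runif(300 * 10), nrow = 300)   # 300 candidates, 10 fake features
scores <- sort(runif(300), decreasing = TRUE)   # raw recommender scores, best first
head(anti_flood(scores, feat), 20)              # reordered indexes, flood suppressed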

Sent from my iPhone

On May 22, 2013, at 13:04, Johannes Schulte  wrote:

> Okay, I got it! I also always have used a basic form of dithering, but we
> always called it shuffle since it basically was / is Collections.shuffle on
> a bigger list of results and therefore doesn't take the rank or score into
> account. Will try that..
> 
> With diversification I really meant something that guarantees that you
> maximize the intra-item distance (dithering often does, but not on purpose).
> The theory is, if I remember correctly, that if a user goes over the list
> from top to bottom and he didn't like the first item (which is assumed to
> be true if he looks at the second item, no idea if we really work that
> way), it makes sense to take a new shot with something completely
> different, and so on for the next items. I think it's also a topic
> for search engine results and I remember something like "lipschitz-bandits"
> from my googling, but that is way above what I am capable of doing. I
> just recognized that both Amazon and Netflix present multiple
> recommendation lists grouped by category, so in a way it's similar to
> search engine result clustering.
> 
> 
> 
> 
> 
> On Wed, May 22, 2013 at 8:30 AM, Ted Dunning  wrote:
> 
>> On Tue, May 21, 2013 at 10:34 PM, Johannes Schulte <
>> johannes.schu...@gmail.com> wrote:
>> 
>>> Thanks for the list...as a non native speaker I got problems
>> understanding
>>> the meaning of dithering here.
>> 
>> Sorry about that.  Your English is good enough that I hadn't noticed any
>> deficit.
>> 
>> Dithering is constructive mixing of the recommendation results.  The idea
>> is to reorder the top results only slightly and the deeper results more so.
>> There are several good effects and one (slightly) bad one.
>> 
>> The good effects are:
>> 
>> a) there is much less of a sharp boundary at the end of the first page of
>> results.  This makes the statistics of the recommender work better and also
>> helps the recommender not get stuck recommending just the things which
>> already appear on the first page.
>> 
>> b) results that are very deep in the results can still be shown
>> occasionally.  This means that if the rec engine has even a hint that
>> something is good, it has a chance of increasing the ranking by gathering
>> more data.  This is a bit different from (a)
>> 
>> c) (bonus benefit) users like seeing novel things.  Even if they have done
>> nothing to change their recommendations, they like seeing that you have
>> changed something so they keep coming back to the recommendation page.
>> 
>> The major bad effect is that you are purposely decreasing relevance in the
>> short term in order to get more information that will improve relevance in
>> the long term.  The improvements dramatically outweigh this small problem.
>> 
>> 
>>> I got the feeling that somewhere between a) and d) there is also
>>> diversification of items in the recommendation list, so increasing the
>>> distance between the list items according to some metric like tf/idf on
>>> item information. Never tried that, but with lucene / solr it should be
>>> possible to use this information during scoring..
>> 
>> Yes.  But no.
>> 
>> This can be done at the presentation tier entirely.  I often do it by
>> defining a score based solely on rank, typically something like log(r).  I
>> add small amounts of noise to this synthetic score, often distributed
>> exponentially with a small mean.  Then I sort the results according to this
>> sum.
>> 
>> Here are some simulated results computed using R:
>> 
>>> (order((log(r) - runif(500, max=2)))[1:20])
>> [1]  1  2  3  6  5  4 14  9  8 10  7 17 11 15 13 22 28 12 20 39
>> [1]  1  2  5  3  4  8  6 10  9 16 24 31 20 30 13 18  7 14 36 38
>> [1]  3  1  5  2 10  4  8  7 14 21 19 26 29 13 27 15  6 12 33  9
>> [1]  1  2  5  3  6 17  4 20 18  7 19  9 25  8 29 21 15 27 28 12
>> [1]  1  2  5  3  7  4  8 11  9 15 10  6 33 37 17 27 36 16 34 38
>> [1]  1  4  2  5  9  3 14 13 12 17 22 25  7 15 18 36 16  6 20 29
>> [1]  1  3  4  7  2  6  5 12 18 17 13 24 27 10  8 20 14 34  9 46
>> [1]  3  1  2  6 12  8  7  5  4 19 11 26 10 15 28 35  9 20 42 25
>> 
>> As you can see, the first four results are commonly single digits.  This
>> comes about because the uniform noise that I have subtracted from the log
> >> can only make a difference of 2 to the log, which is equivalent to changing
>> the rank by a factor of about 7. If we were to use different noise
>> distributions we would get

Re: Which database should I use with Mahout

2013-05-22 Thread Pat Ferrel
This data was for a mobile shopping app. Other answers below.

> On May 21, 2013, at 5:42 PM, Ted Dunning  wrote:
> 
> Inline
> 
> 
> On Tue, May 21, 2013 at 8:59 AM, Pat Ferrel  wrote:
> 
>> In the interest of getting some empirical data out about various
>> architectures:
>> 
>> On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel  wrote:
>> 
 ...
 You use the user history vector as a query?
>>> 
>>> The most recent suffix of the history vector.  How much is used varies by
>>> the purpose.
>> 
>> We did some experiments with this using a year+ of e-com data. We measured
>> the precision using different amounts of the history vector in 3 month
>> increments. The precision increased throughout the year. At about 9 months
>> the effects of what appears to be item/product/catalog/new model churn
>> begin to become significant and so precision started to level off. We did
>> *not* apply a filter to recs so that items not in the current catalog were
>> not filtered before precision was measured. We'd expect this to improve
>> results using older data.
>> 
> 
> This is a time filter.  How many transactions did this turn out to be?  I
> typically recommend truncating based on transactions rather than time.
> 
> My own experience was music and video recommendations.  Long history
> definitely did not help much there.
> 

This is what I've heard before; recommending music and media has its own set
of characteristics. We were at the point of looking at history on segments of 
the catalog (music vs food etc.) to do the same analysis. I suspect we would 
have found what you are saying. 

It would be a bit of processing to save only so many transactions for each 
user, certainly not impossible. Logs come in by time increments so we got new 
ones periodically but didn't count each user's transactions.
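
For reference, the truncation itself is only a few lines once the logs sit in
one table. A rough base-R sketch; the data frame tx and its user/item/ts
columns are assumptions, not the real log schema:

# Keep only the most recent n transactions per user.
truncate_history <- function(tx, n = 500) {
  tx <- tx[order(tx$user, -as.numeric(tx$ts)), ]            # newest first within user
  keep <- ave(seq_len(nrow(tx)), tx$user, FUN = seq_along) <= n
  tx[keep, ]
}

tx <- data.frame(user = sample(1:3, 50, replace = TRUE),    # toy log
                 item = sample(letters, 50, replace = TRUE),
                 ts   = Sys.time() - runif(50, 0, 1e6))
table(truncate_history(tx, n = 5)$user)                     # at most 5 rows per user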

In any case the item-similarity type recs were quite a bit more predictive of
purchases than those based on user-specific history, which changes the
requirements a bit. We always measured precision on both history- and
similarity-based recs though, and a blend of both got the best score.

I don't have access to the data now--I sure wish we had some to share so these 
issues could be investigated and compared. 

> 
>>> 
>>> 20 recs is not sufficient.  Typically you need 300 for any given context
>>> and you need to recompute those very frequently.  If you use geo-specific
>>> recommendations, you may need thousands of recommendations to have enough
>>> geo-dispersion.  The search engine approach can handle all of that on the
>>> fly.
>>> 
>>> Also, the cached recs are user x (20-300) non-zeros.  The sparsified
>>> item-item cooccurrence matrix is item x 50.  Moreover, search engines are
>>> very good at compression.  If users >> items, then item x 50 is much
>>> smaller, especially after high quality compression (6:1 is a common
>>> compression ratio).
>>> 
>> 
>> The end application designed by the ecom customer required less than 10
>> recs for any given context so 20 gave us room for runtime context type
>> boosting.
>> 
> 
> And how do you generate the next page of results?
> 

It was a mobile app so it did not have a next page of results; it was an app
design issue we had no control over. Actually Amazon on their web site uses 
only 100 in a horizontally scrolling strip, we had much less space to fill.

But regarding saving more recs--most of my experience was in storing the entire
recommendation matrix for a slightly different purpose. I was working on 
building the cross-recommender, which (as you know) is an ensemble of models 
where you have to learn weights for each part of the model. To do a linear 
combination, without knowing the weight ahead of time, you need virtually all 
recs for each query. I never got to finish that so I've reproduced the code but 
now, as I said, lack the data.

From all the research I did into the predictive power of various actions, the 
cross-recommender seemed to hold the most promise for cleaning one action using 
another more predictive action. If I can prove this out then a whole range of 
multi-action chains present themselves. At the very least we have the framework for
creating and learning the weights for small ensembles.
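
For what it's worth, once you do have (near-)complete score lists per query,
the weight learning itself can be prototyped cheaply. A hedged R sketch, where
scoreA/scoreB (users x items, one per model) and the 0/1 held-out purchase
matrix are purely hypothetical placeholders, and the grid search is just one
simple way to pick a linear blend weight:

# Precision@k for a score matrix against held-out purchases.
precision_at_k <- function(score, heldout, k = 10) {
  hits <- sapply(seq_len(nrow(score)), function(u) {
    top <- order(score[u, ], decreasing = TRUE)[1:k]
    sum(heldout[u, top])
  })
  mean(hits / k)
}

# Pick the blend weight that maximizes precision@k on validation data.
learn_blend <- function(scoreA, scoreB, heldout, k = 10) {
  alphas <- seq(0, 1, by = 0.05)
  prec <- sapply(alphas, function(a)
    precision_at_k(a * scoreA + (1 - a) * scoreB, heldout, k))
  alphas[which.max(prec)]
}

set.seed(2)
scoreA  <- matrix(runif(100 * 50), nrow = 100)     # fake model A scores
scoreB  <- matrix(runif(100 * 50), nrow = 100)     # fake model B scores
heldout <- matrix(rbinom(100 * 50, 1, 0.05), nrow = 100)
learn_blend(scoreA, scoreB, heldout)               # best weight on this toy data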



Re: Which database should I use with Mahout

2013-05-22 Thread Johannes Schulte
Okay, I got it! I also always have used a basic form of dithering, but we
always called it shuffle since it basically was / is Collections.shuffle on
a bigger list of results and therefore doesn't take the rank or score into
account. Will try that..

With diversification I really meant something that guarantees that you
maximize the intra-item distance (dithering often does, but not on purpose).
The theory is, if I remember correctly, that if a user goes over the list
from top to bottom and he didn't like the first item (which is assumed to
be true if he looks at the second item, no idea if we really work that
way), it makes sense to take a new shot with something completely
different, and so on for the next items. I think it's also a topic
for search engine results and I remember something like "lipschitz-bandits"
from my googling, but that is way above what I am capable of doing. I
just recognized that both Amazon and Netflix present multiple
recommendation lists grouped by category, so in a way it's similar to
search engine result clustering.





On Wed, May 22, 2013 at 8:30 AM, Ted Dunning  wrote:

> On Tue, May 21, 2013 at 10:34 PM, Johannes Schulte <
> johannes.schu...@gmail.com> wrote:
>
> > Thanks for the list...as a non native speaker I got problems
> understanding
> > the meaning of dithering here.
> >
>
> Sorry about that.  Your English is good enough that I hadn't noticed any
> deficit.
>
> Dithering is constructive mixing of the recommendation results.  The idea
> is to reorder the top results only slightly and the deeper results more so.
>  There are several good effects and one (slightly) bad one.
>
> The good effects are:
>
> a) there is much less of a sharp boundary at the end of the first page of
> results.  This makes the statistics of the recommender work better and also
> helps the recommender not get stuck recommending just the things which
> already appear on the first page.
>
> b) results that are very deep in the results can still be shown
> occasionally.  This means that if the rec engine has even a hint that
> something is good, it has a chance of increasing the ranking by gathering
> more data.  This is a bit different from (a)
>
> c) (bonus benefit) users like seeing novel things.  Even if they have done
> nothing to change their recommendations, they like seeing that you have
> changed something so they keep coming back to the recommendation page.
>
> The major bad effect is that you are purposely decreasing relevance in the
> short term in order to get more information that will improve relevance in
> the long term.  The improvements dramatically outweigh this small problem.
>
>
> > I got the feeling that somewhere between a) and d) there is also
> > diversification of items in the recommendation list, so increasing the
> > distance between the list items according to some metric like tf/idf on
> > item information. Never tried that, but with lucene / solr it should be
> > possible to use this information during scoring..
> >
>
> Yes.  But no.
>
> This can be done at the presentation tier entirely.  I often do it by
> defining a score based solely on rank, typically something like log(r).  I
> add small amounts of noise to this synthetic score, often distributed
> exponentially with a small mean.  Then I sort the results according to this
> sum.
>
> Here are some simulated results computed using R:
>
> > (order((log(r) - runif(500, max=2)))[1:20])
>  [1]  1  2  3  6  5  4 14  9  8 10  7 17 11 15 13 22 28 12 20 39
>  [1]  1  2  5  3  4  8  6 10  9 16 24 31 20 30 13 18  7 14 36 38
>  [1]  3  1  5  2 10  4  8  7 14 21 19 26 29 13 27 15  6 12 33  9
>  [1]  1  2  5  3  6 17  4 20 18  7 19  9 25  8 29 21 15 27 28 12
>  [1]  1  2  5  3  7  4  8 11  9 15 10  6 33 37 17 27 36 16 34 38
>  [1]  1  4  2  5  9  3 14 13 12 17 22 25  7 15 18 36 16  6 20 29
>  [1]  1  3  4  7  2  6  5 12 18 17 13 24 27 10  8 20 14 34  9 46
>  [1]  3  1  2  6 12  8  7  5  4 19 11 26 10 15 28 35  9 20 42 25
>
> As you can see, the first four results are commonly single digits.  This
> comes about because the uniform noise that I have subtracted from the log
> can only make a difference of 2 to the log, which is equivalent to changing
> the rank by a factor of about 7. If we were to use different noise
> distributions we would get somewhat different kinds of perturbation.  For
> instance, using exponentially distributed noise gives mostly tame results
> with some real surprises:
>
> > (order((log(r) - 0.3*rexp(500)))[1:20])
>  [1]  1  2  3  8  4  5  9  6  7 25 14 11 13 24 10 31 34 12 22 21
>  [1]  1  2  5  4  3  6  7 12  8 10  9 17 13 11 14 25 64 15 47 19
>  [1]  1  2  3  4  5  6  7 10  8  9 11 21 13 12 15 16 14 25 18 33
>  [1]  1  2  3 10  4  5  7 14  6  8 13  9 15 25 16 11 20 12 17 54
>  [1]  1  3  2  4  7  5  6  8 11 23  9 32 18 10 13 15 12 48 14 19
>  [1]  1  3  2  4  5 10 12  6  9  7  8 18 16 17 11 13 25 14 15 19
>  [1]  6  1  2  4  3  5  9 11  7 15  8 10 14 12 19 16 13 25 39 18
>  [1]  1  2  3  

Re: Which database should I use with Mahout

2013-05-21 Thread Ted Dunning
On Tue, May 21, 2013 at 10:34 PM, Johannes Schulte <
johannes.schu...@gmail.com> wrote:

> Thanks for the list...as a non native speaker I got problems understanding
> the meaning of dithering here.
>

Sorry about that.  Your English is good enough that I hadn't noticed any
deficit.

Dithering is constructive mixing of the recommendation results.  The idea
is to reorder the top results only slightly and the deeper results more so.
 There are several good effects and one (slightly) bad one.

The good effects are:

a) there is much less of a sharp boundary at the end of the first page of
results.  This makes the statistics of the recommender work better and also
helps the recommender not get stuck recommending just the things which
already appear on the first page.

b) results that are very deep in the results can still be shown
occasionally.  This means that if the rec engine has even a hint that
something is good, it has a chance of increasing the ranking by gathering
more data.  This is a bit different from (a)

c) (bonus benefit) users like seeing novel things.  Even if they have done
nothing to change their recommendations, they like seeing that you have
changed something so they keep coming back to the recommendation page.

The major bad effect is that you are purposely decreasing relevance in the
short term in order to get more information that will improve relevance in
the long term.  The improvements dramatically outweigh this small problem.


> I got the feeling that somewhere between a) and d) there is also
> diversification of items in the recommendation list, so increasing the
> distance between the list items according to some metric like tf/idf on
> item information. Never tried that, but with lucene / solr it should be
> possible to use this information during scoring..
>

Yes.  But no.

This can be done at the presentation tier entirely.  I often do it by
defining a score based solely on rank, typically something like log(r).  I
add small amounts of noise to this synthetic score, often distributed
exponentially with a small mean.  Then I sort the results according to this
sum.

Here are some simulated results computed using R:

> (order((log(r) - runif(500, max=2)))[1:20])
 [1]  1  2  3  6  5  4 14  9  8 10  7 17 11 15 13 22 28 12 20 39
 [1]  1  2  5  3  4  8  6 10  9 16 24 31 20 30 13 18  7 14 36 38
 [1]  3  1  5  2 10  4  8  7 14 21 19 26 29 13 27 15  6 12 33  9
 [1]  1  2  5  3  6 17  4 20 18  7 19  9 25  8 29 21 15 27 28 12
 [1]  1  2  5  3  7  4  8 11  9 15 10  6 33 37 17 27 36 16 34 38
 [1]  1  4  2  5  9  3 14 13 12 17 22 25  7 15 18 36 16  6 20 29
 [1]  1  3  4  7  2  6  5 12 18 17 13 24 27 10  8 20 14 34  9 46
 [1]  3  1  2  6 12  8  7  5  4 19 11 26 10 15 28 35  9 20 42 25

As you can see, the first four results are commonly single digits.  This
comes about because the uniform noise that I have subtracted from the log
can only make a difference of 2 to the log, which is equivalent to changing
the rank by a factor of about 7 (e^2 ≈ 7.4). If we were to use different noise
distributions we would get somewhat different kinds of perturbation.  For
instance, using exponentially distributed noise gives mostly tame results
with some real surprises:

> (order((log(r) - 0.3*rexp(500)))[1:20])
 [1]  1  2  3  8  4  5  9  6  7 25 14 11 13 24 10 31 34 12 22 21
 [1]  1  2  5  4  3  6  7 12  8 10  9 17 13 11 14 25 64 15 47 19
 [1]  1  2  3  4  5  6  7 10  8  9 11 21 13 12 15 16 14 25 18 33
 [1]  1  2  3 10  4  5  7 14  6  8 13  9 15 25 16 11 20 12 17 54
 [1]  1  3  2  4  7  5  6  8 11 23  9 32 18 10 13 15 12 48 14 19
 [1]  1  3  2  4  5 10 12  6  9  7  8 18 16 17 11 13 25 14 15 19
 [1]  6  1  2  4  3  5  9 11  7 15  8 10 14 12 19 16 13 25 39 18
 [1]  1  2  3  4 30  5  7  6  9  8 16 11 10 15 12 13 37 14 31 23
 [1]  1  2  3  4  9 16  5  6  8  7 10 13 11 17 15 19 12 20 14 26
 [1]  1  2  3 13  5  4  7  6  8 15 12 11  9 10 36 14 24 70 19 16
 [1]   1   2   6   3   5   4  11  22   7   9 250   8  10  15  12  17 13  40
 16  14
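
For anyone who wants to paste this into R: the snippets presumably assume r is
simply the original rank vector 1..N. A self-contained version of the same
idea, with the function names being made up here:

# Uniform-noise dithering: reorder top results slightly, deeper results more.
dither <- function(results, epsilon = 2) {
  r <- seq_along(results)                        # original ranks 1..N
  synthetic <- log(r) - runif(length(r), max = epsilon)
  results[order(synthetic)]                      # reordered result list
}

# Exponential-noise variant: mostly tame, with occasional deep surprises.
dither_exp <- function(results, scale = 0.3) {
  r <- seq_along(results)
  results[order(log(r) - scale * rexp(length(r)))]
}

recs <- paste0("item-", 1:500)                   # stand-in for a real result list
head(dither(recs), 20)                           # different on every call
head(dither_exp(recs), 20)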




> Have a nice day
>
>
>
>
> On Wed, May 22, 2013 at 2:30 AM, Ted Dunning 
> wrote:
>
> > I have so far just used the weights that Solr applies natively.
> >
> > In my experience, what makes a recommendation engine work better is, in
> > order of importance,
> >
> > a) dithering so that you gather wider data
> >
> > b) using multiple sources of input
> >
> > c) returning results quickly and reliably
> >
> > d) the actual algorithm or weighting scheme
> >
> > If you can cover items a-c in a real business, you are very lucky.  The
> > search engine approach handles (b) and (c) by nature which massively
> > improves the likelihood of ever getting to examine (d).
> >
> >
> > On Tue, May 21, 2013 at 1:13 AM, Johannes Schulte <
> > johannes.schu...@gmail.com> wrote:
> >
> > > Thanks! Could you also add how to learn the weights you talked about,
> or
> > at
> > > least a hint? Learning weights for search engine query terms always
> > sounds
> > > like  "learning to rank" to me but this always seemed p

Re: Which database should I use with Mahout

2013-05-21 Thread Johannes Schulte
Thanks for the list...as a non native speaker I got problems understanding
the meaning of dithering here.

I got the feeling that somewhere between a) and d) there is also
diversification of items in the recommendation list, so increasing the
distance between the list items according to some metric like tf/idf on
item information. Never tried that, but with lucene / solr it should be
possible to use this information during scoring..

Have a nice day




On Wed, May 22, 2013 at 2:30 AM, Ted Dunning  wrote:

> I have so far just used the weights that Solr applies natively.
>
> In my experience, what makes a recommendation engine work better is, in
> order of importance,
>
> a) dithering so that you gather wider data
>
> b) using multiple sources of input
>
> c) returning results quickly and reliably
>
> d) the actual algorithm or weighting scheme
>
> If you can cover items a-c in a real business, you are very lucky.  The
> search engine approach handles (b) and (c) by nature which massively
> improves the likelihood of ever getting to examine (d).
>
>
> On Tue, May 21, 2013 at 1:13 AM, Johannes Schulte <
> johannes.schu...@gmail.com> wrote:
>
> > Thanks! Could you also add how to learn the weights you talked about, or
> at
> > least a hint? Learning weights for search engine query terms always
> sounds
> > like  "learning to rank" to me but this always seemed pretty complicated
> > and i never managed to try it out..
> >
> >
>


Re: Which database should I use with Mahout

2013-05-21 Thread Ted Dunning
Inline


On Tue, May 21, 2013 at 8:59 AM, Pat Ferrel  wrote:

> In the interest of getting some empirical data out about various
> architectures:
>
> On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel  wrote:
>
> >> ...
> >> You use the user history vector as a query?
> >
> > The most recent suffix of the history vector.  How much is used varies by
> > the purpose.
>
> We did some experiments with this using a year+ of e-com data. We measured
> the precision using different amounts of the history vector in 3 month
> increments. The precision increased throughout the year. At about 9 months
> the effects of what appears to be item/product/catalog/new model churn
> begin to become significant and so precision started to level off. We did
> *not* apply a filter to recs so that items not in the current catalog were
> not filtered before precision was measured. We'd expect this to improve
> results using older data.
>

This is a time filter.  How many transactions did this turn out to be?  I
typically recommend truncating based on transactions rather than time.

My own experience was music and video recommendations.  Long history
definitely did not help much there.


> >
> > 20 recs is not sufficient.  Typically you need 300 for any given context
> > and you need to recompute those very frequently.  If you use geo-specific
> > recommendations, you may need thousands of recommendations to have enough
> > geo-dispersion.  The search engine approach can handle all of that on the
> > fly.
> >
> > Also, the cached recs are user x (20-300) non-zeros.  The sparsified
> > item-item cooccurrence matrix is item x 50.  Moreover, search engines are
> > very good at compression.  If users >> items, then item x 50 is much
> > smaller, especially after high quality compression (6:1 is a common
> > compression ratio).
> >
>
> The end application designed by the ecom customer required less than 10
> recs for any given context so 20 gave us room for runtime context type
> boosting.
>

And how do you generate the next page of results?


> Given that precision increased for a year of user history and that we
> needed to return 20 recs per user and per item the history matrix was
> nearly 2 orders of magnitude larger than the recs matrix. This was with
> about 5M users and 500K items over a year.


The history matrix should be at most 2.5 T bits = 300GB.  Remember, this is
a binary matrix that is relatively sparse so I would expect that a typical
size would be more like a gigabyte.


> The issue I was asking about was how to store and retrieve history vectors
> for queries. In our case it looks like some kind of scalable persistence
> store would be required and since pre-calculated recs are indeed much
> smaller...
>

I am still confused about this assertion.  I think that you need <500
history items per person which is about 500 * 19bits < 1.3KB / user.  I
also think that you need 100 or more recs if you prestore them which is
also in the kilobyte range.  This doesn't sound all that different.

And then the search index needs to store 500K x 50 nonzeros = 100 MB.  This
is much smaller than either the history or the prestored recommendations
even before any compression.
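
A quick back-of-the-envelope check of those numbers in R (19 bits because
2^19 > 500K item ids; the 4 bytes per stored nonzero for the index is an
assumption, not a measured figure):

items <- 5e5; users <- 5e6
bits_per_id <- ceiling(log2(items))       # 19 bits address 500K items
500 * bits_per_id / 8                     # ~1.2 KB of recent history per user
items * 50 * 4 / 2^20                     # ~95 MB index at 4 bytes per nonzero
users * items / 8 / 2^30                  # ~291 GB for a dense binary history matrix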



> Yes using a search engine the index is very small but the history vectors
> are not. Actually I wonder how well Solr would handle a large query? Is the
> truncation of the history vector required perhaps?
>

The history vector is rarely more than a hundred terms which is not that
large a query.


> > Actually, it is.  Round trip of less than 10ms is common.  Precalculation
> > goes away.  Export of recs nearly goes away.  Currency of recommendations
> > is much higher.
>
> This is certainly great performance, no doubt. Using a 12 node Cassandra
> ring (each machine had 16G of memory) spread across two geolocations we got
> 24,000 tps to a worst case of 5000 tps. The average response for the entire
> system (which included two internal service layers and one query to
> cassandra) was 5-10ms per response.
>

Uh... my numbers are for a single node.  Query rates are typically about
1-2K queries/second so the speed is comparable.


Re: Which database should I use with Mahout

2013-05-21 Thread Ted Dunning
I have so far just used the weights that Solr applies natively.

In my experience, what makes a recommendation engine work better is, in
order of importance,

a) dithering so that you gather wider data

b) using multiple sources of input

c) returning results quickly and reliably

d) the actual algorithm or weighting scheme

If you can cover items a-c in a real business, you are very lucky.  The
search engine approach handles (b) and (c) by nature which massively
improves the likelihood of ever getting to examine (d).


On Tue, May 21, 2013 at 1:13 AM, Johannes Schulte <
johannes.schu...@gmail.com> wrote:

> Thanks! Could you also add how to learn the weights you talked about, or at
> least a hint? Learning weights for search engine query terms always sounds
> like  "learning to rank" to me but this always seemed pretty complicated
> and i never managed to try it out..
>
>


Re: Which database should I use with Mahout

2013-05-21 Thread Pat Ferrel
In the interest of getting some empirical data out about various architectures:

On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel  wrote:

>> ...
>> You use the user history vector as a query?
> 
> The most recent suffix of the history vector.  How much is used varies by
> the purpose.

We did some experiments with this using a year+ of e-com data. We measured the 
precision using different amounts of the history vector in 3 month increments. 
The precision increased throughout the year. At about 9 months the effects of
what appears to be item/product/catalog/new model churn begin to become 
significant and so precision started to level off. We did *not* apply a filter 
to recs so that items not in the current catalog were not filtered before 
precision was measured. We'd expect this to improve results using older data.

In our case we never found a good truncation point though it looked like we 
were reaching one when the data ran out. Even the last 3 months produced a 4.5% 
better precision score.

> 
>> ...
>> Seems like you'd rule out browser based storage because you need the
>> history to train your next model. At least it would be in addition to a 
>> server based storage of history.
> 
> Yes.  In addition to.
> 
>> The user history matrix will be quite a bit larger than the user
>> recommendation matrix, maybe an order or two larger.
> 
> 
> I don't think so.  And it doesn't matter since this is reduced to
> significant cooccurrence and that is typically quite small compared to a
> list of recommendations for all users.
> 
>> I have 20 recs for me stored but I've purchased 100's of items, and have
>> viewed 1000's.
>> 
> 
> 20 recs is not sufficient.  Typically you need 300 for any given context
> and you need to recompute those very frequently.  If you use geo-specific
> recommendations, you may need thousands of recommendations to have enough
> geo-dispersion.  The search engine approach can handle all of that on the
> fly.
> 
> Also, the cached recs are user x (20-300) non-zeros.  The sparsified
> item-item cooccurrence matrix is item x 50.  Moreover, search engines are
> very good at compression.  If users >> items, then item x 50 is much
> smaller, especially after high quality compression (6:1 is a common
> compression ratio).
> 

The end application designed by the ecom customer required less than 10 recs 
for any given context, so 20 gave us room for runtime context-type boosting.

Given that precision increased for a year of user history, and that we needed to
return 20 recs per user and per item, the history matrix was nearly 2 orders of
magnitude larger than the recs matrix. This was with about 5M users and 500K
items over a year. The issue I was asking about was how to store and retrieve
history vectors for queries. In our case it looks like some kind of scalable
persistence store would be required, and since pre-calculated recs are indeed
much smaller...

I fully believe your description of how well search engines store their index. 
The cooccurrence matrix is already sparsified by a similarity metric and any 
compression that Solr does will help keep the index small. In any case Solr 
does sharding so it can scale past one machine anyway.
 
>> 
>> Given that you have to have the entire user history vector to do the query
>> and given that this is still a lookup from an even larger matrix than the
>> recs/user matrix and given that you have to do the lookup before the Solr
>> query It can't be faster than just looking up pre-calculated recs.
> 
> 
> None of this applies.  There is an item x 50 sized search index.  There is
> a recent history that is available without a lookup.  All that is required
> is a single Solr query and that can handle multiple kinds of history and
> geo-location and user search terms all in a single step.
> 

Yes using a search engine the index is very small but the history vectors are 
not. Actually I wonder how well Solr would handle a large query? Is the 
truncation of the history vector required perhaps?

>> Something here may be "orders of magnitude" faster, but it isn't the total
>> elapsed time to return recs at runtime, right?
> 
> 
> Actually, it is.  Round trip of less than 10ms is common.  Precalculation
> goes away.  Export of recs nearly goes away.  Currency of recommendations
> is much higher.

This is certainly great performance, no doubt. Using a 12 node Cassandra ring 
(each machine had 16G of memory) spread across two geolocations we got 24,000 
tps to a worst case of 5000 tps. The average response for the entire system 
(which included two internal service layers and one query to cassandra) was 
5-10ms per response.

>> 
>> Maybe what you are saying is the time to pre-calculate the recs is 0 since
>> they are calculated at runtime but you still have to create the
>> cooccurrence matrix so you still need something like mahout hadoop to
>> produce a model and you still need to index the model with Solr and you
>> still need to lookup user his

Re: Which database should I use with Mahout

2013-05-21 Thread Johannes Schulte
Thanks! Could you also add how to learn the weights you talked about, or at
least a hint? Learning weights for search engine query terms always sounds
like "learning to rank" to me, but this always seemed pretty complicated
and I never managed to try it out..


On Tue, May 21, 2013 at 8:01 AM, Ted Dunning  wrote:

> Johannes,
>
> Your summary is good.
>
> I would add that the precalculated recommendations can be large enough that
> the lookup becomes more expensive.  Your point about staleness is very
> on-point.
>
>
> On Mon, May 20, 2013 at 10:15 PM, Johannes Schulte <
> johannes.schu...@gmail.com> wrote:
>
> > I think Pat is just saying that
> > time(history_lookup) (1) + time (recommendation_calculation) (2) >
> > time(precalc_lookup) (3)
> >
> > since 1 and 3 are assumed to be served by the same system class (key
> value
> > store, db) with a single key and 2 > 0.
> >
> > Ted is using a lot of information that is available at recommendation time
> > and not fetched from a somewhere ("context of delivery", geolocation).
> The
> > question remaining is why the recent history is available without a
> lookup,
> > which can only be the case if the recommendation calculation is embedded
> in
> > a bigger request cycle the history is loaded somewhere else, or it's just
> > stored in the browser.
> >
> > if you would store the classical (netflix/mahout) user-item history in
> the
> > browser and use a disk matrix thing like lucene for calculation you would
> > end up in the same range.
> >
> > I think the points are more:
> >
> >
> > 1. Having more input's than the classical item-interactions
> > (geolocation->item,search_term->item ..) can be very easily carried out
> > with search index storing this precalculated "association rules"
> >
> > 2. Precalculation per user is heavyweight, stale and hard to do if the
> > context also plays a role (site the use is on e.g because you have to
> have
> > the cartesian product of recommendations prepared for every user), while
> > "real time" approach can handle it
> >
> >
> >
> >
> >
> > On Tue, May 21, 2013 at 2:00 AM, Ted Dunning 
> > wrote:
> >
> > > Inline answers.
> > >
> > >
> > > On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel 
> > wrote:
> > >
> > > > ...
> > > > You use the user history vector as a query?
> > >
> > >
> > > The most recent suffix of the history vector.  How much is used varies
> by
> > > the purpose.
> > >
> > >
> > > > This will be a list of item IDs and strength-of-preference values
> > (maybe
> > > > 1s for purchases).
> > >
> > >
> > > Just a list of item x action codes.  No strength needed.  If you have 5
> > > point ratings, then you can have 5 actions for each item.  The
> weighting
> > > for each action can be learned.
> > >
> > >
> > > > The cooccurrence matrix has columns treated like terms and rows
> treated
> > > > like documents though both are really items.
> > >
> > >
> > > Well, they are different.  The rows are fields within documents
> > associated
> > > with an item.  Other fields include ID and other things.  The contents
> of
> > > the field are the codes associated with the item-action pairs for each
> > > non-null column.  Usually there is only one action so this reduces to a
> > > single column per item.
> > >
> > >
> > >
> > >
> > > > Does Solr support weighted term lists as queries or do you have to
> > throw
> > > > out strength-of-preference?
> > >
> > >
> > > I prefer to throw it out even though Solr would not require me to do
> so.
> > >  They weights that I want can be encoded in the document index in any
> > case.
> > >
> > >
> > > > I ask because there are cases where the query will have non '1.0'
> > values.
> > > > When the strength values are just 1 the vector is really only a list
> or
> > > > terms (items IDs).
> > > >
> > >
> > > I really don't know of any cases where this is really true.  There are
> > > actions that are categorical.  I like to separate them out or to reduce
> > to
> > > a binary case.
> > >
> > >
> > > >
> > > > This technique seems like using a doc as a query but you have reduced
> > the
> > > > doc to the form of a vector of weighted terms. I was unaware that
> Solr
> > > > allowed weighted term queries. This is really identical to using Solr
> > for
> > > > fast doc similarity queries.
> > > >
> > >
> > > It is really more like an ordinary query.  Typical recommendation
> queries
> > > are short since they are only recent history.
> > >
> > >
> > > >
> > > > ...
> > > >
> > > > Seems like you'd rule out browser based storage because you need the
> > > > history to train your next model.
> > >
> > >
> > > Nothing says that we can't store data in two places according to use.
> > >  Browser history is good for the part of the history that becomes the
> > > query.  Central storage is good for the mass of history that becomes
> > input
> > > for analytics.
> > >
> > > At least it would be in addition to a server based storage of history.
> > >
> > >
> > > Yes.  In addition to.
> > >
> > >
> > 

Re: Which database should I use with Mahout

2013-05-20 Thread Ted Dunning
Johannes,

Your summary is good.

I would add that the precalculated recommendations can be large enough that
the lookup becomes more expensive.  Your point about staleness is very
on-point.


On Mon, May 20, 2013 at 10:15 PM, Johannes Schulte <
johannes.schu...@gmail.com> wrote:

> I think Pat is just saying that
> time(history_lookup) (1) + time (recommendation_calculation) (2) >
> time(precalc_lookup) (3)
>
> since 1 and 3 are assumed to be served by the same system class (key value
> store, db) with a single key and 2 > 0.
>
> Ted is using a lot of information that is available at recommendation time
> and not fetched from a somewhere ("context of delivery", geolocation). The
> question remaining is why the recent history is available without a lookup,
> which can only be the case if the recommendation calculation is embedded in
> a bigger request cycle the history is loaded somewhere else, or it's just
> stored in the browser.
>
> if you would store the classical (netflix/mahout) user-item history in the
> browser and use a disk matrix thing like lucene for calculation you would
> end up in the same range.
>
> I think the points are more:
>
>
> 1. Having more input's than the classical item-interactions
> (geolocation->item,search_term->item ..) can be very easily carried out
> with search index storing this precalculated "association rules"
>
> 2. Precalculation per user is heavyweight, stale and hard to do if the
> context also plays a role (site the use is on e.g because you have to have
> the cartesian product of recommendations prepared for every user), while
> "real time" approach can handle it
>
>
>
>
>
> On Tue, May 21, 2013 at 2:00 AM, Ted Dunning 
> wrote:
>
> > Inline answers.
> >
> >
> > On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel 
> wrote:
> >
> > > ...
> > > You use the user history vector as a query?
> >
> >
> > The most recent suffix of the history vector.  How much is used varies by
> > the purpose.
> >
> >
> > > This will be a list of item IDs and strength-of-preference values
> (maybe
> > > 1s for purchases).
> >
> >
> > Just a list of item x action codes.  No strength needed.  If you have 5
> > point ratings, then you can have 5 actions for each item.  The weighting
> > for each action can be learned.
> >
> >
> > > The cooccurrence matrix has columns treated like terms and rows treated
> > > like documents though both are really items.
> >
> >
> > Well, they are different.  The rows are fields within documents
> associated
> > with an item.  Other fields include ID and other things.  The contents of
> > the field are the codes associated with the item-action pairs for each
> > non-null column.  Usually there is only one action so this reduces to a
> > single column per item.
> >
> >
> >
> >
> > > Does Solr support weighted term lists as queries or do you have to
> throw
> > > out strength-of-preference?
> >
> >
> > I prefer to throw it out even though Solr would not require me to do so.
> >  The weights that I want can be encoded in the document index in any
> case.
> >
> >
> > > I ask because there are cases where the query will have non '1.0'
> values.
> > > When the strength values are just 1 the vector is really only a list or
> > > terms (items IDs).
> > >
> >
> > I really don't know of any cases where this is really true.  There are
> > actions that are categorical.  I like to separate them out or to reduce
> to
> > a binary case.
> >
> >
> > >
> > > This technique seems like using a doc as a query but you have reduced
> the
> > > doc to the form of a vector of weighted terms. I was unaware that Solr
> > > allowed weighted term queries. This is really identical to using Solr
> for
> > > fast doc similarity queries.
> > >
> >
> > It is really more like an ordinary query.  Typical recommendation queries
> > are short since they are only recent history.
> >
> >
> > >
> > > ...
> > >
> > > Seems like you'd rule out browser based storage because you need the
> > > history to train your next model.
> >
> >
> > Nothing says that we can't store data in two places according to use.
> >  Browser history is good for the part of the history that becomes the
> > query.  Central storage is good for the mass of history that becomes
> input
> > for analytics.
> >
> > At least it would be in addition to a server based storage of history.
> >
> >
> > Yes.  In addition to.
> >
> >
> > > Another reason you wouldn't rely only on a browser storage is that it
> > will
> > > be occasionally destroyed. Users span multiple devices these days too.
> > >
> >
> > This can be dealt with using cookie resurrection techniques.  Or by
> letting
> > the user destroy their copy of the history if they like.
> >
> > The user history matrix will be quite a bit larger than the user
> > > recommendation matrix, maybe an order or two larger.
> >
> >
> > I don't think so.  And it doesn't matter since this is reduced to
> > significant cooccurrence and that is typically quite small compared to a
> > list of recommenda

Re: Which database should I use with Mahout

2013-05-20 Thread Johannes Schulte
I think Pat is just saying that
time(history_lookup) (1) + time (recommendation_calculation) (2) >
time(precalc_lookup) (3)

since 1 and 3 are assumed to be served by the same system class (key value
store, db) with a single key and 2 > 0.

Ted is using a lot of information that is available at recommendation time
and not fetched from somewhere ("context of delivery", geolocation). The
question remaining is why the recent history is available without a lookup,
which can only be the case if the recommendation calculation is embedded in
a bigger request cycle, the history is loaded somewhere else, or it's just
stored in the browser.

If you would store the classical (Netflix/Mahout) user-item history in the
browser and use a disk-backed matrix like Lucene for the calculation, you
would end up in the same range.

I think the points are more:


1. Having more inputs than the classical item interactions
(geolocation->item, search_term->item ..) can be very easily carried out
with a search index storing these precalculated "association rules"

2. Precalculation per user is heavyweight, stale and hard to do if the
context also plays a role (e.g. the site the user is on, because you have to
have the Cartesian product of recommendations prepared for every user), while
the "real time" approach can handle it





On Tue, May 21, 2013 at 2:00 AM, Ted Dunning  wrote:

> Inline answers.
>
>
> On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel  wrote:
>
> > ...
> > You use the user history vector as a query?
>
>
> The most recent suffix of the history vector.  How much is used varies by
> the purpose.
>
>
> > This will be a list of item IDs and strength-of-preference values (maybe
> > 1s for purchases).
>
>
> Just a list of item x action codes.  No strength needed.  If you have 5
> point ratings, then you can have 5 actions for each item.  The weighting
> for each action can be learned.
>
>
> > The cooccurrence matrix has columns treated like terms and rows treated
> > like documents though both are really items.
>
>
> Well, they are different.  The rows are fields within documents associated
> with an item.  Other fields include ID and other things.  The contents of
> the field are the codes associated with the item-action pairs for each
> non-null column.  Usually there is only one action so this reduces to a
> single column per item.
>
>
>
>
> > Does Solr support weighted term lists as queries or do you have to throw
> > out strength-of-preference?
>
>
> I prefer to throw it out even though Solr would not require me to do so.
>  The weights that I want can be encoded in the document index in any case.
>
>
> > I ask because there are cases where the query will have non '1.0' values.
> > When the strength values are just 1 the vector is really only a list of
> > terms (item IDs).
> >
>
> I really don't know of any cases where this is really true.  There are
> actions that are categorical.  I like to separate them out or to reduce to
> a binary case.
>
>
> >
> > This technique seems like using a doc as a query but you have reduced the
> > doc to the form of a vector of weighted terms. I was unaware that Solr
> > allowed weighted term queries. This is really identical to using Solr for
> > fast doc similarity queries.
> >
>
> It is really more like an ordinary query.  Typical recommendation queries
> are short since they are only recent history.
>
>
> >
> > ...
> >
> > Seems like you'd rule out browser based storage because you need the
> > history to train your next model.
>
>
> Nothing says that we can't store data in two places according to use.
>  Browser history is good for the part of the history that becomes the
> query.  Central storage is good for the mass of history that becomes input
> for analytics.
>
> At least it would be in addition to a server based storage of history.
>
>
> Yes.  In addition to.
>
>
> > Another reason you wouldn't rely only on a browser storage is that it
> will
> > be occasionally destroyed. Users span multiple devices these days too.
> >
>
> This can be dealt with using cookie resurrection techniques.  Or by letting
> the user destroy their copy of the history if they like.
>
> The user history matrix will be quite a bit larger than the user
> > recommendation matrix, maybe an order or two larger.
>
>
> I don't think so.  And it doesn't matter since this is reduced to
> significant cooccurrence and that is typically quite small compared to a
> list of recommendations for all users.
>
> I have 20 recs for me stored but I've purchased 100's of items, and have
> > viewed 1000's.
> >
>
> 20 recs is not sufficient.  Typically you need 300 for any given context
> and you need to recompute those very frequently.  If you use geo-specific
> recommendations, you may need thousands of recommendations to have enough
> geo-dispersion.  The search engine approach can handle all of that on the
> fly.
>
> Also, the cached recs are user x (20-300) non-zeros.  The sparsified
> item-item cooccurrence matrix is item x 50.  Moreov

Re: Which database should I use with Mahout

2013-05-20 Thread Ted Dunning
Inline answers.


On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel  wrote:

> ...
> You use the user history vector as a query?


The most recent suffix of the history vector.  How much is used varies by
the purpose.


> This will be a list of item IDs and strength-of-preference values (maybe
> 1s for purchases).


Just a list of item x action codes.  No strength needed.  If you have 5
point ratings, then you can have 5 actions for each item.  The weighting
for each action can be learned.
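
A minimal sketch of what such item x action codes could look like in Java; the "_r" and "_buy" suffix convention is purely illustrative, not anything Mahout or Solr prescribes:

// Hypothetical encoding: a 5-point rating becomes one of five item-action
// codes, so "rated item 123 with 4 stars" is just the term "123_r4" and a
// purchase might be "123_buy"; no numeric strength is carried along.
public class ActionCode {
    public static String rating(long itemId, int stars) {
        return itemId + "_r" + stars;      // e.g. "123_r4"
    }
    public static String purchase(long itemId) {
        return itemId + "_buy";            // e.g. "123_buy"
    }
}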


> The cooccurrence matrix has columns treated like terms and rows treated
> like documents though both are really items.


Well, they are different.  The rows are fields within documents associated
with an item.  Other fields include ID and other things.  The contents of
the field are the codes associated with the item-action pairs for each
non-null column.  Usually there is only one action so this reduces to a
single column per item.
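
A rough sketch of that document layout with SolrJ; the field names ("purchase_indicators", "view_indicators") and the core URL are assumptions, not anything from this thread:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// One document per item; each action gets its own field holding the item
// codes that significantly co-occur with it (the output of the sparsified
// cooccurrence analysis). Field and core names are illustrative.
public class ItemIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "123");
        doc.addField("purchase_indicators", "7 91 133");   // items co-purchased with 123
        doc.addField("view_indicators", "7 250");          // second action type, optional
        solr.add(doc);
        solr.commit();
    }
}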




> Does Solr support weighted term lists as queries or do you have to throw
> out strength-of-preference?


I prefer to throw it out even though Solr would not require me to do so.
 The weights that I want can be encoded in the document index in any case.


> I ask because there are cases where the query will have non '1.0' values.
> When the strength values are just 1 the vector is really only a list of
> terms (item IDs).
>

I really don't know of any cases where this is really true.  There are
actions that are categorical.  I like to separate them out or to reduce to
a binary case.


>
> This technique seems like using a doc as a query but you have reduced the
> doc to the form of a vector of weighted terms. I was unaware that Solr
> allowed weighted term queries. This is really identical to using Solr for
> fast doc similarity queries.
>

It is really more like an ordinary query.  Typical recommendation queries
are short since they are only recent history.


>
> ...
>
> Seems like you'd rule out browser based storage because you need the
> history to train your next model.


Nothing says that we can't store data in two places according to use.
 Browser history is good for the part of the history that becomes the
query.  Central storage is good for the mass of history that becomes input
for analytics.

> At least it would be in addition to a server based storage of history.


Yes.  In addition to.


> Another reason you wouldn't rely only on browser storage is that it will
> be occasionally destroyed. Users span multiple devices these days too.
>

This can be dealt with using cookie resurrection techniques.  Or by letting
the user destroy their copy of the history if they like.
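
A sketch of that split, assuming a plain servlet container: only the short recent-history suffix lives in a cookie (the cookie name, separator and cap below are made up), while the full history still flows to central storage for analytics.

import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Keeps the last few item codes client-side; this is the part that becomes
// the recommendation query. Names and limits are illustrative only.
public class RecentHistoryCookie {
    private static final String NAME = "recent_items";
    private static final int MAX_ITEMS = 20;

    public static void append(HttpServletRequest req, HttpServletResponse resp, String itemCode) {
        String old = "";
        if (req.getCookies() != null) {
            for (Cookie c : req.getCookies()) {
                if (NAME.equals(c.getName())) old = c.getValue();
            }
        }
        StringBuilder sb = new StringBuilder(itemCode);       // newest first
        String[] items = old.isEmpty() ? new String[0] : old.split("\\.");
        for (int i = 0; i < items.length && i < MAX_ITEMS - 1; i++) {
            sb.append('.').append(items[i]);
        }
        Cookie updated = new Cookie(NAME, sb.toString());
        updated.setMaxAge(60 * 60 * 24 * 30);                 // keep for 30 days
        resp.addCookie(updated);
    }
}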

> The user history matrix will be quite a bit larger than the user
> recommendation matrix, maybe an order or two larger.


I don't think so.  And it doesn't matter since this is reduced to
significant cooccurrence and that is typically quite small compared to a
list of recommendations for all users.

> I have 20 recs for me stored but I've purchased 100's of items, and have
> viewed 1000's.
>

20 recs is not sufficient.  Typically you need 300 for any given context
and you need to recompute those very frequently.  If you use geo-specific
recommendations, you may need thousands of recommendations to have enough
geo-dispersion.  The search engine approach can handle all of that on the
fly.

Also, the cached recs are user x (20-300) non-zeros.  The sparsified
item-item cooccurrence matrix is item x 50.  Moreover, search engines are
very good at compression.  If users >> items, then item x 50 is much
smaller, especially after high quality compression (6:1 is a common
compression ratio).


>
> Given that you have to have the entire user history vector to do the query
> and given that this is still a lookup from an even larger matrix than the
> recs/user matrix and given that you have to do the lookup before the Solr
> query, it can't be faster than just looking up pre-calculated recs.


None of this applies.  There is an item x 50 sized search index.  There is
a recent history that is available without a lookup.  All that is required
is a single Solr query and that can handle multiple kinds of history and
geo-location and user search terms all in a single step.
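
A hedged sketch of such a combined request with SolrJ; the field names ("purchase_indicators", "title", "location"), the geofilt radius and the row count are all assumptions:

import org.apache.solr.client.solrj.SolrQuery;

// One request that mixes recent-history indicator terms, a user search term
// and a geo filter; everything named here is illustrative.
public class CombinedRecQuery {
    public static SolrQuery build(String historyTerms, String searchTerms,
                                  double lat, double lon) {
        SolrQuery q = new SolrQuery();
        q.setQuery("purchase_indicators:(" + historyTerms + ")"
                 + " OR title:(" + searchTerms + ")");
        q.addFilterQuery("{!geofilt sfield=location pt=" + lat + "," + lon + " d=50}");
        q.setRows(300);   // deep result list, in line with the 300-per-context figure above
        return q;
    }
}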



> In other words the query to produce the query will be more problematic
> than the query to produce the result, right?
>

Nope.  No such thing, therefore cost = 0.


> Something here may be "orders of magnitude" faster, but it isn't the total
> elapsed time to return recs at runtime, right?


Actually, it is.  Round trip of less than 10ms is common.  Precalculation
goes away.  Export of recs nearly goes away.  Currency of recommendations
is much higher.


> Maybe what you are saying is the time to pre-calculate the recs is 0 since
> they are calculated at runtime but you still have to create the
> cooccurrence matrix so you still need something like mahout hadoop to
> produce a model and you still need to index the model with Solr and you
> still need to look

Re: Which database should I use with Mahout

2013-05-20 Thread Ken Krugler
Hi Pat,

On May 20, 2013, at 9:46am, Pat Ferrel wrote:

> I certainly have questions about this architecture mentioned below but first 
> let me make sure I understand. 
> 
> You use the user history vector as a query? This will be a list of item IDs 
> and strength-of-preference values (maybe 1s for purchases). The cooccurrence 
> matrix has columns treated like terms and rows treated like documents though 
> both are really items. Does Solr support weighted term lists as queries

Yes - you can "boost" individual terms in the query.

And you can use payloads on terms in the index to adjust their scores as well.
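
For example (a sketch with SolrJ; the field name, boosts and core URL are illustrative), per-term boosts use the standard ^ syntax:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

// Recent history as an OR query over the indicator field, with optional
// per-term boosts (^) if you do want to keep strength-of-preference.
public class BoostedHistoryQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");
        SolrQuery q = new SolrQuery("purchase_indicators:(7^2.0 91 250^0.5)");
        q.setRows(300);
        QueryResponse rsp = solr.query(q);
        for (SolrDocument d : rsp.getResults()) {
            System.out.println(d.getFieldValue("id"));
        }
    }
}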

-- Ken

> or do you have to throw out strength-of-preference? I ask because there are 
> cases where the query will have non '1.0' values. When the strength values 
> are just 1 the vector is really only a list of terms (item IDs).
> 
> This technique seems like using a doc as a query but you have reduced the doc 
> to the form of a vector of weighted terms. I was unaware that Solr allowed 
> weighted term queries. This is really identical to using Solr for fast doc 
> similarity queries.
> 
> >>> Using a cooccurrence matrix means you are doing item similarity since
>>> there is no user data in the matrix. Or are you talking about using the
>>> user history as the query? in which case you have to remember somewhere all
>>> users' history and look it up for the query, no?
>>> 
>> 
>> Yes.  You do.  And that is the key to making this orders of magnitude
>> faster.
>> 
>> But that is generally fairly trivial to do.  One option is to keep it in a
>> cookie.  Another is to use browser persistent storage.  Another is to use a
>> memory based user profile database.  Yet another is to use M7 tables on
>> MapR or HBase on other Hadoop distributions.
>> 
> 
> Seems like you'd rule out browser based storage because you need the history 
> to train your next model. At least it would be in addition to a server based 
> storage of history. Another reason you wouldn't rely only on browser 
> storage is that it will be occasionally destroyed. Users span multiple 
> devices these days too.
> 
> The user history matrix will be quite a bit larger than the user 
> recommendation matrix, maybe an order or two larger. I have 20 recs for me 
> stored but I've purchased 100's of items, and have viewed 1000's. 
> 
> Given that you have to have the entire user history vector to do the query 
> and given that this is still a lookup from an even larger matrix than the 
> recs/user matrix and given that you have to do the lookup before the Solr 
> query, it can't be faster than just looking up pre-calculated recs. In other
> words the query to produce the query will be more problematic than the query 
> to produce the result, right?
> 
> Something here may be "orders of magnitude" faster, but it isn't the total 
> elapsed time to return recs at runtime, right? Maybe what you are saying is 
> the time to pre-calculate the recs is 0 since they are calculated at runtime 
> but you still have to create the cooccurrence matrix so you still need 
> something like mahout hadoop to produce a model and you still need to index 
> the model with Solr and you still need to lookup user history at runtime. 
> Indexing with Solr is faster than loading a db (8 hours? They are doing 
> something wrong) but the query side will be slower unless I've missed 
> something. 
> 
> In any case you *have* introduced a realtime rec calculation. This is able to 
> use user history that may be seconds old and not yet reflected in the 
> training data (the cooccurrence matrix) and this is very interesting! 
> 
>>> 
>>> This will scale to thousands or tens of thousands of recommendations per
>>> second against 10's of millions of items.  The number of users doesn't
>>> matter.
>>> 
>> 
> 
> Yes, no doubt, but the history lookup is still an issue unless I've missed 
> something. The NoSQL queries will scale to tens of thousands of recs against 
> 10s of millions of items but perhaps with larger, more complex infrastructure? 
> Not sure how Solr scales.
> 
> Being semi-ignorant of Solr, intuition says that it's doing something to speed 
> things up like using only part of the data somewhere to do approximations. 
> Have there been any performance comparisons of say precision of one approach 
> vs the other or do they return identical results?
> 
> 

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Re: Which database should I use with Mahout

2013-05-20 Thread Pat Ferrel
I certainly have questions about this architecture mentioned below but first 
let me make sure I understand. 

You use the user history vector as a query? This will be a list of item IDs and 
strength-of-preference values (maybe 1s for purchases). The cooccurrence matrix 
has columns treated like terms and rows treated like documents though both are 
really items. Does Solr support weighted term lists as queries or do you have 
to throw out strength-of-preference? I ask because there are cases where the 
query will have non '1.0' values. When the strength values are just 1 the 
vector is really only a list of terms (item IDs).

This technique seems like using a doc as a query but you have reduced the doc 
to the form of a vector of weighted terms. I was unaware that Solr allowed 
weighted term queries. This is really identical to using Solr for fast doc 
similarity queries.

>> Using a cooccurrence matrix means you are doing item similarity since
>> there is no user data in the matrix. Or are you talking about using the
>> user history as the query? in which case you have to remember somewhere all
>> users' history and look it up for the query, no?
>> 
> 
> Yes.  You do.  And that is the key to making this orders of magnitude
> faster.
> 
> But that is generally fairly trivial to do.  One option is to keep it in a
> cookie.  Another is to use browser persistent storage.  Another is to use a
> memory based user profile database.  Yet another is to use M7 tables on
> MapR or HBase on other Hadoop distributions.
> 

Seems like you'd rule out browser based storage because you need the history to 
train your next model. At least it would be in addition to a server based 
storage of history. Another reason you wouldn't rely only on browser storage 
is that it will be occasionally destroyed. Users span multiple devices these 
days too.

The user history matrix will be quite a bit larger than the user recommendation 
matrix, maybe an order or two larger. I have 20 recs for me stored but I've 
purchased 100's of items, and have viewed 1000's. 

Given that you have to have the entire user history vector to do the query and 
given that this is still a lookup from an even larger matrix than the recs/user 
matrix and given that you have to do the lookup before the Solr query, it can't 
be faster than just looking up pre-calculated recs. In other words the query to 
produce the query will be more problematic than the query to produce the 
result, right?

Something here may be "orders of magnitude" faster, but it isn't the total 
elapsed time to return recs at runtime, right? Maybe what you are saying is the 
time to pre-calculate the recs is 0 since they are calculated at runtime but 
you still have to create the cooccurrence matrix so you still need something 
like mahout hadoop to produce a model and you still need to index the model 
with Solr and you still need to lookup user history at runtime. Indexing with 
Solr is faster than loading a db (8 hours? They are doing something wrong) but 
the query side will be slower unless I've missed something. 

In any case you *have* introduced a realtime rec calculation. This is able to 
use user history that may be seconds old and not yet reflected in the training 
data (the cooccurrence matrix) and this is very interesting! 

>> 
>> This will scale to thousands or tens of thousands of recommendations per
>> second against 10's of millions of items.  The number of users doesn't
>> matter.
>> 
> 

Yes, no doubt, but the history lookup is still an issue unless I've missed 
something. The NoSQL queries will scale to tens of thousands of recs against 
10s of millions of items but perhaps with larger, more complex infrastructure? 
Not sure how Solr scales.

Being semi-ignorant of Solr, intuition says that it's doing something to speed 
things up like using only part of the data somewhere to do approximations. Have 
there been any performance comparisons of say precision of one approach vs the 
other or do they return identical results?




Re: Which database should I use with Mahout

2013-05-19 Thread Ted Dunning
On Sun, May 19, 2013 at 8:34 PM, Pat Ferrel  wrote:

> Won't argue with how fast Solr is; it's another fast and scalable lookup
> engine and another option. Especially if you don't need to lookup anything
> else by user, in which case you are back to a db...
>

But remember, it is also doing more than lookup.  It is computing scores on
items and retaining the highest scoring items.


> Using a cooccurrence matrix means you are doing item similarity since
> there is no user data in the matrix. Or are you talking about using the
> user history as the query? in which case you have to remember somewhere all
> users' history and look it up for the query, no?
>

Yes.  You do.  And that is the key to making this orders of magnitude
faster.

But that is generally fairly trivial to do.  One option is to keep it in a
cookie.  Another is to use browser persistent storage.  Another is to use a
memory based user profile database.  Yet another is to use M7 tables on
MapR or HBase on other Hadoop distributions.


> On May 19, 2013, at 8:09 PM, Ted Dunning  wrote:
>
> On Sun, May 19, 2013 at 8:04 PM, Pat Ferrel  wrote:
>
> > Two basic solutions to this are: factorize (reduces 100s of thousands of
> > items to hundreds of 'features') and continue to calculate recs at
> runtime,
> > which you have to do with Myrrix since mahout does not have an in-memory
> > ALS impl, or move to the mahout hadoop recommenders and pre-calculate
> recs.
> >
>
> Or sparsify the cooccurrence matrix and run recommendations out of a search
> engine.
>
> This will scale to thousands or tens of thousands of recommendations per
> second against 10's of millions of items.  The number of users doesn't
> matter.
>
>


Re: Which database should I use with Mahout

2013-05-19 Thread Pat Ferrel
Won't argue with how fast Solr is; it's another fast and scalable lookup engine 
and another option. Especially if you don't need to lookup anything else by 
user, in which case you are back to a db...

Using a cooccurrence matrix means you are doing item similarity since there is 
no user data in the matrix. Or are you talking about using the user history as 
the query? in which case you have to remember somewhere all users' history and 
look it up for the query, no?

On May 19, 2013, at 8:09 PM, Ted Dunning  wrote:

On Sun, May 19, 2013 at 8:04 PM, Pat Ferrel  wrote:

> Two basic solutions to this are: factorize (reduces 100s of thousands of
> items to hundreds of 'features') and continue to calculate recs at runtime,
> which you have to do with Myrrix since mahout does not have an in-memory
> ALS impl, or move to the mahout hadoop recommenders and pre-calculate recs.
> 

Or sparsify the cooccurrence matrix and run recommendations out of a search
engine.

This will scale to thousands or tens of thousands of recommendations per
second against 10's of millions of items.  The number of users doesn't
matter.



Re: Which database should I use with Mahout

2013-05-19 Thread Ted Dunning
On Sun, May 19, 2013 at 8:04 PM, Pat Ferrel  wrote:

> Two basic solutions to this are: factorize (reduces 100s of thousands of
> items to hundreds of 'features') and continue to calculate recs at runtime,
> which you have to do with Myrrix since mahout does not have an in-memory
> ALS impl, or move to the mahout hadoop recommenders and pre-calculate recs.
>

Or sparsify the cooccurrence matrix and run recommendations out of a search
engine.

This will scale to thousands or tens of thousands of recommendations per
second against 10's of millions of items.  The number of users doesn't
matter.
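
One common way to do the sparsification step is a log-likelihood ratio test on each item pair; a minimal sketch with Mahout's LogLikelihood class, where the counts and threshold are made-up numbers:

import org.apache.mahout.math.stats.LogLikelihood;

// Keep an item pair as an "indicator" only if its log-likelihood ratio score
// clears a threshold. k11 = users who did both A and B, k12 = A without B,
// k21 = B without A, k22 = neither; all values here are illustrative.
public class CooccurrenceSparsifier {
    public static boolean keepPair(long k11, long k12, long k21, long k22, double threshold) {
        return LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22) > threshold;
    }

    public static void main(String[] args) {
        System.out.println(keepPair(120, 380, 95, 99405, 10.0));   // true for these counts
    }
}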


Re: Which database should I use with Mahout

2013-05-19 Thread Pat Ferrel
Ah, which, for completeness, brings up another scaling issue with Mahout. The 
in-memory Mahout recommenders do not pre-calculate all users' recs. They keep 
the preference matrix in-memory and calculate the recommendations at runtime. 
At some point the size of your data will max a single machine. In my experience 
this happens by maxing CPU usage before the memory is maxed. I began to hit 
performance limits with 200,000 items and around 1M users. 

Two basic solutions to this are: factorize (reduces 100s of thousands of items 
to hundreds of 'features') and continue to calculate recs at runtime, which you 
have to do with Myrrix since mahout does not have an in-memory ALS impl, or 
move to the mahout hadoop recommenders and pre-calculate recs.


On May 19, 2013, at 6:34 PM, Sean Owen  wrote:

(I had in mind non distributed parts of Mahout but the principle is
similar, yes.)
On May 19, 2013 6:27 PM, "Pat Ferrel"  wrote:

> Using a Hadoop version of a Mahout recommender will create some number of
> recs for all users as its output. Sean is talking about Myrrix I think
> which uses factorization to get much smaller models and so can calculate
> the recs at runtime for fairly large user sets.
> 
> However if you are using Mahout and Hadoop the question is how to store
> and lookup recommendations in the quickest scalable way. You will have a
> user ID and perhaps an item ID as a key to the list of recommendations. The
> fastest thing to do is have a hashmap in memory, perhaps read in from HDFS.
> Remember that Mahout will output the recommendations with internal Mahout
> IDs so you will have to replace these in the data with your actual user and
> item ids.
> 
> I use a NoSQL DB, either MongoDB or Cassandra but others are fine too,
> even MySQL if you can scale it to meet your needs. I end up with two
> tables, one has my user ID as a key and recommendations with my item IDs
> either ordered or with strengths. The second table has my item ID as the
> key with a list of similar items (again sorted or with strengths). At
> runtime I may have both a user ID and an item ID context so I get a list
> from both tables and combine them at runtime.
> 
> I use a DB for many reasons and let it handle the caching. I never need to
> worry about memory management. If you have scaled your DB properly the
> lookups will actually be executed like an in-memory hashmap with indexed
> keys for ids. Scaling the DB can be done as your user base grows when
> needed without affecting the rest of the calculation pipeline. Yes there
> will be overhead due to network traffic in a cluster but the flexibility is
> worth it for me. If high availability is important you can spread out your
> db cluster over multiple data centers without affecting the API for serving
> recommendations. I set up the recommendation calculation to run
> continuously in the background, replacing values in the two tables as fast
> as I can. This allows you to scale update speed (how many machines in the
> mahout/hadoop cluster) independently from lookup performance scaling (how
> many machines in your db cluster, how much memory do the db machines have).
> 
> On May 19, 2013, at 11:45 AM, Manuel Blechschmidt <
> manuel.blechschm...@gmx.de> wrote:
> 
> Hi Tevfik,
> I am working with mysql but I would guess that HDFS like Sean suggested
> would be a good idea as well.
> 
> There is also a project called sqoop which can be used to transfer data
> from relation databases to Hadoop.
> 
> http://sqoop.apache.org/
> 
> Scribe might be also an option for transferring a lot of data:
> https://github.com/facebook/scribe#readme
> 
> I would suggest that you just start with the technology that you know best
> and then solve the problems as soon as you get them.
> 
> /Manuel
> 
> Am 19.05.2013 um 20:26 schrieb Sean Owen:
> 
>> I think everyone is agreeing that it is essential to only access
>> information in memory at run-time, yes, whatever that info may be.
>> I don't think the original question was about Hadoop, but, the answer
>> is the same: Hadoop mappers are just reading the input serially. There
>> is no advantage to a relational database or NoSQL database; they're
>> just overkill. HDFS is sufficient, and probably even best of these at
>> allowing fast serial access to the data.
>> 
>> On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin
>>  wrote:
>>> Hi Manuel,
>>> But if one uses matrix factorization and stores the user and item
>>> factors in memory then there will be no database access during
>>> recommendation.
>>> I thought that the original question was where to store the data and
>>> how to give it to hadoop.
>>> 
>>> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt
>>>  wrote:
 Hi Tevfik,
 one request to the recommender could become more than 1000 queries to
> the database depending on which recommender you use and the amount of
> preferences for the given user.
 
 The problem is not if you are using SQL, NoSQL, or any other query
> la

Re: Which database should I use with Mahout

2013-05-19 Thread Ted Dunning
On Sun, May 19, 2013 at 6:26 PM, Pat Ferrel  wrote:

> Using a Hadoop version of a Mahout recommender will create some number of
> recs for all users as its output. Sean is talking about Myrrix I think
> which uses factorization to get much smaller models and so can calculate
> the recs at runtime for fairly large user sets.
>

The Mahout recommender can also produce a model in the form of item-item
matrices that can be used to produce recommendations on the fly from
a memory-based model.
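
A hedged sketch of one way to serve such item-item similarities from memory with the Taste API; the file name and the hard-coded pairs stand in for the output of the Hadoop job:

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

// Load precomputed item-item similarities into memory and recommend on the
// fly; "prefs.csv" and the sample pairs are illustrative only.
public class ItemItemOnTheFly {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("prefs.csv"));
        List<GenericItemSimilarity.ItemItemSimilarity> sims =
                new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
        sims.add(new GenericItemSimilarity.ItemItemSimilarity(7L, 91L, 0.83));
        sims.add(new GenericItemSimilarity.ItemItemSimilarity(7L, 133L, 0.52));
        GenericItemBasedRecommender rec =
                new GenericItemBasedRecommender(model, new GenericItemSimilarity(sims));
        for (RecommendedItem item : rec.recommend(42L, 10)) {
            System.out.println(item.getItemID() + "\t" + item.getValue());
        }
    }
}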

> However if you are using Mahout and Hadoop the question is how to store and
> lookup recommendations in the quickest scalable way. You will have a user
> ID and perhaps an item ID as a key to the list of recommendations. The
> fastest thing to do is have a hashmap in memory, perhaps read in from HDFS.


Or just use Solr and create the recommendations on the fly.


> Remember that Mahout will output the recommendations with internal Mahout
> IDs so you will have to replace these in the data with your actual user and
> item ids.
>

This can be repaired at index time using a search engine as well.


> I use a NoSQL DB, either MongoDB or Cassandra but others are fine too,
> even MySQL if you can scale it to meet your needs. I end up with two
> tables, one has my user ID as a key and recommendations with my item IDs
> either ordered or with strengths. The second table has my item ID as the
> key with a list of similar items (again sorted or with strengths). At
> runtime I may have both a user ID and an item ID context so I get a list
> from both tables and combine them at runtime.
>

MapR has a large bank as a client who used this approach.  Exporting recs
took 8 hours.  Switching to Solr to compute the recommendations decreased
export time to under 3 minutes.


>


Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
(I had in mind non distributed parts of Mahout but the principle is
similar, yes.)
On May 19, 2013 6:27 PM, "Pat Ferrel"  wrote:

> Using a Hadoop version of a Mahout recommender will create some number of
> recs for all users as its output. Sean is talking about Myrrix I think
> which uses factorization to get much smaller models and so can calculate
> the recs at runtime for fairly large user sets.
>
> However if you are using Mahout and Hadoop the question is how to store
> and lookup recommendations in the quickest scalable way. You will have a
> user ID and perhaps an item ID as a key to the list of recommendations. The
> fastest thing to do is have a hashmap in memory, perhaps read in from HDFS.
> Remember that Mahout will output the recommendations with internal Mahout
> IDs so you will have to replace these in the data with your actual user and
> item ids.
>
> I use a NoSQL DB, either MongoDB or Cassandra but others are fine too,
> even MySQL if you can scale it to meet your needs. I end up with two
> tables, one has my user ID as a key and recommendations with my item IDs
> either ordered or with strengths. The second table has my item ID as the
> key with a list of similar items (again sorted or with strengths). At
> runtime I may have both a user ID and an item ID context so I get a list
> from both tables and combine them at runtime.
>
> I use a DB for many reasons and let it handle the caching. I never need to
> worry about memory management. If you have scaled your DB properly the
> lookups will actually be executed like an in-memory hashmap with indexed
> keys for ids. Scaling the DB can be done as your user base grows when
> needed without affecting the rest of the calculation pipeline. Yes there
> will be overhead due to network traffic in a cluster but the flexibility is
> worth it for me. If high availability is important you can spread out your
> db cluster over multiple data centers without affecting the API for serving
> recommendations. I set up the recommendation calculation to run
> continuously in the background, replacing values in the two tables as fast
> as I can. This allows you to scale update speed (how many machines in the
> mahout/hadoop cluster) independently from lookup performance scaling (how
> many machines in your db cluster, how much memory do the db machines have).
>
> On May 19, 2013, at 11:45 AM, Manuel Blechschmidt <
> manuel.blechschm...@gmx.de> wrote:
>
> Hi Tevfik,
> I am working with mysql but I would guess that HDFS like Sean suggested
> would be a good idea as well.
>
> There is also a project called sqoop which can be used to transfer data
> from relation databases to Hadoop.
>
> http://sqoop.apache.org/
>
> Scribe might be also an option for transferring a lot of data:
> https://github.com/facebook/scribe#readme
>
> I would suggest that you just start with the technology that you know best
> and then solve the problems as soon as you get them.
>
> /Manuel
>
> Am 19.05.2013 um 20:26 schrieb Sean Owen:
>
> > I think everyone is agreeing that it is essential to only access
> > information in memory at run-time, yes, whatever that info may be.
> > I don't think the original question was about Hadoop, but, the answer
> > is the same: Hadoop mappers are just reading the input serially. There
> > is no advantage to a relational database or NoSQL database; they're
> > just overkill. HDFS is sufficient, and probably even best of these at
> > allowing fast serial access to the data.
> >
> > On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin
> >  wrote:
> >> Hi Manuel,
> >> But if one uses matrix factorization and stores the user and item
> >> factors in memory then there will be no database access during
> >> recommendation.
> >> I thought that the original question was where to store the data and
> >> how to give it to hadoop.
> >>
> >> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt
> >>  wrote:
> >>> Hi Tevfik,
> >>> one request to the recommender could become more than 1000 queries to
> the database depending on which recommender you use and the amount of
> preferences for the given user.
> >>>
> >>> The problem is not if you are using SQL, NoSQL, or any other query
> language. The problem is the latency of the answers.
> >>>
> >>> An average TCP packet in the same data center takes 500 µs. A main
> memory reference 0.1 µs. This means that your main memory of your Java
> process can be accessed 5000 times faster than any other process like a
> database connected via TCP/IP.
> >>>
> >>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
> >>>
> >>> Here you can see a screenshot that shows that database communication
> is by far (99%) the slowest component of a recommender request:
> >>>
> >>> https://source.apaxo.de/MahoutDatabaseLowPerformance.png
> >>>
> >>> If you do not want to cache your data in your Java process you can use
> a complete in memory database technology like SAP HANA
> http://www.saphana.com/welcome or EXASOL http://

Re: Which database should I use with Mahout

2013-05-19 Thread Pat Ferrel
Using a Hadoop version of a Mahout recommender will create some number of recs 
for all users as its output. Sean is talking about Myrrix I think which uses 
factorization to get much smaller models and so can calculate the recs at 
runtime for fairly large user sets.

However if you are using Mahout and Hadoop the question is how to store and 
lookup recommendations in the quickest scalable way. You will have a user ID 
and perhaps an item ID as a key to the list of recommendations. The fastest 
thing to do is have a hashmap in memory, perhaps read in from HDFS. Remember 
that Mahout will output the recommendations with internal Mahout IDs so you 
will have to replace these in the data with your actual user and item ids.

I use a NoSQL DB, either MongoDB or Cassandra but others are fine too, even 
MySQL if you can scale it to meet your needs. I end up with two tables, one has 
my user ID as a key and recommendations with my item IDs either ordered or with 
strengths. The second table has my item ID as the key with a list of similar 
items (again sorted or with strengths). At runtime I may have both a user ID 
and an item ID context so I get a list from both tables and combine them at 
runtime.
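
A rough sketch of that lookup-and-combine step; the Maps stand in for the two tables (user ID -> recs, item ID -> similar items, whether in MongoDB, Cassandra or MySQL), and the merge rule is illustrative rather than a prescription:

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Two keyed lookups at request time, merged into one list; duplicates are
// dropped and user-based recs win ties by coming first.
public class TwoTableRecs {
    private final Map<String, List<String>> recsByUser;       // table 1: user ID -> recs
    private final Map<String, List<String>> similarByItem;    // table 2: item ID -> similar items

    public TwoTableRecs(Map<String, List<String>> recsByUser,
                        Map<String, List<String>> similarByItem) {
        this.recsByUser = recsByUser;
        this.similarByItem = similarByItem;
    }

    public List<String> recommend(String userId, String contextItemId, int howMany) {
        Set<String> merged = new LinkedHashSet<String>();
        List<String> userRecs = recsByUser.get(userId);
        if (userRecs != null) merged.addAll(userRecs);
        List<String> similar = similarByItem.get(contextItemId);
        if (similar != null) merged.addAll(similar);
        List<String> result = new ArrayList<String>(merged);
        return result.size() > howMany ? result.subList(0, howMany) : result;
    }
}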

I use a DB for many reasons and let it handle the caching. I never need to 
worry about memory management. If you have scaled your DB properly the lookups 
will actually be executed like an in-memory hashmap with indexed keys for ids. 
Scaling the DB can be done as your user base grows when needed without 
affecting the rest of the calculation pipeline. Yes there will be overhead due 
to network traffic in a cluster but the flexibility is worth it for me. If high 
availability is important you can spread out your db cluster over multiple data 
centers without affecting the API for serving recommendations. I set up the 
recommendation calculation to run continuously in the background, replacing 
values in the two tables as fast as I can. This allows you to scale update 
speed (how many machines in the mahout/hadoop cluster) independently from 
lookup performance scaling (how many machines in your db cluster, how much 
memory do the db machines have).
 
On May 19, 2013, at 11:45 AM, Manuel Blechschmidt  
wrote:

Hi Tevfik,
I am working with mysql but I would guess that HDFS like Sean suggested would 
be a good idea as well.

There is also a project called sqoop which can be used to transfer data from 
relation databases to Hadoop.

http://sqoop.apache.org/

Scribe might be also an option for transferring a lot of data:
https://github.com/facebook/scribe#readme

I would suggest that you just start with the technology that you know best and 
then solve the problems as soon as you get them.

/Manuel

Am 19.05.2013 um 20:26 schrieb Sean Owen:

> I think everyone is agreeing that it is essential to only access
> information in memory at run-time, yes, whatever that info may be.
> I don't think the original question was about Hadoop, but, the answer
> is the same: Hadoop mappers are just reading the input serially. There
> is no advantage to a relational database or NoSQL database; they're
> just overkill. HDFS is sufficient, and probably even best of these at
> allowing fast serial access to the data.
> 
> On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin
>  wrote:
>> Hi Manuel,
>> But if one uses matrix factorization and stores the user and item
>> factors in memory then there will be no database access during
>> recommendation.
>> I thought that the original question was where to store the data and
>> how to give it to hadoop.
>> 
>> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt
>>  wrote:
>>> Hi Tevfik,
>>> one request to the recommender could become more than 1000 queries to the 
>>> database depending on which recommender you use and the amount of 
>>> preferences for the given user.
>>> 
>>> The problem is not if you are using SQL, NoSQL, or any other query 
>>> language. The problem is the latency of the answers.
>>> 
>>> An average TCP packet in the same data center takes 500 µs. A main memory 
>>> reference 0.1 µs. This means that your main memory of your Java process can 
>>> be accessed 5000 times faster than any other process like a database 
>>> connected via TCP/IP.
>>> 
>>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
>>> 
>>> Here you can see a screenshot that shows that database communication is by 
>>> far (99%) the slowest component of a recommender request:
>>> 
>>> https://source.apaxo.de/MahoutDatabaseLowPerformance.png
>>> 
>>> If you do not want to cache your data in your Java process you can use a 
>>> complete in memory database technology like SAP HANA 
>>> http://www.saphana.com/welcome or EXASOL http://www.exasol.com/
>>> 
>>> Nevertheless if you are using these you do not need Mahout anymore.
>>> 
>>> An architecture of a Mahout system can be seen here:
>>> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png
>>> 

Re: Which database should I use with Mahout

2013-05-19 Thread Manuel Blechschmidt
Hi Tevfik,
I am working with mysql but I would guess that HDFS like Sean suggested would 
be a good idea as well.

There is also a project called sqoop which can be used to transfer data from 
relation databases to Hadoop.

http://sqoop.apache.org/

Scribe might be also an option for transferring a lot of data:
https://github.com/facebook/scribe#readme

I would suggest that you just start with the technology that you know best and 
then solve the problems as soon as you get them.

/Manuel
 
Am 19.05.2013 um 20:26 schrieb Sean Owen:

> I think everyone is agreeing that it is essential to only access
> information in memory at run-time, yes, whatever that info may be.
> I don't think the original question was about Hadoop, but, the answer
> is the same: Hadoop mappers are just reading the input serially. There
> is no advantage to a relational database or NoSQL database; they're
> just overkill. HDFS is sufficient, and probably even best of these at
> allowing fast serial access to the data.
> 
> On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin
>  wrote:
>> Hi Manuel,
>> But if one uses matrix factorization and stores the user and item
>> factors in memory then there will be no database access during
>> recommendation.
>> I thought that the original question was where to store the data and
>> how to give it to hadoop.
>> 
>> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt
>>  wrote:
>>> Hi Tevfik,
>>> one request to the recommender could become more than 1000 queries to the 
>>> database depending on which recommender you use and the amount of 
>>> preferences for the given user.
>>> 
>>> The problem is not if you are using SQL, NoSQL, or any other query 
>>> language. The problem is the latency of the answers.
>>> 
>>> An average TCP packet in the same data center takes 500 µs. A main memory 
>>> reference 0.1 µs. This means that your main memory of your Java process can 
>>> be accessed 5000 times faster than any other process like a database 
>>> connected via TCP/IP.
>>> 
>>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
>>> 
>>> Here you can see a screenshot that shows that database communication is by 
>>> far (99%) the slowest component of a recommender request:
>>> 
>>> https://source.apaxo.de/MahoutDatabaseLowPerformance.png
>>> 
>>> If you do not want to cache your data in your Java process you can use a 
>>> complete in memory database technology like SAP HANA 
>>> http://www.saphana.com/welcome or EXASOL http://www.exasol.com/
>>> 
>>> Nevertheless if you are using these you do not need Mahout anymore.
>>> 
>>> An architecture of a Mahout system can be seen here:
>>> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png
>>> 
>>> Hope that helps
>>> Manuel
>>> 
>>> Am 19.05.2013 um 19:20 schrieb Sean Owen:
>>> 
 I'm first saying that you really don't want to use the database as a
 data model directly. It is far too slow.
 Instead you want to use a data model implementation that reads all of
 the data, once, serially, into memory. And in that case, it makes no
 difference where the data is being read from, because it is read just
 once, serially. A file is just as fine as a fancy database. In fact
 it's probably easier and faster.
 
 On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin
  wrote:
> Thanks Sean, but I could not get your answer. Can you please explain it 
> again?
> 
> 
> On Sun, May 19, 2013 at 8:00 PM, Sean Owen  wrote:
>> It doesn't matter, in the sense that it is never going to be fast
>> enough for real-time at any reasonable scale if actually run off a
>> database directly. One operation results in thousands of queries. It's
>> going to read data into memory anyway and cache it there. So, whatever
>> is easiest for you. The simplest solution is a file.
>> 
>> On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz
>>  wrote:
>>> Hi,
>>> I would like to use Mahout to make recommendations on my web site. 
>>> Since the data is going to be big, hopefully, I plan to use hadoop 
>>> implementations of the recommender algorithms.
>>> 
>>> I'm currently storing the data in mysql. Should I continue with it or 
>>> should I switch to a nosql database such as mongodb or something else?
>>> 
>>> Thanks
>>> Ahmet
>>> 
>>> --
>>> Manuel Blechschmidt
>>> M.Sc. IT Systems Engineering
>>> Dortustr. 57
>>> 14467 Potsdam
>>> Mobil: 0173/6322621
>>> Twitter: http://twitter.com/Manuel_B
>>> 

-- 
Manuel Blechschmidt
M.Sc. IT Systems Engineering
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B



Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
(Oh, by the way, I realize the original question was about Hadoop. I
can't read carefully.)

No, HDFS is not good for anything like random access. For input,
that's OK, because you don't need random access. So HDFS is just fine.
For output, if you are going to then serve these precomputed results
at run-time, they need to be in a container appropriate for quick
random access. There, a NoSQL store like HBase or something does sound
appropriate. You can create an output format that writes directly into
it, with a little work.
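
A hedged sketch of that "little work" for HBase: a TableReducer wired up with TableMapReduceUtil.initTableReducerJob. The table name, column family and value layout are assumptions, and the Put.add call matches the HBase 0.94-era API.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

// Writes each user's precomputed rec list straight into an HBase row so it
// can be fetched by key at serving time. Register it on the job with
// TableMapReduceUtil.initTableReducerJob("recs", RecsToHBaseReducer.class, job).
public class RecsToHBaseReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text userId, Iterable<Text> recLists, Context ctx)
            throws IOException, InterruptedException {
        Put put = new Put(Bytes.toBytes(userId.toString()));
        for (Text recs : recLists) {
            put.add(Bytes.toBytes("r"), Bytes.toBytes("items"), Bytes.toBytes(recs.toString()));
        }
        ctx.write(new ImmutableBytesWritable(put.getRow()), put);
    }
}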

The drawbacks to this approach -- computing results in Hadoop -- are
that they are inevitably a bit stale, not real-time, and you have to
compute results for everyone, even though very few of those results
will be used. Of course, serving is easy and fast. There are hybrid
solutions that I can talk to you about offline that get a bit of the
best of both worlds.


On Sun, May 19, 2013 at 11:37 AM, Ahmet Ylmaz
 wrote:
> Hi Sean,
> If I understood you correctly you are saying that I will not need mysql. But 
> if I store my data on HDFS will I be able to make fast queries such as
> "Return all the ratings of a specific user"
> which will be needed for showing the past ratings of a user.
>
> Ahmet
>
>
> 
>  From: Sean Owen 
> To: Mahout User List 
> Sent: Sunday, May 19, 2013 9:26 PM
> Subject: Re: Which database should I use with Mahout
>
>
> I think everyone is agreeing that it is essential to only access
> information in memory at run-time, yes, whatever that info may be.
> I don't think the original question was about Hadoop, but, the answer
> is the same: Hadoop mappers are just reading the input serially. There
> is no advantage to a relational database or NoSQL database; they're
> just overkill. HDFS is sufficient, and probably even best of these at
> allowing fast serial access to the data.
>
> On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin
>  wrote:
>> Hi Manuel,
>> But if one uses matrix factorization and stores the user and item
>> factors in memory then there will be no database access during
>> recommendation.
>> I thought that the original question was where to store the data and
>> how to give it to hadoop.
>>
>> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt
>>  wrote:
>>> Hi Tevfik,
> >>> one request to the recommender could become more than 1000 queries to
>>> database depending on which recommender you use and the amount of 
>>> preferences for the given user.
>>>
>>> The problem is not if you are using SQL, NoSQL, or any other query 
>>> language. The problem is the latency of the answers.
>>>
> >>> An average TCP packet in the same data center takes 500 µs. A main memory 
> >>> reference 0.1 µs. This means that your main memory of your Java process can 
> >>> be accessed 5000 times faster than any other process like a database 
>>> connected via TCP/IP.
>>>
>>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
>>>
>>> Here you can see a screenshot that shows that database communication is by 
>>> far (99%) the slowest component of a recommender request:
>>>
>>> https://source.apaxo.de/MahoutDatabaseLowPerformance.png
>>>
>>> If you do not want to cache your data in your Java process you can use a 
>>> complete in memory database technology like SAP HANA 
>>> http://www.saphana.com/welcome or EXASOL http://www.exasol.com/
>>>
>>> Nevertheless if you are using these you do not need Mahout anymore.
>>>
>>> An architecture of a Mahout system can be seen here:
>>> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png
>>>
>>> Hope that helps
>>> Manuel
>>>
>>> Am 19.05.2013 um 19:20 schrieb Sean Owen:
>>>
>>>> I'm first saying that you really don't want to use the database as a
>>>> data model directly. It is far too slow.
>>>> Instead you want to use a data model implementation that reads all of
>>>> the data, once, serially, into memory. And in that case, it makes no
>>>> difference where the data is being read from, because it is read just
>>>> once, serially. A file is just as fine as a fancy database. In fact
>>>> it's probably easier and faster.
>>>>
>>>> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin
>>>>  wrote:
>>>>> Thanks Sean, but I could not get your answer. Can you please explain it 
>>>>> again?
>>>>>

Re: Which database should I use with Mahout

2013-05-19 Thread Ahmet Ylmaz
Hi Sean,
If I understood you correctly you are saying that I will not need mysql. But if 
I store my data on HDFS will I be able to make fast queries such as
"Return all the ratings of a specific user" 
which will be needed for showing the past ratings of a user.

Ahmet 



 From: Sean Owen 
To: Mahout User List  
Sent: Sunday, May 19, 2013 9:26 PM
Subject: Re: Which database should I use with Mahout
 

I think everyone is agreeing that it is essential to only access
information in memory at run-time, yes, whatever that info may be.
I don't think the original question was about Hadoop, but, the answer
is the same: Hadoop mappers are just reading the input serially. There
is no advantage to a relational database or NoSQL database; they're
just overkill. HDFS is sufficient, and probably even best of these at
allowing fast serial access to the data.

On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin
 wrote:
> Hi Manuel,
> But if one uses matrix factorization and stores the user and item
> factors in memory then there will be no database access during
> recommendation.
> I thought that the original question was where to store the data and
> how to give it to hadoop.
>
> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt
>  wrote:
>> Hi Tevfik,
>> one request to the recommender could become more than 1000 queries to the 
>> database depending on which recommender you use and the amount of 
>> preferences for the given user.
>>
>> The problem is not if you are using SQL, NoSQL, or any other query language. 
>> The problem is the latency of the answers.
>>
>> An average TCP packet in the same data center takes 500 µs. A main memory 
>> reference 0.1 µs. This means that your main memory of your Java process can 
>> be accessed 5000 times faster than any other process like a database 
>> connected via TCP/IP.
>>
>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
>>
>> Here you can see a screenshot that shows that database communication is by 
>> far (99%) the slowest component of a recommender request:
>>
>> https://source.apaxo.de/MahoutDatabaseLowPerformance.png
>>
>> If you do not want to cache your data in your Java process you can use a 
>> complete in memory database technology like SAP HANA 
>> http://www.saphana.com/welcome or EXASOL http://www.exasol.com/
>>
>> Nevertheless if you are using these you do not need Mahout anymore.
>>
>> An architecture of a Mahout system can be seen here:
>> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png
>>
>> Hope that helps
>>     Manuel
>>
>> Am 19.05.2013 um 19:20 schrieb Sean Owen:
>>
>>> I'm first saying that you really don't want to use the database as a
>>> data model directly. It is far too slow.
>>> Instead you want to use a data model implementation that reads all of
>>> the data, once, serially, into memory. And in that case, it makes no
>>> difference where the data is being read from, because it is read just
>>> once, serially. A file is just as fine as a fancy database. In fact
>>> it's probably easier and faster.
>>>
>>> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin
>>>  wrote:
>>>> Thanks Sean, but I could not get your answer. Can you please explain it 
>>>> again?
>>>>
>>>>
>>>> On Sun, May 19, 2013 at 8:00 PM, Sean Owen  wrote:
>>>>> It doesn't matter, in the sense that it is never going to be fast
>>>>> enough for real-time at any reasonable scale if actually run off a
>>>>> database directly. One operation results in thousands of queries. It's
>>>>> going to read data into memory anyway and cache it there. So, whatever
>>>>> is easiest for you. The simplest solution is a file.
>>>>>
>>>>> On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz
>>>>>  wrote:
>>>>>> Hi,
>>>>>> I would like to use Mahout to make recommendations on my web site. Since 
>>>>>> the data is going to be big, hopefully, I plan to use hadoop 
>>>>>> implementations of the recommender algorithms.
>>>>>>
>>>>>> I'm currently storing the data in mysql. Should I continue with it or 
>>>>>> should I switch to a nosql database such as mongodb or something else?
>>>>>>
>>>>>> Thanks
>>>>>> Ahmet
>>
>> --
>> Manuel Blechschmidt
>> M.Sc. IT Systems Engineering
>> Dortustr. 57
>> 14467 Potsdam
>> Mobil: 0173/6322621
>> Twitter: http://twitter.com/Manuel_B
>>

Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
I think everyone is agreeing that it is essential to only access
information in memory at run-time, yes, whatever that info may be.
I don't think the original question was about Hadoop, but, the answer
is the same: Hadoop mappers are just reading the input serially. There
is no advantage to a relational database or NoSQL database; they're
just overkill. HDFS is sufficient, and probably even best of these at
allowing fast serial access to the data.

On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin
 wrote:
> Hi Manuel,
> But if one uses matrix factorization and stores the user and item
> factors in memory then there will be no database access during
> recommendation.
> I thought that the original question was where to store the data and
> how to give it to hadoop.
>
> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt
>  wrote:
>> Hi Tevfik,
>> one request to the recommender could become more than 1000 queries to the 
>> database depending on which recommender you use and the amount of 
>> preferences for the given user.
>>
>> The problem is not if you are using SQL, NoSQL, or any other query language. 
>> The problem is the latency of the answers.
>>
>> An average TCP packet in the same data center takes 500 µs. A main memory 
>> reference 0.1 µs. This means that your main memory of your Java process can 
>> be accessed 5000 times faster than any other process like a database 
>> connected via TCP/IP.
>>
>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
>>
>> Here you can see a screenshot that shows that database communication is by 
>> far (99%) the slowest component of a recommender request:
>>
>> https://source.apaxo.de/MahoutDatabaseLowPerformance.png
>>
>> If you do not want to cache your data in your Java process you can use a 
>> complete in memory database technology like SAP HANA 
>> http://www.saphana.com/welcome or EXASOL http://www.exasol.com/
>>
>> Nevertheless if you are using these you do not need Mahout anymore.
>>
>> An architecture of a Mahout system can be seen here:
>> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png
>>
>> Hope that helps
>> Manuel
>>
>> Am 19.05.2013 um 19:20 schrieb Sean Owen:
>>
>>> I'm first saying that you really don't want to use the database as a
>>> data model directly. It is far too slow.
>>> Instead you want to use a data model implementation that reads all of
>>> the data, once, serially, into memory. And in that case, it makes no
>>> difference where the data is being read from, because it is read just
>>> once, serially. A file is just as fine as a fancy database. In fact
>>> it's probably easier and faster.
>>>
>>> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin
>>>  wrote:
 Thanks Sean, but I could not get your answer. Can you please explain it 
 again?


 On Sun, May 19, 2013 at 8:00 PM, Sean Owen  wrote:
> It doesn't matter, in the sense that it is never going to be fast
> enough for real-time at any reasonable scale if actually run off a
> database directly. One operation results in thousands of queries. It's
> going to read data into memory anyway and cache it there. So, whatever
> is easiest for you. The simplest solution is a file.
>
> On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz
>  wrote:
>> Hi,
>> I would like to use Mahout to make recommendations on my web site. Since 
>> the data is going to be big, hopefully, I plan to use hadoop 
>> implementations of the recommender algorithms.
>>
>> I'm currently storing the data in mysql. Should I continue with it or 
>> should I switch to a nosql database such as mongodb or something else?
>>
>> Thanks
>> Ahmet
>>
>> --
>> Manuel Blechschmidt
>> M.Sc. IT Systems Engineering
>> Dortustr. 57
>> 14467 Potsdam
>> Mobil: 0173/6322621
>> Twitter: http://twitter.com/Manuel_B
>>


Re: Which database should I use with Mahout

2013-05-19 Thread Tevfik Aytekin
Hi Manuel,
But if one uses matrix factorization and stores the user and item
factors in memory then there will be no database access during
recommendation.
I thought that the original question was where to store the data and
how to give it to hadoop.
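
A minimal sketch of that serving path, assuming the user and item factor matrices were learned offline (by ALS or similar) and simply loaded into arrays:

// Scoring is just a dot product over in-memory factors; no database is
// touched per request. The factor matrices are assumed to be loaded already.
public class FactorScorer {
    private final double[][] userFactors;   // numUsers x k
    private final double[][] itemFactors;   // numItems x k

    public FactorScorer(double[][] userFactors, double[][] itemFactors) {
        this.userFactors = userFactors;
        this.itemFactors = itemFactors;
    }

    public double score(int user, int item) {
        double s = 0.0;
        for (int f = 0; f < userFactors[user].length; f++) {
            s += userFactors[user][f] * itemFactors[item][f];
        }
        return s;
    }
}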

On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt
 wrote:
> Hi Tevfik,
> one request to the recommender could become more than 1000 queries to the 
> database depending on which recommender you use and the amount of preferences 
> for the given user.
>
> The problem is not if you are using SQL, NoSQL, or any other query language. 
> The problem is the latency of the answers.
>
> An average TCP packet in the same data center takes 500 µs. A main memory 
> reference 0.1 µs. This means that your main memory of your Java process can 
> be accessed 5000 times faster than any other process like a database 
> connected via TCP/IP.
>
> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
>
> Here you can see a screenshot that shows that database communication is by 
> far (99%) the slowest component of a recommender request:
>
> https://source.apaxo.de/MahoutDatabaseLowPerformance.png
>
> If you do not want to cache your data in your Java process you can use a 
> complete in memory database technology like SAP HANA 
> http://www.saphana.com/welcome or EXASOL http://www.exasol.com/
>
> Nevertheless if you are using these you do not need Mahout anymore.
>
> An architecture of a Mahout system can be seen here:
> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png
>
> Hope that helps
> Manuel
>
> Am 19.05.2013 um 19:20 schrieb Sean Owen:
>
>> I'm first saying that you really don't want to use the database as a
>> data model directly. It is far too slow.
>> Instead you want to use a data model implementation that reads all of
>> the data, once, serially, into memory. And in that case, it makes no
>> difference where the data is being read from, because it is read just
>> once, serially. A file is just as fine as a fancy database. In fact
>> it's probably easier and faster.
>>
>> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin
>>  wrote:
>>> Thanks Sean, but I could not get your answer. Can you please explain it 
>>> again?
>>>
>>>
>>> On Sun, May 19, 2013 at 8:00 PM, Sean Owen  wrote:
 It doesn't matter, in the sense that it is never going to be fast
 enough for real-time at any reasonable scale if actually run off a
 database directly. One operation results in thousands of queries. It's
 going to read data into memory anyway and cache it there. So, whatever
 is easiest for you. The simplest solution is a file.

 On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz
  wrote:
> Hi,
> I would like to use Mahout to make recommendations on my web site. Since 
> the data is going to be big, hopefully, I plan to use hadoop 
> implementations of the recommender algorithms.
>
> I'm currently storing the data in mysql. Should I continue with it or 
> should I switch to a nosql database such as mongodb or something else?
>
> Thanks
> Ahmet
>
> --
> Manuel Blechschmidt
> M.Sc. IT Systems Engineering
> Dortustr. 57
> 14467 Potsdam
> Mobil: 0173/6322621
> Twitter: http://twitter.com/Manuel_B
>


Re: Which database should I use with Mahout

2013-05-19 Thread Manuel Blechschmidt
Hi Tevfik,
one request to the recommender could become more than 1000 queries to the 
database depending on which recommender you use and the amount of preferences 
for the given user.

The problem is not if you are using SQL, NoSQL, or any other query language. 
The problem is the latency of the answers.

An average TCP packet in the same data center takes 500 µs. A main memory 
reference 0.1 µs. This means that your main memory of your Java process can be 
accessed 5000 times faster than any other process like a database connected via 
TCP/IP.

http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html

Here you can see a screenshot that shows that database communication is by far 
(99%) the slowest component of a recommender request:

https://source.apaxo.de/MahoutDatabaseLowPerformance.png

If you do not want to cache your data in your Java process you can use a 
complete in memory database technology like SAP HANA 
http://www.saphana.com/welcome or EXASOL http://www.exasol.com/

Nevertheless if you are using these you do not need Mahout anymore.

An architecture of a Mahout system can be seen here:
https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png

Hope that helps
Manuel

Am 19.05.2013 um 19:20 schrieb Sean Owen:

> I'm first saying that you really don't want to use the database as a
> data model directly. It is far too slow.
> Instead you want to use a data model implementation that reads all of
> the data, once, serially, into memory. And in that case, it makes no
> difference where the data is being read from, because it is read just
> once, serially. A file is just as fine as a fancy database. In fact
> it's probably easier and faster.
> 
> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin
>  wrote:
>> Thanks Sean, but I could not get your answer. Can you please explain it 
>> again?
>> 
>> 
>> On Sun, May 19, 2013 at 8:00 PM, Sean Owen  wrote:
>>> It doesn't matter, in the sense that it is never going to be fast
>>> enough for real-time at any reasonable scale if actually run off a
>>> database directly. One operation results in thousands of queries. It's
>>> going to read data into memory anyway and cache it there. So, whatever
>>> is easiest for you. The simplest solution is a file.
>>> 
>>> On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz
>>>  wrote:
 Hi,
 I would like to use Mahout to make recommendations on my web site. Since 
 the data is going to be big, hopefully, I plan to use hadoop 
 implementations of the recommender algorithms.
 
 I'm currently storing the data in mysql. Should I continue with it or 
 should I switch to a nosql database such as mongodb or something else?
 
 Thanks
 Ahmet

-- 
Manuel Blechschmidt
M.Sc. IT Systems Engineering
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B



Re: Which database should I use with Mahout

2013-05-19 Thread Tevfik Aytekin
ok, got it, thanks.

On Sun, May 19, 2013 at 8:20 PM, Sean Owen  wrote:
> I'm first saying that you really don't want to use the database as a
> data model directly. It is far too slow.
> Instead you want to use a data model implementation that reads all of
> the data, once, serially, into memory. And in that case, it makes no
> difference where the data is being read from, because it is read just
> once, serially. A file is just as fine as a fancy database. In fact
> it's probably easier and faster.
>
> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin
>  wrote:
>> Thanks Sean, but I could not get your answer. Can you please explain it 
>> again?
>>
>>
>> On Sun, May 19, 2013 at 8:00 PM, Sean Owen  wrote:
>>> It doesn't matter, in the sense that it is never going to be fast
>>> enough for real-time at any reasonable scale if actually run off a
>>> database directly. One operation results in thousands of queries. It's
>>> going to read data into memory anyway and cache it there. So, whatever
>>> is easiest for you. The simplest solution is a file.
>>>
>>> On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz
>>>  wrote:
 Hi,
 I would like to use Mahout to make recommendations on my web site. Since 
 the data is going to be big, hopefully, I plan to use hadoop 
 implementations of the recommender algorithms.

 I'm currently storing the data in mysql. Should I continue with it or 
 should I switch to a nosql database such as mongodb or something else?

 Thanks
 Ahmet


Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
I'm first saying that you really don't want to use the database as a
data model directly. It is far too slow.
Instead you want to use a data model implementation that reads all of
the data, once, serially, into memory. And in that case, it makes no
difference where the data is being read from, because it is read just
once, serially. A file is just as fine as a fancy database. In fact
it's probably easier and faster.
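
For example, a minimal sketch of that pattern with the Taste classes (the file 
name, neighborhood size and user id below are made up):

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class FileBasedRecommenderExample {
  public static void main(String[] args) throws Exception {
    // One serial read of the whole file; afterwards everything is in memory.
    DataModel model = new FileDataModel(new File("preferences.csv"));

    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(25, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 10 recommendations for (hypothetical) user 123, answered from memory.
    for (RecommendedItem item : recommender.recommend(123L, 10)) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}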

On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin
 wrote:
> Thanks Sean, but I could not get your answer. Can you please explain it again?
>
>
> On Sun, May 19, 2013 at 8:00 PM, Sean Owen  wrote:
>> It doesn't matter, in the sense that it is never going to be fast
>> enough for real-time at any reasonable scale if actually run off a
>> database directly. One operation results in thousands of queries. It's
>> going to read data into memory anyway and cache it there. So, whatever
>> is easiest for you. The simplest solution is a file.
>>
>> On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz
>>  wrote:
>>> Hi,
>>> I would like to use Mahout to make recommendations on my web site. Since 
>>> the data is going to be big, hopefully, I plan to use hadoop 
>>> implementations of the recommender algorithms.
>>>
>>> I'm currently storing the data in mysql. Should I continue with it or 
>>> should I switch to a nosql database such as mongodb or something else?
>>>
>>> Thanks
>>> Ahmet


Re: Which database should I use with Mahout

2013-05-19 Thread Tevfik Aytekin
Thanks Sean, but I could not get your answer. Can you please explain it again?


On Sun, May 19, 2013 at 8:00 PM, Sean Owen  wrote:
> It doesn't matter, in the sense that it is never going to be fast
> enough for real-time at any reasonable scale if actually run off a
> database directly. One operation results in thousands of queries. It's
> going to read data into memory anyway and cache it there. So, whatever
> is easiest for you. The simplest solution is a file.
>
> On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz
>  wrote:
>> Hi,
>> I would like to use Mahout to make recommendations on my web site. Since the 
>> data is going to be big, hopefully, I plan to use hadoop implementations of 
>> the recommender algorithms.
>>
>> I'm currently storing the data in mysql. Should I continue with it or should 
>> I switch to a nosql database such as mongodb or something else?
>>
>> Thanks
>> Ahmet


Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
It doesn't matter, in the sense that it is never going to be fast
enough for real-time at any reasonable scale if actually run off a
database directly. One operation results in thousands of queries. It's
going to read data into memory anyway and cache it there. So, whatever
is easiest for you. The simplest solution is a file.
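
For reference, the flat file that Mahout's FileDataModel reads is just one 
comma-separated userID,itemID,preference datum per line (an optional timestamp 
can be appended as a fourth field); the values below are made up:

1,101,4.5
1,102,3.0
2,101,2.0
2,103,5.0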

On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz
 wrote:
> Hi,
> I would like to use Mahout to make recommendations on my web site. Since the 
> data is going to be big, hopefully, I plan to use hadoop implementations of 
> the recommender algorithms.
>
> I'm currently storing the data in mysql. Should I continue with it or should 
> I switch to a nosql database such as mongodb or something else?
>
> Thanks
> Ahmet