Co-occurrence matrices should be fairly easy to partition over many
machines, so you would not be constrained by the memory available on a
single machine.
On 06.05.2015 18:29, Pat Ferrel wrote:
100GB of RAM is practically common. Recently I’ve seen many indicators and item
metadata stored with cooccurrence and indexed. This produces extremely flexible
results since the query determines the result, not the model. But it does
increase the number of cooccurrences linearly with the # of indicators.
Sounds about right.
My guess is that memory is now large enough, especially on a cluster, that the
cooccurrence matrix will fit into memory quite often. Taking a large example
of 10 million items and 10,000 cooccurrences each, there will be 100
billion cooccurrences to store, which shouldn't take more than
Ok, seems right.
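For a rough sense of scale, here is my own back-of-envelope on that example,
assuming roughly 12 bytes per stored cooccurrence (an int item id plus a double
LLR weight); the byte count is an assumption, not something from the thread:

  // Back-of-envelope memory estimate for 10M items x 10k cooccurrences each.
  val items = 1e7                      // 10 million items
  val cooccurrencesPerItem = 1e4       // 10,000 cooccurrences each
  val bytesPerEntry = 12.0             // assumed: 4-byte item id + 8-byte weight
  val totalEntries = items * cooccurrencesPerItem    // 1e11 = 100 billion
  val totalGB = totalEntries * bytesPerEntry / 1e9   // ~1200 GB
  val nodesAt100GB = math.ceil(totalGB / 100.0)      // ~12 machines

So on the order of a terabyte, which fits in memory across a modest cluster of
100GB nodes rather than on a single machine.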
So now to data structures. The input frequency vectors need to be paired with
each input interaction type, and it would be nice to have them as something that
can be copied very fast as they get updated. Random access would also be nice
but iteration is not needed. Over time they will grow
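One shape that fits those requirements, just a sketch with names and types of my
own choosing: Scala's immutable HashMap gives near-constant random access, and
because immutable maps share structure, publishing an updated snapshot is
effectively a fast copy.

  import scala.collection.immutable.HashMap

  // Hypothetical: one frequency vector per interaction type,
  // mapping itemId -> interaction count.
  type FreqVector = HashMap[Int, Long]

  def bump(freqs: FreqVector, itemId: Int): FreqVector =
    freqs.updated(itemId, freqs.getOrElse(itemId, 0L) + 1L)

  // Updating produces a cheap new snapshot via structural sharing;
  // readers can keep using the old snapshot while it is replaced.
  var byInteractionType: Map[String, FreqVector] =
    Map("purchase" -> HashMap.empty[Int, Long], "view" -> HashMap.empty[Int, Long])

  def record(interactionType: String, itemId: Int): Unit =
    byInteractionType = byInteractionType.updated(
      interactionType,
      bump(byInteractionType(interactionType), itemId))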
On Thu, Apr 23, 2015 at 8:53 AM, Pat Ferrel wrote:
> This seems to violate the random choice of interactions to cut, but now
> that I think about it, does a random choice really matter?
>
It hasn't ever mattered as far as I could see. There is also some reason
to claim that earliest is best if it
Randomizing interaction down-sampling is probably more important on the starting
batch, since there it is done on the entire input row or column; it is not so
important when a cut-off is already reached. All new interactions (new items for
instance) would not have reached the cut anyway, which is important since
Removal is not as important as adding (which can be done). Also, removal is
often for business logic, like removal from a catalog, so a refresh may be
driven by non-math considerations. Removal of users is only to clean things up
and is not required very often. Removal of items can happen from recs too
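To make the incremental case concrete, a keep-the-earliest cut on one user's
history could look like this; a sketch only, with maxPrefs = 500 matching the
default quoted elsewhere in the thread:

  // Hypothetical keep-oldest down-sampling of a single user's interactions.
  // Once the cut-off is reached, newly arriving interactions are dropped;
  // rows for new items or users have not reached the cut anyway.
  val maxPrefs = 500

  def addInteraction(history: Vector[Int], newItemId: Int): Vector[Int] =
    if (history.size >= maxPrefs) history   // cut already reached: toss the new sample
    else history :+ newItemId               // under the cut: keep it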
On Wed, Apr 22, 2015 at 8:07 PM, Pat Ferrel wrote:
> I think we have been talking about an idea that does an incremental
> approximation, then a refresh every so often to remove any approximation so
> in an ideal world we need both.
Actually, the method I was pushing is exact. If the sampling
Currently maxPrefs is applied to the input, both rows and columns (in the Hadoop
and Scala versions), and has a default of 500. maxSimilaritiesPerItem is for the
cooccurrence matrix and is applied to its rows; the default is 50. Similar
down-sampling is done in row similarity.
For a new way to use threshold I was thinking
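A sketch of how the input-side knob applies; illustrative only, the real logic
lives in Mahout's SimilarityAnalysis and the Hadoop jobs:

  val maxPrefs = 500                 // applied to input, both rows and columns
  val maxSimilaritiesPerItem = 50    // the analogous cut on cooccurrence rows

  // Down-sample one input row (a user's items) or column (an item's users)
  // to at most maxPrefs entries before the multiply; maxSimilaritiesPerItem
  // is applied the same way to each row of the resulting cooccurrence matrix.
  def downsampleInput(interactions: Seq[Int]): Seq[Int] =
    interactions.take(maxPrefs)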
Inline
On Sun, Apr 19, 2015 at 11:05 AM, Pat Ferrel wrote:
> Short answer, you are correct this is not a new filter.
>
> The Hadoop MapReduce implements:
> * maxSimilaritiesPerItem
> * maxPrefs
> * minPrefsPerUser
> * threshold
>
> Scala version:
> * maxSimilaritiesPerItem
>
I think of this as
Short answer, you are correct this is not a new filter.
The Hadoop MapReduce implements:
* maxSimilaritiesPerItem
* maxPrefs
* minPrefsPerUser
* threshold
Scala version:
* maxSimilaritiesPerItem
* maxPrefs
The paper talks about an interaction-cut, and describes it with "There is no
significant
I just answered in another thread.
Yes, I am. I just didn't think I was proposing it. Thought it was in
Sebastian's paper and ultimately in our code (that I haven't looked at in
over a year).
On Sat, Apr 18, 2015 at 7:38 PM, Pat Ferrel wrote:
> Hey Ted,
>
> You seem to be proposing a new cut
On Sat, Apr 18, 2015 at 11:29 AM, Pat Ferrel wrote:
> If so you would need something like 4 things with random access/in-memory
> speed
> 1&2) frequency vectors for interactions per user and item, these may be
> updated and are used to calculate LLR and also for cutting new update
> interactions
Hey Ted,
You seem to be proposing a new cut by frequency of item interaction, is this
correct? It is performed before the multiply, right?
On Apr 18, 2015, at 8:29 AM, Pat Ferrel wrote:
There are two cuts applied in the batch calc:
1) number of interactions per user
2) number of items in the resulting cooccurrence vectors
On Sat, Apr 18, 2015 at 11:29 AM, Pat Ferrel wrote:
> You seem to be proposing a new cut by frequency of item interaction, is
> this correct? This is because the frequency is known before the multiply
> and LLR. I assume the #2 cut is left in place?
>
Yes, but I didn't think it was new.
Cool
On Saturday, April 18, 2015, Ted Dunning wrote:
>
> Andrew
>
> Take a look at the slides I posted. In them I showed that the update does
> not grow beyond a very reasonable bound.
>
> Sent from my iPhone
>
> > On Apr 18, 2015, at 9:15, Andrew Musselman wrote:
> >
> > Yes that's what I mean
Andrew
Take a look at the slides I posted. In them I showed that the update does not
grow beyond a very reasonable bound.
Sent from my iPhone
> On Apr 18, 2015, at 9:15, Andrew Musselman wrote:
>
> Yes that's what I mean; if the number of updates gets too big it probably
> would be unmanageable though.
Yes that's what I mean; if the number of updates gets too big it probably
would be unmanageable though. This approach worked well with daily
updates, but never tried it with anything "real time."
On Saturday, April 18, 2015, Pat Ferrel wrote:
> I think you are saying that instead of val newHashMap
I think you are saying that instead of val newHashMap = lastHashMap ++
updateHashMap, layered updates might be useful since new and last are
potentially large. Some limit of updates might trigger a refresh. This might
work if the update works with incremental index updates in the search engine.
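A minimal sketch of that trigger idea; the names and the limit are made up:

  // Instead of eagerly building lastHashMap ++ updateHashMap on every update,
  // stack updates as thin layers and collapse them when a limit is hit.
  var lastHashMap: Map[Int, Double] = Map.empty
  var pendingLayers: List[Map[Int, Double]] = Nil    // newest first
  val refreshAfter = 10                              // made-up refresh trigger

  def applyUpdate(updateHashMap: Map[Int, Double]): Unit = {
    pendingLayers = updateHashMap :: pendingLayers
    if (pendingLayers.size >= refreshAfter) {
      // Collapse oldest-to-newest so the newest values win; this is the
      // repeated `lastHashMap ++ updateHashMap`, paid only once per refresh.
      lastHashMap = pendingLayers.reverse.foldLeft(lastHashMap)(_ ++ _)
      pendingLayers = Nil
    }
  }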
There are two cuts applied in the batch calc:
1) number of interactions per user
2) number of items in the resulting cooccurrence vectors (calc LLR, sort,
lowest items cut per limit)
You seem to be proposing a new cut by frequency of item interaction, is this
correct? This is because the frequency is known before the multiply and LLR.
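For reference, cut #2 is the score-sort-and-cut step; here is a minimal sketch
using the standard LLR formula (along the lines of Mahout's LogLikelihood),
where k11/k12/k21/k22 are the usual 2x2 contingency counts for an item pair:

  def xLogX(x: Long): Double = if (x == 0L) 0.0 else x * math.log(x.toDouble)

  def entropy(counts: Long*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  // k11 = users with both items, k12/k21 = one item only, k22 = neither.
  def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy = entropy(k11 + k12, k21 + k22)
    val colEntropy = entropy(k11 + k21, k12 + k22)
    val matEntropy = entropy(k11, k12, k21, k22)
    math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy))
  }

  // Cut #2: keep only the top-scoring cooccurrences in each item's row.
  def cutRow(scoredRow: Seq[(Int, Double)], maxSimilaritiesPerItem: Int = 50) =
    scoredRow.sortBy { case (_, score) => -score }.take(maxSimilaritiesPerItem)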
I have not implemented it for recommendations but a layered cache/sieve
structure could be useful.
That is, between batch refreshes you can keep tacking on new updates in
cascading order, so values that are updated exist in the newest layer but
otherwise the lookup goes to the latest updated layer
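Roughly what I picture for the cascading lookup; nothing more than a sketch:

  // Layers ordered newest-first; the last layer is the latest full batch refresh.
  // A value present in a newer layer shadows the same key in older layers.
  class LayeredCache[K, V](batch: Map[K, V]) {
    private var layers: List[Map[K, V]] = List(batch)

    def addLayer(update: Map[K, V]): Unit = layers = update :: layers

    def get(key: K): Option[V] =
      layers.collectFirst { case layer if layer.contains(key) => layer(key) }

    // A batch refresh collapses everything back to a single fresh layer.
    def refresh(newBatch: Map[K, V]): Unit = layers = List(newBatch)
  }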
Yes. Also add the fact that the nano batches are bounded tightly in size both
max and mean. And mostly filtered away anyway.
Aging is an open question. I have never seen any effect of alternative sampling
so I would just assume "keep oldest" which just tosses more samples. Then
occasionally r
Thanks.
This idea is based on a micro-batch of interactions per update, not individual
ones, unless I missed something. That matches the typical input flow. Most
interactions are filtered away by the frequency and number-of-interactions cuts.
A couple of practical issues:
In practice won’t this require
When I think of real-time adaptation of indicators, I think of this:
http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime
On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel wrote:
> I’ve been thinking about Streaming (continuous input) and incremental cooccurrence.
I’ve been thinking about Streaming (continuous input) and incremental
cooccurrence.
As interactions stream in from the user it is fairly simple to use something
like Spark Streaming to maintain a moving time window for all input, and an
update frequency that recalcs all input currently in the time window
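Roughly what that might look like with Spark Streaming; a sketch only, where the
input source, window sizes, and the recalc step are placeholders:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("incremental-cooccurrence")
  val ssc = new StreamingContext(conf, Seconds(30))        // micro-batch interval

  // Placeholder source: lines of "userId,itemId,interactionType".
  val interactions = ssc.socketTextStream("localhost", 9999)
    .map(_.split(","))
    .map { case Array(user, item, kind) => (user, item, kind) }

  // Moving time window over all input, recalculated on an update frequency.
  interactions
    .window(Minutes(60), Minutes(5))   // window length, slide (update) interval
    .foreachRDD { windowed =>
      // Placeholder: recompute cooccurrence from whatever is in the window,
      // e.g. by building an IndexedDataset and running Mahout's SimilarityAnalysis.
      ()
    }

  ssc.start()
  ssc.awaitTermination()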