Re: Streaming and incremental cooccurrence

2015-05-09 Thread Sebastian
Co-occurrence matrices should be fairly easy to partition over many machines, so you would not be constrained by the memory available on a single machine.

Re: Streaming and incremental cooccurrence

2015-05-06 Thread Pat Ferrel
100GB of RAM is now practically commonplace. Recently I’ve seen many indicators and item metadata stored with cooccurrence and indexed. This produces extremely flexible results, since the query determines the result, not the model. But it does increase the number of cooccurrences linearly with the number of indicators…

Re: Streaming and incremental cooccurrence

2015-04-24 Thread Ted Dunning
Sounds about right. My guess is that memory is now large enough, especially on a cluster, that the cooccurrence matrix will quite often fit in memory. Taking a large example of 10 million items and 10,000 cooccurrences each, there will be 100 billion cooccurrences to store, which shouldn't take more than…
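
A back-of-envelope for that example, as a Scala sketch; the per-entry byte count is my assumption, not from the message:

    // 10 million items x 10,000 cooccurrences each:
    val items        = 10L * 1000 * 1000       // 1e7 items
    val coocsPerItem = 10L * 1000              // 1e4 cooccurrences each
    val entries      = items * coocsPerItem    // 1e11 = 100 billion entries
    // Assuming ~8 bytes per entry (an Int item index plus an Int count):
    val bytes = entries * 8L                   // ~8e11 bytes, roughly 800 GB
    // Too big for one 100GB machine, but easily partitioned across a cluster.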

Re: Streaming and incremental cooccurrence

2015-04-24 Thread Pat Ferrel
Ok, seems right. So now to data structures. The input frequency vectors need to be paired with each input interaction type, and it would be nice to have them as something that can be copied very quickly as they get updated. Random access would also be nice, but iteration is not needed. Over time they will grow…
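
One possible shape for those vectors, sketched under my own assumptions (the names are not from the thread): a dense array gives O(1) random access and can be bulk-copied very quickly, and growth is handled by reallocating with headroom.

    // Hypothetical frequency vector: dense counts indexed by user or item id.
    final class FrequencyVector(private val counts: Array[Int]) {
      def apply(id: Int): Int = counts(id)               // random access
      def updated(id: Int, delta: Int): FrequencyVector = {
        val copy = counts.clone()                        // fast bulk copy
        copy(id) += delta
        new FrequencyVector(copy)
      }
      // Growing over time: reallocate with headroom when a new id appears.
      def grownTo(size: Int): FrequencyVector =
        if (size <= counts.length) this
        else new FrequencyVector(java.util.Arrays.copyOf(counts, size * 2))
    }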

Re: Streaming and incremental cooccurrence

2015-04-23 Thread Ted Dunning
On Thu, Apr 23, 2015 at 8:53 AM, Pat Ferrel wrote:
> This seems to violate the random choice of interactions to cut but now
> that I think about it does a random choice really matter?

It hasn't ever mattered such that I could see. There is also some reason to claim that earliest is best if it…

Re: Streaming and incremental cooccurrence

2015-04-23 Thread Pat Ferrel
Randomizing interaction down-sampling is probably more important on the starting batch, since there it is done on an entire input row or column; it is less important once a cut-off has already been reached. All new interactions (new items, for instance) would not have reached the cut anyway, which is important since…

Re: Streaming and incremental cooccurrence

2015-04-23 Thread Pat Ferrel
Removal is not as important as adding (which can be done). Also, removal is often for business logic, like removal from a catalog, so a refresh may be driven by non-math considerations. Removal of users is only to clean things up and is not required very often. Removal of items can happen from recs too…

Re: Streaming and incremental cooccurrence

2015-04-22 Thread Ted Dunning
On Wed, Apr 22, 2015 at 8:07 PM, Pat Ferrel wrote:
> I think we have been talking about an idea that does an incremental
> approximation, then a refresh every so often to remove any approximation so
> in an ideal world we need both.

Actually, the method I was pushing is exact. If the sampling…

Re: Streaming and incremental cooccurrence

2015-04-22 Thread Pat Ferrel
Currently maxPrefs is applied to the input, both rows and columns (in both the Hadoop and Scala versions), and has a default of 500. maxSimilaritiesPerItem is for the cooccurrence matrix and is applied to its rows; the default is 50. Similar down-sampling is done in row similarity. For a new way to use threshold I was thinking…
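
For reference, those knobs as a sketch (the defaults are the ones quoted above; the class name is mine):

    // Down-sampling parameters as described in this thread:
    case class DownSampling(
      maxPrefs: Int = 500,              // cap per row AND column of the input
      maxSimilaritiesPerItem: Int = 50  // cap per row of the cooccurrence matrix
    )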

Re: Streaming and incremental cooccurrence

2015-04-19 Thread Ted Dunning
Inline

On Sun, Apr 19, 2015 at 11:05 AM, Pat Ferrel wrote:
> Short answer, you are correct this is not a new filter.
>
> The Hadoop MapReduce implements:
> * maxSimilaritiesPerItem
> * maxPrefs
> * minPrefsPerUser
> * threshold
>
> Scala version:
> * maxSimilaritiesPerItem

I think of this as…

Re: Streaming and incremental cooccurrence

2015-04-19 Thread Pat Ferrel
Short answer, you are correct this is not a new filter.

The Hadoop MapReduce implements:
* maxSimilaritiesPerItem
* maxPrefs
* minPrefsPerUser
* threshold

Scala version:
* maxSimilaritiesPerItem
* maxPrefs

The paper talks about an interaction-cut, and describes it with "There is no significant…

Re: Streaming and incremental cooccurrence

2015-04-18 Thread Ted Dunning
I just answered in another thread. Yes, I am. I just didn't think I was proposing it; I thought it was in Sebastian's paper and ultimately in our code (which I haven't looked at in over a year).

On Sat, Apr 18, 2015 at 7:38 PM, Pat Ferrel wrote:
> Hey Ted,
>
> You seem to be proposing a new cut…

Re: Streaming and incremental cooccurrence

2015-04-18 Thread Ted Dunning
On Sat, Apr 18, 2015 at 11:29 AM, Pat Ferrel wrote:
> If so you would need something like 4 things with random access/in-memory
> speed:
> 1&2) frequency vectors for interactions per user and item; these may be
> updated and are used to calculate LLR and also for cutting new update
> interactions…
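
Concretely, the LLR those frequency vectors feed comes from a 2x2 contingency table; a Scala sketch in the spirit of Mahout's LogLikelihood (variable names mine):

    // k11 = interactions with both A and B, k12 = A without B,
    // k21 = B without A, k22 = neither; all derivable from the two
    // frequency vectors, the cooccurrence count, and the total count.
    def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x)
    def entropy(elems: Long*): Double =
      xLogX(elems.sum) - elems.map(xLogX).sum
    def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
      val rowEntropy    = entropy(k11 + k12, k21 + k22)
      val columnEntropy = entropy(k11 + k21, k12 + k22)
      val matrixEntropy = entropy(k11, k12, k21, k22)
      math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy))
    }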

Re: Streaming and incremental cooccurrence

2015-04-18 Thread Pat Ferrel
Hey Ted,

You seem to be proposing a new cut by frequency of item interaction, is this correct? It is performed before the multiply, right?

On Apr 18, 2015, at 8:29 AM, Pat Ferrel wrote:
There are two cuts applied in the batch calc:
1) number of interactions per user
2) number of items in the…

Re: Streaming and incremental cooccurrence

2015-04-18 Thread Ted Dunning
On Sat, Apr 18, 2015 at 11:29 AM, Pat Ferrel wrote:
> You seem to be proposing a new cut by frequency of item interaction, is
> this correct? This is because the frequency is known before the multiply
> and LLR. I assume the #2 cut is left in place?

Yes, but I didn't think it was new.

Re: Streaming and incremental cooccurrence

2015-04-18 Thread Andrew Musselman
Cool

On Saturday, April 18, 2015, Ted Dunning wrote:
> Andrew
>
> Take a look at the slides I posted. In them I showed that the update does
> not grow beyond a very reasonable bound.
>
> Sent from my iPhone
>
> > On Apr 18, 2015, at 9:15, Andrew Musselman wrote:
> >
> > Yes that's what I m…

Re: Streaming and incremental cooccurrence

2015-04-18 Thread Ted Dunning
Andrew

Take a look at the slides I posted. In them I showed that the update does not grow beyond a very reasonable bound.

Sent from my iPhone

> On Apr 18, 2015, at 9:15, Andrew Musselman wrote:
>
> Yes that's what I mean; if the number of updates gets too big it probably
> would be unmana…

Re: Streaming and incremental cooccurrence

2015-04-18 Thread Andrew Musselman
Yes, that's what I mean; if the number of updates gets too big it probably would be unmanageable, though. This approach worked well with daily updates, but I never tried it with anything "real time."

On Saturday, April 18, 2015, Pat Ferrel wrote:
> I think you are saying that instead of val newHash…

Re: Streaming and incremental cooccurrence

2015-04-18 Thread Pat Ferrel
I think you are saying that instead of val newHashMap = lastHashMap ++ updateHashMap, layered updates might be useful, since the new and last maps are potentially large. Some limit on the number of updates might trigger a refresh. This might work if the update works with incremental index updates in the search engine.
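
A minimal sketch of that layering (all names mine): lookups fall through from the newest layer, and hitting the layer limit triggers a refresh that collapses the stack back into one map.

    // Layered updates instead of rebuilding one big merged map each time.
    class LayeredMap[K, V](base: Map[K, V], maxLayers: Int = 10) {
      private var layers: List[Map[K, V]] = List(base)  // newest first
      def addLayer(update: Map[K, V]): Unit = {
        layers = update :: layers
        if (layers.size > maxLayers) refresh()          // limit triggers refresh
      }
      def get(key: K): Option[V] =
        layers.collectFirst { case m if m.contains(key) => m(key) }
      // Collapse oldest-to-newest so the newest values win:
      def refresh(): Unit = layers = List(layers.reverse.reduce(_ ++ _))
    }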

Re: Streaming and incremental cooccurrence

2015-04-18 Thread Pat Ferrel
There are two cuts applied in the batch calc:
1) number of interactions per user
2) number of items in the resulting cooccurrence vectors (calc LLR, sort, cut the lowest items per the limit)

You seem to be proposing a new cut by frequency of item interaction, is this correct? This is because the frequen…
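
Cut #2 as stated, sketched with assumed names: score candidates with LLR, sort, and keep the strongest maxSimilaritiesPerItem per row.

    // For one row of the cooccurrence matrix: (itemId, llrScore) pairs in,
    // the top-scoring `limit` entries out; everything else is cut.
    def cutRow(scored: Seq[(Int, Double)], limit: Int = 50): Seq[(Int, Double)] =
      scored.sortBy { case (_, score) => -score }.take(limit)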

Re: Streaming and incremental cooccurrence

2015-04-17 Thread Andrew Musselman
I have not implemented it for recommendations, but a layered cache/sieve structure could be useful. That is, between batch refreshes you keep tacking on new updates in cascading order, so values that have been updated exist in the newest layer; otherwise the lookup falls through to the latest updated lay…

Re: Streaming and incremental cooccurrence

2015-04-17 Thread Ted Dunning
Yes. Also add the fact that the nano-batches are bounded tightly in size, both max and mean, and mostly filtered away anyway. Aging is an open question. I have never seen any effect from alternative sampling, so I would just assume "keep oldest", which just tosses more samples. Then occasionally r…
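
"Keep oldest" is cheap to apply per nano-batch; a sketch of my reading of it (names and shapes assumed):

    // Interactions for users whose rows are already at maxPrefs are tossed
    // with no re-sampling of what is stored, which is why most of each
    // nano-batch is filtered away cheaply.
    def filterBatch(batch: Seq[(Int, Int)],          // (userId, itemId)
                    rowSize: Int => Int,             // current row sizes
                    maxPrefs: Int = 500): Seq[(Int, Int)] =
      batch.filter { case (user, _) => rowSize(user) < maxPrefs }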

Re: Streaming and incremental cooccurrence

2015-04-17 Thread Pat Ferrel
Thanks. This idea is based on a micro-batch of interactions per update, not individual ones, unless I missed something. That matches the typical input flow. Most interactions are filtered away by the frequency and number-of-interactions cuts. A couple of practical issues: in practice, won’t this requir…

Re: Streaming and incremental cooccurrence

2015-04-17 Thread Ted Dunning
When I think of real-time adaptation of indicators, I think of this:
http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime

On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel wrote:
> I’ve been thinking about Streaming (continuous input) and incr…

Streaming and incremental cooccurrence

2015-04-17 Thread Pat Ferrel
I’ve been thinking about Streaming (continuous input) and incremental cooccurrence. As interactions stream in from the user, it is fairly simple to use something like Spark Streaming to maintain a moving time window over all input, and an update frequency that recalcs all input currently in the time window…
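
A minimal Spark Streaming sketch of that moving-window recalc; the source, batch interval, and window sizes here are placeholders, not recommendations:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("incremental-cooccurrence")
    val ssc  = new StreamingContext(conf, Seconds(30))          // micro-batch size
    val interactions = ssc.socketTextStream("localhost", 9999)  // placeholder source

    // Keep a sliding window over recent input and recalc on each slide:
    interactions
      .window(Minutes(60 * 24), Minutes(10))  // 24h of input, recalc every 10 min
      .foreachRDD { windowed =>
        // recompute cooccurrence (e.g. A'A with LLR and down-sampling) over
        // everything currently in the window -- details elided here
      }

    ssc.start()
    ssc.awaitTermination()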