Short answer: you are correct, this is not a new filter. The Hadoop MapReduce version implements:
* maxSimilaritiesPerItem
* maxPrefs
* minPrefsPerUser
* threshold

The Scala version implements:
* maxSimilaritiesPerItem
* maxPrefs

The paper talks about an interaction-cut and describes it with "There is no significant decrease in the error for incorporating more interactions from the 'power users' after that." While I'd trust your reading better than mine, I took that to mean downsampling overactive users. However, both the Hadoop MapReduce and the Scala versions downsample both user and item interactions by maxPrefs. So you are correct, not a new thing.

The paper also talks about the threshold, and we've talked on the list about how better to implement it. A fixed number is not very useful, so a number of sigmas was proposed, but that is not yet implemented.
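For concreteness, here is a minimal sketch of what the proposed sigma-based threshold might look like. It is not in the code; the function name, the `llrScores` input, and the `sigmas` parameter are illustrative only, not Mahout API:

```scala
// Hypothetical sigma-based threshold: keep only cooccurrence pairs whose
// LLR score is more than `sigmas` standard deviations above the mean score.
// Illustrative sketch only -- not implemented in Mahout.
def sigmaThreshold(llrScores: Seq[Double], sigmas: Double): Double = {
  require(llrScores.nonEmpty, "need at least one score")
  val n = llrScores.size.toDouble
  val mean = llrScores.sum / n
  val variance = llrScores.map(s => (s - mean) * (s - mean)).sum / n
  mean + sigmas * math.sqrt(variance)
}

// Usage: drop everything below the computed cut instead of a fixed number.
// val cut  = sigmaThreshold(scores, sigmas = 2.0)
// val kept = scores.filter(_ >= cut)
```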
On Apr 18, 2015, at 4:39 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

I just answered in another thread. Yes, I am. I just didn't think I was proposing it; I thought it was in Sebastian's paper and ultimately in our code (which I haven't looked at in over a year).

On Sat, Apr 18, 2015 at 7:38 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> Hey Ted,
>
> You seem to be proposing a new cut by frequency of item interaction, is this correct? It is performed before the multiply, right?
>
>
> On Apr 18, 2015, at 8:29 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> There are two cuts applied in the batch calc:
> 1) number of interactions per user
> 2) number of items in the resulting cooccurrence vectors (calc LLR, sort, cut the lowest items per the limit)
>
> You seem to be proposing a new cut by frequency of item interaction, is this correct? This is because the frequency is known before the multiply and LLR. I assume the #2 cut is left in place?
>
> If so you would need something like 4 things with random-access / in-memory speed (sketched below):
> 1&2) frequency vectors for interactions per user and per item; these may be updated and are used to calculate LLR and also for cutting new update interactions.
> 3) cooccurrence matrix with LLR weights; this is also stored in the search engine or DB (without weights), so any update needs to trigger an engine index update.
> 4) item dictionary
>
> #3 might not be a matrix but a hashmap with key = item-id and value = a vector of items. If the vector has item keys = int you would also need the item dictionary.
>
> Given the frequency vectors it seems like the interaction matrices are no longer needed?
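A minimal sketch of the four in-memory structures described above, using plain Scala collections. The names, the `llr` helper, and the commented update recipe are illustrative, not Mahout's actual classes:

```scala
import scala.collection.mutable

// 1 & 2) interaction frequency per user and per item
val userFreq = mutable.Map.empty[Int, Long].withDefaultValue(0L)
val itemFreq = mutable.Map.empty[Int, Long].withDefaultValue(0L)

// 3) cooccurrence "matrix" as a hashmap: key = item-id,
//    value = map of cooccurring item-id -> LLR weight
val cooccurrence = mutable.Map.empty[Int, mutable.Map[Int, Double]]

// 4) item dictionary: external item id <-> internal int key
val itemDictionary = mutable.Map.empty[String, Int]

// Standard log-likelihood ratio from the 2x2 contingency counts
// (k11 = both, k12/k21 = one only, k22 = neither).
def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x)
def entropy(counts: Long*): Double = xLogX(counts.sum) - counts.map(xLogX).sum
def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
  val rowEntropy = entropy(k11 + k12, k21 + k22)
  val colEntropy = entropy(k11 + k21, k12 + k22)
  val matEntropy = entropy(k11, k12, k21, k22)
  2.0 * math.max(0.0, rowEntropy + colEntropy - matEntropy)
}

// Updating one pair (a, b), where `cooccurrenceCount(a, b)` and `numUsers`
// are assumed to be maintained elsewhere (hypothetical helpers):
// val k11 = cooccurrenceCount(a, b)      // users who interacted with both
// val k12 = itemFreq(a) - k11            // a but not b
// val k21 = itemFreq(b) - k11            // b but not a
// val k22 = numUsers - k11 - k12 - k21   // neither
// cooccurrence.getOrElseUpdate(a, mutable.Map.empty)(b) = llr(k11, k12, k21, k22)
```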
> On Apr 17, 2015, at 7:25 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> Yes. Also add the fact that the nano batches are bounded tightly in size, both max and mean. And mostly filtered away anyway.
>
> Aging is an open question. I have never seen any effect of alternative sampling, so I would just assume "keep oldest", which just tosses more samples. Then occasionally rebuild from batch if you really want aging to go right.
>
> Search updates these days are true realtime as well, so that works very well.
>
> Sent from my iPhone
>
>> On Apr 17, 2015, at 17:20, Pat Ferrel <p...@occamsmachete.com> wrote:
>>
>> Thanks.
>>
>> This idea is based on a micro-batch of interactions per update, not individual ones, unless I missed something. That matches the typical input flow. Most interactions are filtered away by the frequency and number-of-interaction cuts.
>>
>> A couple of practical issues:
>>
>> In practice won't this require aging of interactions too? So wouldn't the update require some old-interaction removal? I suppose this might just take the form of added null interactions representing the geriatric ones? I haven't gone through the math with enough detail to see if you've already accounted for this.
>>
>> To use actual math (self-join, etc.) we still need to alter the geometry of the interactions to have the same row rank as the adjusted total. In other words, the number of rows in all resulting interaction matrices must be the same. Over time this means completely removing rows and columns, or allowing empty rows, in potentially all input matrices.
>>
>> It might not be too bad to accumulate gaps in rows and columns. Not sure it would have a practical impact (up to some large limit) as long as it was done to keep the real size more or less fixed.
>>
>> As to realtime, that would be under search-engine control through incremental indexing, and there are a couple of ways to do that, not a problem afaik. As you point out, the query always works and is real time. The index update must be frequent and must not impact the engine's availability for queries.
>>
>> On Apr 17, 2015, at 2:46 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>
>> When I think of real-time adaptation of indicators, I think of this:
>>
>> http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime
>>
>>> On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>> I've been thinking about streaming (continuous input) and incremental cooccurrence.
>>>
>>> As interactions stream in from the user it is fairly simple to use something like Spark Streaming to maintain a moving time window over all input, and an update frequency that recalcs all input currently in the time window. I've done this with the current cooccurrence code, but though it is streaming, it is not incremental.
>>>
>>> The current data flow goes from interaction input to geometry and user dictionary reconciliation to A'A, A'B, etc. After the multiply, the resulting cooccurrence matrices are LLR weighted/filtered/down-sampled.
>>>
>>> Incremental can mean all sorts of things and may imply different trade-offs. Did you have anything specific in mind?
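For reference, a rough sketch of the windowed, non-incremental recalculation described in that last message. Spark Streaming's `socketTextStream` and `window` are real APIs; the input source, the tab-separated format, and `recomputeCooccurrence` (standing in for the existing batch cooccurrence/LLR code) are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object WindowedCooccurrence {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("windowed-cooccurrence")
    // Micro-batches arrive every 60 seconds
    val ssc = new StreamingContext(conf, Seconds(60))

    // Interactions as "userId\titemId" lines; the source is illustrative only.
    val interactions = ssc.socketTextStream("localhost", 9999)
      .map(_.split("\t"))
      .filter(_.length == 2)
      .map(a => (a(0), a(1)))

    // Keep a moving one-hour window, recomputed every 10 minutes.
    interactions.window(Minutes(60), Minutes(10)).foreachRDD { rdd =>
      // Placeholder: run the existing batch cooccurrence + LLR + downsampling
      // over everything currently in the window. This is streaming but not
      // incremental -- the whole window is recalculated each time.
      // recomputeCooccurrence(rdd)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```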