I just answered in another thread. Yes, I am. I just didn't think I was proposing it. Thought it was in Sebastian's paper and ultimately in our code (that I haven't looked at in over a year).
On Sat, Apr 18, 2015 at 7:38 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Hey Ted,
>
> You seem to be proposing a new cut by frequency of item interaction, is
> this correct? It is performed before the multiply, right?
>
>
> On Apr 18, 2015, at 8:29 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> There are two cuts applied in the batch calc:
> 1) number of interactions per user
> 2) number of items in the resulting cooccurrence vectors (calc LLR, sort,
> cut the lowest-scoring items per the limit)
>
> You seem to be proposing a new cut by frequency of item interaction, is
> this correct? This is because the frequency is known before the multiply
> and LLR. I assume the #2 cut is left in place?
>
> If so you would need something like 4 things with random-access/in-memory
> speed:
> 1&2) frequency vectors for interactions per user and per item; these may
> be updated and are used to calculate LLR, and also for cutting new update
> interactions.
> 3) the cooccurrence matrix with LLR weights; this is also stored in the
> search engine or DB (without weights), so any update needs to trigger an
> engine index update.
> 4) the item dictionary
>
> #3 might not be a matrix but a hashmap with key = item-id and value = a
> vector of items. If the vector's item keys are ints you would also need
> the item dictionary.
>
> Given the frequency vectors, it seems like the interaction matrices are
> no longer needed?
>
>
> On Apr 17, 2015, at 7:25 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>
> Yes. Also add the fact that the nano-batches are bounded tightly in size,
> both max and mean. And mostly filtered away anyway.
>
> Aging is an open question. I have never seen any effect from alternative
> sampling, so I would just assume "keep oldest", which simply tosses more
> samples. Then occasionally rebuild from batch if you really want aging to
> go right.
>
> Search updates are now true realtime as well, so that works very well.
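[Editor's note: to make the "4 things with random-access/in-memory speed" concrete, here is a minimal Python sketch of the frequency vectors, the hashmap-style cooccurrence structure, and the item dictionary, with a micro-batch update applying the frequency cuts before any pair counting. All names, thresholds, and the `update` function are illustrative assumptions, not the actual Mahout code.]

```python
from collections import defaultdict

# Illustrative in-memory structures (names are assumptions, not Mahout's):
#   user_freq / item_freq -- the frequency vectors (#1 and #2 above)
#   cooc -- hashmap: item key -> {other item key: co-count} (#3 above,
#           counts here; LLR weighting would be applied downstream)
#   item_dict -- item-id string -> dense int key (#4 above)
user_freq = defaultdict(int)
item_freq = defaultdict(int)
cooc = defaultdict(lambda: defaultdict(int))
item_dict = {}

MAX_USER_FREQ = 100   # illustrative cut: drop overly active users
MAX_ITEM_FREQ = 500   # illustrative cut: drop overly frequent items

def item_key(item):
    """Assign a dense int key to an item id (the item dictionary)."""
    if item not in item_dict:
        item_dict[item] = len(item_dict)
    return item_dict[item]

def update(user, items):
    """Apply one micro-batch of interactions for a single user."""
    user_freq[user] += len(items)
    if user_freq[user] > MAX_USER_FREQ:
        return  # user cut: too many interactions, filter the batch away
    keys = []
    for item in items:
        item_freq[item] += 1
        if item_freq[item] <= MAX_ITEM_FREQ:
            keys.append(item_key(item))  # item cut applied pre-multiply
    # update cooccurrence counts for all item pairs in this batch
    for i in keys:
        for j in keys:
            if i != j:
                cooc[i][j] += 1
```

The point of the sketch is only that, given the two frequency vectors, an update can be cut and counted without touching full interaction matrices.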
> > Sent from my iPhone
> >
> >
> > On Apr 17, 2015, at 17:20, Pat Ferrel <p...@occamsmachete.com> wrote:
> >
> > Thanks.
> >
> > This idea is based on a micro-batch of interactions per update, not
> > individual ones, unless I missed something. That matches the typical
> > input flow. Most interactions are filtered away by the frequency and
> > number-of-interaction cuts.
> >
> > A couple of practical issues:
> >
> > In practice won't this require aging of interactions too? So wouldn't
> > the update require some old-interaction removal? I suppose this might
> > just take the form of added null interactions representing the geriatric
> > ones? I haven't gone through the math in enough detail to see if you've
> > already accounted for this.
> >
> > To use actual math (self-join, etc.) we still need to alter the geometry
> > of the interactions to have the same row rank as the adjusted total. In
> > other words, the number of rows in all resulting interactions must be
> > the same. Over time this means completely removing rows and columns, or
> > allowing empty rows in potentially all input matrices.
> >
> > It might not be too bad to accumulate gaps in rows and columns. I'm not
> > sure it would have a practical impact (up to some large limit) as long
> > as it was done to keep the real size more or less fixed.
> >
> > As to realtime, that would be under search-engine control through
> > incremental indexing, and there are a couple of ways to do that; not a
> > problem afaik. As you point out, the query always works and is real
> > time. The index update must be frequent and must not impact the engine's
> > availability for queries.
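[Editor's note: the "same row rank" / "accumulate gaps" idea can be sketched as a shared user dictionary that all interaction matrices (A, B, ...) index rows through. Aging a user out leaves an empty row rather than renumbering, and gaps are reused for new users so the real size stays roughly fixed. The class and its names are hypothetical, purely for illustration.]

```python
class UserDict:
    """Shared row index for all interaction matrices.

    Removing an aged-out user leaves a gap (an empty row) instead of
    shrinking the geometry, so every matrix keeps the same row rank.
    Gaps are reused for new users to keep the real size bounded.
    """
    def __init__(self):
        self.rows = {}   # user id -> row index
        self.gaps = []   # row indices freed by aged-out users
        self.size = 0    # total row rank, gaps included

    def row_of(self, user):
        """Look up a user's row, reusing a gap for a new user if one exists."""
        if user not in self.rows:
            if self.gaps:
                self.rows[user] = self.gaps.pop()
            else:
                self.rows[user] = self.size
                self.size += 1
        return self.rows[user]

    def age_out(self, user):
        """Remove a user; the row index becomes a reusable gap."""
        self.gaps.append(self.rows.pop(user))
```

Under this scheme A'A, A'B, etc. stay conformable across updates because every input matrix shares the same (gap-tolerant) row geometry.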
> > On Apr 17, 2015, at 2:46 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> >
> > When I think of real-time adaptation of indicators, I think of this:
> >
> > http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime
> >
> >
> >> On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> >>
> >> I've been thinking about streaming (continuous input) and incremental
> >> cooccurrence.
> >>
> >> As interactions stream in from the user, it is fairly simple to use
> >> something like Spark Streaming to maintain a moving time window over all
> >> input, with an update frequency that recalcs all input currently in the
> >> time window. I've done this with the current cooccurrence code, but
> >> though streaming, it is not incremental.
> >>
> >> The current data flow goes from interaction input, to geometry and
> >> user-dictionary reconciliation, to A'A, A'B, etc. After the multiply,
> >> the resulting cooccurrence matrices are LLR weighted/filtered/
> >> down-sampled.
> >>
> >> Incremental can mean all sorts of things and may imply different
> >> trade-offs. Did you have anything specific in mind?
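[Editor's note: the LLR weighting step mentioned above is Dunning's log-likelihood ratio over a 2x2 contingency table of cooccurrence counts. A minimal Python sketch of the standard formulation (this is not the Mahout implementation itself, though it computes the same quantity):]

```python
import math

def _entropy(*counts):
    """Unnormalized Shannon entropy: -sum k * log(k / N), zeros skipped."""
    total = sum(counts)
    return sum(-k * math.log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table of event counts:
    k11 = both items seen together, k12/k21 = one item without the
    other, k22 = neither item. Large values indicate anomalous
    cooccurrence; independence gives a score near zero."""
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)
```

In the batch calc this score is what the #2 cut sorts on: each item's cooccurrence vector is LLR-weighted, sorted, and truncated to the per-item limit.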