I just answered in another thread. Yes, I am. I just didn't think I was proposing it. Thought it was in Sebastian's paper and ultimately in our code (that I haven't looked at in over a year).
On Sat, Apr 18, 2015 at 7:38 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Hey Ted,
>
> You seem to be proposing a new cut by frequency of item interaction, is
> this correct? It is performed before the multiply, right?
>
>
> On Apr 18, 2015, at 8:29 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> There are two cuts applied in the batch calc:
> 1) number of interactions per user
> 2) number of items in the resulting cooccurrence vectors (calc LLR, sort,
> cut the lowest-scoring items per the limit)
>
> You seem to be proposing a new cut by frequency of item interaction, is
> this correct? This is because the frequency is known before the multiply
> and LLR. I assume the #2 cut is left in place?
>
> If so you would need something like 4 things with random-access/in-memory
> speed:
> 1&2) frequency vectors for interactions per user and per item; these may
> be updated and are used to calculate LLR, and also for cutting new update
> interactions.
> 3) the cooccurrence matrix with LLR weights; this is also stored in the
> search engine or DB (without weights), so any update needs to trigger an
> engine index update.
> 4) the item dictionary
>
> #3 might not be a matrix but a hashmap with key = item-id and value = a
> vector of items. If the vector's item keys are ints you would also need
> the item dictionary.
>
> Given the frequency vectors, it seems like the interaction matrices are
> no longer needed?
>
>
> On Apr 17, 2015, at 7:25 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>
> Yes. Also add the fact that the nano-batches are bounded tightly in size,
> both max and mean. And mostly filtered away anyway.
>
> Aging is an open question. I have never seen any effect from alternative
> sampling, so I would just assume "keep oldest", which simply tosses more
> samples. Then occasionally rebuild from batch if you really want aging to
> go right.
>
> Search updates are now true realtime as well, so that works very well.
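[Editor's note: to make the "4 things with random-access/in-memory speed" concrete, here is a minimal Python sketch of the frequency vectors, the hashmap-style cooccurrence structure, and the item dictionary, with a micro-batch update applying the frequency cuts before any pair counting. All names, thresholds, and the `update` function are illustrative assumptions, not the actual Mahout code.]

```python
from collections import defaultdict

# Illustrative in-memory structures (names are assumptions, not Mahout's):
#   user_freq / item_freq -- the frequency vectors (#1 and #2 above)
#   cooc -- hashmap: item key -> {other item key: co-count} (#3 above,
#           counts here; LLR weighting would be applied downstream)
#   item_dict -- item-id string -> dense int key (#4 above)
user_freq = defaultdict(int)
item_freq = defaultdict(int)
cooc = defaultdict(lambda: defaultdict(int))
item_dict = {}

MAX_USER_FREQ = 100   # illustrative cut: drop overly active users
MAX_ITEM_FREQ = 500   # illustrative cut: drop overly frequent items

def item_key(item):
    """Assign a dense int key to an item id (the item dictionary)."""
    if item not in item_dict:
        item_dict[item] = len(item_dict)
    return item_dict[item]

def update(user, items):
    """Apply one micro-batch of interactions for a single user."""
    user_freq[user] += len(items)
    if user_freq[user] > MAX_USER_FREQ:
        return  # user cut: too many interactions, filter the batch away
    keys = []
    for item in items:
        item_freq[item] += 1
        if item_freq[item] <= MAX_ITEM_FREQ:
            keys.append(item_key(item))  # item cut applied pre-multiply
    # update cooccurrence counts for all item pairs in this batch
    for i in keys:
        for j in keys:
            if i != j:
                cooc[i][j] += 1
```

The point of the sketch is only that, given the two frequency vectors, an update can be cut and counted without touching full interaction matrices.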
> > Sent from my iPhone
> >
> >
> > On Apr 17, 2015, at 17:20, Pat Ferrel <p...@occamsmachete.com> wrote:
> >
> > Thanks.
> >
> > This idea is based on a micro-batch of interactions per update, not
> > individual ones, unless I missed something. That matches the typical
> > input flow. Most interactions are filtered away by the frequency and
> > number-of-interaction cuts.
> >
> > A couple of practical issues:
> >
> > In practice won't this require aging of interactions too? So wouldn't
> > the update require some old-interaction removal? I suppose this might
> > just take the form of added null interactions representing the geriatric
> > ones? I haven't gone through the math in enough detail to see if you've
> > already accounted for this.
> >
> > To use actual math (self-join, etc.) we still need to alter the geometry
> > of the interactions to have the same row rank as the adjusted total. In
> > other words, the number of rows in all resulting interactions must be
> > the same. Over time this means completely removing rows and columns, or
> > allowing empty rows in potentially all input matrices.
> >
> > It might not be too bad to accumulate gaps in rows and columns. I'm not
> > sure it would have a practical impact (up to some large limit) as long
> > as it was done to keep the real size more or less fixed.
> >
> > As to realtime, that would be under search-engine control through
> > incremental indexing, and there are a couple of ways to do that; not a
> > problem afaik. As you point out, the query always works and is real
> > time. The index update must be frequent and must not impact the engine's
> > availability for queries.
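[Editor's note: the "same row rank" / "accumulate gaps" idea can be sketched as a shared user dictionary that all interaction matrices (A, B, ...) index rows through. Aging a user out leaves an empty row rather than renumbering, and gaps are reused for new users so the real size stays roughly fixed. The class and its names are hypothetical, purely for illustration.]

```python
class UserDict:
    """Shared row index for all interaction matrices.

    Removing an aged-out user leaves a gap (an empty row) instead of
    shrinking the geometry, so every matrix keeps the same row rank.
    Gaps are reused for new users to keep the real size bounded.
    """
    def __init__(self):
        self.rows = {}   # user id -> row index
        self.gaps = []   # row indices freed by aged-out users
        self.size = 0    # total row rank, gaps included

    def row_of(self, user):
        """Look up a user's row, reusing a gap for a new user if one exists."""
        if user not in self.rows:
            if self.gaps:
                self.rows[user] = self.gaps.pop()
            else:
                self.rows[user] = self.size
                self.size += 1
        return self.rows[user]

    def age_out(self, user):
        """Remove a user; the row index becomes a reusable gap."""
        self.gaps.append(self.rows.pop(user))
```

Under this scheme A'A, A'B, etc. stay conformable across updates because every input matrix shares the same (gap-tolerant) row geometry.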
> > On Apr 17, 2015, at 2:46 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> >
> > When I think of real-time adaptation of indicators, I think of this:
> >
> > http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime
> >
> >
> >> On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> >>
> >> I've been thinking about streaming (continuous input) and incremental
> >> cooccurrence.
> >>
> >> As interactions stream in from the user, it is fairly simple to use
> >> something like Spark Streaming to maintain a moving time window over all
> >> input, with an update frequency that recalcs all input currently in the
> >> time window. I've done this with the current cooccurrence code, but
> >> though streaming, it is not incremental.
> >>
> >> The current data flow goes from interaction input, to geometry and
> >> user-dictionary reconciliation, to A'A, A'B, etc. After the multiply,
> >> the resulting cooccurrence matrices are LLR weighted/filtered/
> >> down-sampled.
> >>
> >> Incremental can mean all sorts of things and may imply different
> >> trade-offs. Did you have anything specific in mind?
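[Editor's note: the LLR weighting step mentioned above is Dunning's log-likelihood ratio over a 2x2 contingency table of cooccurrence counts. A minimal Python sketch of the standard formulation (this is not the Mahout implementation itself, though it computes the same quantity):]

```python
import math

def _entropy(*counts):
    """Unnormalized Shannon entropy: -sum k * log(k / N), zeros skipped."""
    total = sum(counts)
    return sum(-k * math.log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table of event counts:
    k11 = both items seen together, k12/k21 = one item without the
    other, k22 = neither item. Large values indicate anomalous
    cooccurrence; independence gives a score near zero."""
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)
```

In the batch calc this score is what the #2 cut sorts on: each item's cooccurrence vector is LLR-weighted, sorted, and truncated to the per-item limit.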