Short answer: you are correct, this is not a new filter. The Hadoop MapReduce version implements:
* maxSimilaritiesPerItem
* maxPrefs
* minPrefsPerUser
* threshold

The Scala version implements:
* maxSimilaritiesPerItem
* maxPrefs

The paper talks about an interaction-cut and describes it with "There is no significant decrease in the error for incorporating more interactions from the 'power users' after that." While I'd trust your reading better than mine, I took that to mean downsampling overactive users. However, both the Hadoop MapReduce and the Scala versions downsample both user and item interactions by maxPrefs. So you are correct, not a new thing.

The paper also talks about the threshold, and we've talked on the list about how better to implement it. A fixed number is not very useful, so a number of sigmas was proposed, but that is not yet implemented.
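For concreteness, here is a minimal sketch of what the proposed sigma-based threshold might look like. It is not in the code; the function name, the `llrScores` input, and the `sigmas` parameter are illustrative only, not Mahout API:

```scala
// Hypothetical sigma-based threshold: keep only cooccurrence pairs whose
// LLR score is more than `sigmas` standard deviations above the mean score.
// Illustrative sketch only -- not implemented in Mahout.
def sigmaThreshold(llrScores: Seq[Double], sigmas: Double): Double = {
  require(llrScores.nonEmpty, "need at least one score")
  val n = llrScores.size.toDouble
  val mean = llrScores.sum / n
  val variance = llrScores.map(s => (s - mean) * (s - mean)).sum / n
  mean + sigmas * math.sqrt(variance)
}

// Usage: drop everything below the computed cut instead of a fixed number.
// val cut  = sigmaThreshold(scores, sigmas = 2.0)
// val kept = scores.filter(_ >= cut)
```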
On Apr 18, 2015, at 4:39 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

I just answered in another thread. Yes, I am. I just didn't think I was proposing it; I thought it was in Sebastian's paper and ultimately in our code (which I haven't looked at in over a year).

On Sat, Apr 18, 2015 at 7:38 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> Hey Ted,
>
> You seem to be proposing a new cut by frequency of item interaction, is this correct? It is performed before the multiply, right?
>
>
> On Apr 18, 2015, at 8:29 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> There are two cuts applied in the batch calc:
> 1) number of interactions per user
> 2) number of items in the resulting cooccurrence vectors (calc LLR, sort, cut the lowest items per the limit)
>
> You seem to be proposing a new cut by frequency of item interaction, is this correct? This is because the frequency is known before the multiply and LLR. I assume the #2 cut is left in place?
>
> If so you would need something like 4 things with random-access / in-memory speed (sketched below):
> 1&2) frequency vectors for interactions per user and per item; these may be updated and are used to calculate LLR and also for cutting new update interactions.
> 3) cooccurrence matrix with LLR weights; this is also stored in the search engine or DB (without weights), so any update needs to trigger an engine index update.
> 4) item dictionary
>
> #3 might not be a matrix but a hashmap with key = item-id and value = a vector of items. If the vector has item keys = int you would also need the item dictionary.
>
> Given the frequency vectors it seems like the interaction matrices are no longer needed?
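A minimal sketch of the four in-memory structures described above, using plain Scala collections. The names, the `llr` helper, and the commented update recipe are illustrative, not Mahout's actual classes:

```scala
import scala.collection.mutable

// 1 & 2) interaction frequency per user and per item
val userFreq = mutable.Map.empty[Int, Long].withDefaultValue(0L)
val itemFreq = mutable.Map.empty[Int, Long].withDefaultValue(0L)

// 3) cooccurrence "matrix" as a hashmap: key = item-id,
//    value = map of cooccurring item-id -> LLR weight
val cooccurrence = mutable.Map.empty[Int, mutable.Map[Int, Double]]

// 4) item dictionary: external item id <-> internal int key
val itemDictionary = mutable.Map.empty[String, Int]

// Standard log-likelihood ratio from the 2x2 contingency counts
// (k11 = both, k12/k21 = one only, k22 = neither).
def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x)
def entropy(counts: Long*): Double = xLogX(counts.sum) - counts.map(xLogX).sum
def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
  val rowEntropy = entropy(k11 + k12, k21 + k22)
  val colEntropy = entropy(k11 + k21, k12 + k22)
  val matEntropy = entropy(k11, k12, k21, k22)
  2.0 * math.max(0.0, rowEntropy + colEntropy - matEntropy)
}

// Updating one pair (a, b), where `cooccurrenceCount(a, b)` and `numUsers`
// are assumed to be maintained elsewhere (hypothetical helpers):
// val k11 = cooccurrenceCount(a, b)      // users who interacted with both
// val k12 = itemFreq(a) - k11            // a but not b
// val k21 = itemFreq(b) - k11            // b but not a
// val k22 = numUsers - k11 - k12 - k21   // neither
// cooccurrence.getOrElseUpdate(a, mutable.Map.empty)(b) = llr(k11, k12, k21, k22)
```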
> On Apr 17, 2015, at 7:25 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> Yes. Also add the fact that the nano batches are bounded tightly in size, both max and mean. And mostly filtered away anyway.
>
> Aging is an open question. I have never seen any effect of alternative sampling, so I would just assume "keep oldest", which just tosses more samples. Then occasionally rebuild from batch if you really want aging to go right.
>
> Search updates these days are true realtime as well, so that works very well.
>
> Sent from my iPhone
>
>> On Apr 17, 2015, at 17:20, Pat Ferrel <p...@occamsmachete.com> wrote:
>>
>> Thanks.
>>
>> This idea is based on a micro-batch of interactions per update, not individual ones, unless I missed something. That matches the typical input flow. Most interactions are filtered away by the frequency and number-of-interaction cuts.
>>
>> A couple of practical issues:
>>
>> In practice won't this require aging of interactions too? So wouldn't the update require some old-interaction removal? I suppose this might just take the form of added null interactions representing the geriatric ones? I haven't gone through the math with enough detail to see if you've already accounted for this.
>>
>> To use actual math (self-join, etc.) we still need to alter the geometry of the interactions to have the same row rank as the adjusted total. In other words, the number of rows in all resulting interaction matrices must be the same. Over time this means completely removing rows and columns, or allowing empty rows, in potentially all input matrices.
>>
>> It might not be too bad to accumulate gaps in rows and columns. Not sure it would have a practical impact (up to some large limit) as long as it was done to keep the real size more or less fixed.
>>
>> As to realtime, that would be under search-engine control through incremental indexing, and there are a couple of ways to do that, not a problem afaik. As you point out, the query always works and is real time. The index update must be frequent and must not impact the engine's availability for queries.
>>
>> On Apr 17, 2015, at 2:46 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>
>> When I think of real-time adaptation of indicators, I think of this:
>>
>> http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime
>>
>>> On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>> I've been thinking about streaming (continuous input) and incremental cooccurrence.
>>>
>>> As interactions stream in from the user it is fairly simple to use something like Spark Streaming to maintain a moving time window over all input, and an update frequency that recalcs all input currently in the time window. I've done this with the current cooccurrence code, but though it is streaming, it is not incremental.
>>>
>>> The current data flow goes from interaction input to geometry and user dictionary reconciliation to A'A, A'B, etc. After the multiply, the resulting cooccurrence matrices are LLR weighted/filtered/down-sampled.
>>>
>>> Incremental can mean all sorts of things and may imply different trade-offs. Did you have anything specific in mind?
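For reference, a rough sketch of the windowed, non-incremental recalculation described in that last message. Spark Streaming's `socketTextStream` and `window` are real APIs; the input source, the tab-separated format, and `recomputeCooccurrence` (standing in for the existing batch cooccurrence/LLR code) are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object WindowedCooccurrence {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("windowed-cooccurrence")
    // Micro-batches arrive every 60 seconds
    val ssc = new StreamingContext(conf, Seconds(60))

    // Interactions as "userId\titemId" lines; the source is illustrative only.
    val interactions = ssc.socketTextStream("localhost", 9999)
      .map(_.split("\t"))
      .filter(_.length == 2)
      .map(a => (a(0), a(1)))

    // Keep a moving one-hour window, recomputed every 10 minutes.
    interactions.window(Minutes(60), Minutes(10)).foreachRDD { rdd =>
      // Placeholder: run the existing batch cooccurrence + LLR + downsampling
      // over everything currently in the window. This is streaming but not
      // incremental -- the whole window is recalculated each time.
      // recomputeCooccurrence(rdd)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```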