There are two cuts applied in the batch calc:
1) number of interactions per user
2) number of items in the resulting cooccurrence vectors (calculate LLR, sort, 
and cut the lowest-scoring items per the limit; sketched below)
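
For reference, cut #2 amounts to roughly this once the LLR scores for one 
item's cooccurrence vector are in hand (a minimal sketch in Scala; 
maxItemsPerVector and the score map are illustrative names, not the actual 
Mahout code):

    // Cut #2: keep only the strongest-scoring co-occurring items for one item.
    def cutVector(llrScores: Map[Int, Double], maxItemsPerVector: Int): Seq[(Int, Double)] =
      llrScores.toSeq
        .sortBy { case (_, llr) => -llr }  // highest LLR first
        .take(maxItemsPerVector)           // drop the lowest-scoring items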

You seem to be proposing a new cut by frequency of item interaction; is that 
correct? That works because the frequency is known before the multiply and 
LLR. I assume the #2 cut is left in place?

If so you would need something like 4 things with random access/in-memory speed:
1&2) frequency vectors for interactions per user and per item; these may be 
updated and are used to calculate LLR (sketched below) and also for cutting 
new update interactions.
3) cooccurrence matrix with LLR weights; this is also stored in the search 
engine or DB (without weights), so any update needs to trigger an engine 
index update.
4) item dictionary
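
For concreteness, the LLR weight can be computed from those frequency counts 
plus a cooccurrence count for the pair (a sketch of the standard G^2 form, as 
in Mahout's LogLikelihood; exactly what "total" counts depends on what the 
batch calc uses):

    // xLogX and entropy as in Mahout's LogLikelihood.
    def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x)
    def entropy(counts: Long*): Double = xLogX(counts.sum) - counts.map(xLogX).sum

    // freqA/freqB come from the item frequency vector, both is the
    // cooccurrence count for the pair, total is the overall interaction count.
    def llr(freqA: Long, freqB: Long, both: Long, total: Long): Double = {
      val k11 = both                          // A and B together
      val k12 = freqA - both                  // A without B
      val k21 = freqB - both                  // B without A
      val k22 = total - freqA - freqB + both  // neither A nor B
      val rowEntropy = entropy(k11 + k12, k21 + k22)
      val colEntropy = entropy(k11 + k21, k12 + k22)
      val matEntropy = entropy(k11, k12, k21, k22)
      math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy))
    }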

#3 might not be a matrix but a hashmap with key = item-id and value = a 
vector of items. If the vector stores item keys as ints you would also need 
the item dictionary (#4).
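
Concretely, something like this (illustrative names only):

    import scala.collection.mutable

    // #4: item dictionary mapping external item ids to dense ints and back.
    val itemToIndex = mutable.HashMap[String, Int]()
    val indexToItem = mutable.ArrayBuffer[String]()

    def indexOf(itemId: String): Int =
      itemToIndex.getOrElseUpdate(itemId, { indexToItem += itemId; indexToItem.size - 1 })

    // #3 as a hashmap: item index -> (co-occurring item index -> LLR weight).
    val cooccurrence = mutable.HashMap[Int, mutable.HashMap[Int, Double]]()

    // What the search engine/DB stores for one item: just the co-occurring
    // item ids, no weights, so an update here must trigger a re-index.
    def indicatorField(itemId: String): Seq[String] =
      cooccurrence.getOrElse(indexOf(itemId), mutable.HashMap.empty[Int, Double])
        .keys.map(indexToItem).toSeq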

Given the frequency vectors, it seems like the interaction matrices are no 
longer needed?


On Apr 17, 2015, at 7:25 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:


Yes. Also add the fact that the nano-batches are tightly bounded in size, both 
max and mean, and are mostly filtered away anyway. 

Aging is an open question. I have never seen any effect from alternative 
sampling, so I would just assume "keep oldest", which just tosses more 
samples. Then occasionally rebuild from batch if you really want aging to go 
right.
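
Roughly, "keep oldest" with a bounded row is just (a sketch; maxRowSize and 
the row type are illustrative):

    import scala.collection.mutable

    // Once an item's row of sampled interactions is full, new samples are
    // simply tossed; nothing already kept is evicted or resampled.
    def maybeAdd(row: mutable.ArrayBuffer[String], userId: String, maxRowSize: Int): Unit =
      if (row.size < maxRowSize) row += userId  // else: drop the new sample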

Search updates these days are true realtime also, so that works very well. 


> On Apr 17, 2015, at 17:20, Pat Ferrel <p...@occamsmachete.com> wrote:
> 
> Thanks. 
> 
> This idea is based on a micro-batch of interactions per update, not 
> individual ones, unless I missed something. That matches the typical input 
> flow. Most interactions are filtered away by the frequency and 
> number-of-interaction cuts.
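> 
> Roughly the per-micro-batch filter I have in mind (a sketch; the cut 
> thresholds and count maps are placeholder names):
> 
>     // Drop interactions from users already over the per-user cut and for
>     // items already over the item-frequency cut.
>     def filterBatch(batch: Seq[(String, String)],   // (userId, itemId)
>                     userCounts: Map[String, Int],
>                     itemCounts: Map[String, Int],
>                     maxPerUser: Int,
>                     maxPerItem: Int): Seq[(String, String)] =
>       batch.filter { case (user, item) =>
>         userCounts.getOrElse(user, 0) < maxPerUser &&
>         itemCounts.getOrElse(item, 0) < maxPerItem
>       }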
> 
> A couple practical issues
> 
> In practice won’t this require aging of interactions too? So wouldn’t the 
> update require some old-interaction removal? I suppose this might just take 
> the form of added null interactions representing the geriatric ones? I 
> haven’t gone through the math in enough detail to see whether you’ve already 
> accounted for this.
> 
> To use actual math (self-join, etc.) we still need to alter the geometry of 
> the interactions to have the same row rank as the adjusted total. In other 
> words, the number of rows in all resulting interaction matrices must be the 
> same. Over time this means completely removing rows and columns, or allowing 
> empty rows in potentially all input matrices.
> 
> It might not be too bad to accumulate gaps in rows and columns. Not sure it 
> would have a practical impact (up to some large limit) as long as it was 
> done to keep the real size more or less fixed.
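> 
> One way to get the gaps cheaply (a sketch; names are illustrative): assign 
> row ids from a shared user dictionary and never reuse them, so removing a 
> user's interactions just leaves an empty row at the same position in A, B, 
> etc. and the row geometry of all input matrices stays aligned.
> 
>     import scala.collection.mutable
> 
>     val userToRow = mutable.HashMap[String, Int]()
>     var nextRow = 0
> 
>     // The same row index is used for this user in every interaction matrix.
>     def rowOf(userId: String): Int =
>       userToRow.getOrElseUpdate(userId, { val r = nextRow; nextRow += 1; r })
> 
>     // Retiring a user leaves a permanent gap: the index is never reassigned.
>     def retireUser(userId: String): Unit = userToRow.remove(userId)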
> 
> As to realtime, that would be under search engine control through 
> incremental indexing, and there are a couple of ways to do that; not a 
> problem afaik. As you point out, the query always works and is realtime. The 
> index update must be frequent and must not impact the engine's availability 
> for queries.
> 
> On Apr 17, 2015, at 2:46 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> 
> 
> When I think of real-time adaptation of indicators, I think of this:
> 
> http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime
> 
> 
>> On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>> I’ve been thinking about Streaming (continuous input) and incremental 
>> cooccurrence.
>> 
>> As interactions stream in from the user it is fairly simple to use something 
>> like Spark Streaming to maintain a moving time window over all input, and an 
>> update frequency that recalcs all input currently in the time window. I’ve 
>> done this with the current cooccurrence code, but though it is streaming, it 
>> is not incremental.
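>> 
>> Roughly this (a sketch only; the input source, window/slide durations, and 
>> the recompute stub are placeholders for whatever the deployment uses):
>> 
>>     import org.apache.spark.SparkConf
>>     import org.apache.spark.rdd.RDD
>>     import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
>> 
>>     // Placeholder for the existing batch calc: A'A, A'B, LLR weighting,
>>     // down-sampling, then a search-engine index update.
>>     def recomputeCooccurrence(interactions: RDD[(String, String)]): Unit = ()
>> 
>>     val conf = new SparkConf().setAppName("windowed-cooccurrence")
>>     val ssc  = new StreamingContext(conf, Seconds(30))
>> 
>>     // Interactions arrive as "userId,itemId" lines; the socket source is
>>     // just a stand-in.
>>     val interactions = ssc.socketTextStream("localhost", 9999).map { line =>
>>       val Array(user, item) = line.split(",", 2)
>>       (user, item)
>>     }
>> 
>>     // Keep a 24-hour moving window, recomputed every 15 minutes. This
>>     // re-runs the full calc over the window: streaming, but not incremental.
>>     interactions.window(Minutes(24 * 60), Minutes(15)).foreachRDD { windowRdd =>
>>       recomputeCooccurrence(windowRdd)
>>     }
>> 
>>     ssc.start()
>>     ssc.awaitTermination()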
>> 
>> The current data flow goes from interaction input, to geometry and user 
>> dictionary reconciliation, to A’A, A’B, etc. After the multiply, the 
>> resulting cooccurrence matrices are LLR weighted/filtered/down-sampled.
>> 
>> Incremental can mean all sorts of things and may imply different trade-offs. 
>> Did you have anything specific in mind?
> 
> 
