It does bring up a nice way to order the items in the A and B docs: by timestamp, 
if available. That way, when you get an h_b doc from B for the query:

recommend based on behavior with regard to B items and actions h_b
      query is [b-b-links: h_b]

the h_b items are ordered by recency, and you can truncate based on the number of 
actions you want to consider. This should be very easy to implement if only we 
could attach data to the items in the DRMs.
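
Roughly what I have in mind for the truncation step, sketched in plain Java and 
assuming each action carried a timestamp. The Action class and field names below 
are made up for illustration, not anything that exists in Mahout today:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical per-user action record; today nothing like the timestamp
// survives the Mahout jobs.
class Action {
  final long itemId;
  final long timestamp;
  Action(long itemId, long timestamp) { this.itemId = itemId; this.timestamp = timestamp; }
}

class HistoryTruncator {
  // Keep only the maxActions most recent actions from one user's h_b history.
  static List<Action> mostRecent(List<Action> history, int maxActions) {
    List<Action> sorted = new ArrayList<Action>(history);
    Collections.sort(sorted, new Comparator<Action>() {
      public int compare(Action a, Action b) {
        return Long.compare(b.timestamp, a.timestamp); // newest first
      }
    });
    return sorted.subList(0, Math.min(maxActions, sorted.size()));
  }
}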

Actually, this brings up another point that I've harped on before. It sure would 
be nice to have a vector representation where you could attach arbitrary data 
to items or vectors. It's not so memory efficient, but it makes things like ID 
translation and timestamping actions trivial. If that data could be attached and 
could survive all the Mahout jobs, there would be no need for the in-memory 
hashmap I'm using to translate IDs, and the actions could be timestamped or 
carry other metadata. At present, as I guess everyone knows, only weights are 
attached to actions/matrix values, and in some cases names are attached to 
rows/vectors in DRMs. 
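
To make the idea concrete, here is roughly the shape of it, sketched with plain 
Java maps rather than real Mahout classes. Think of a NamedVector that can also 
carry a property bag per element. Everything below is hypothetical, not an 
existing API:

import java.util.HashMap;
import java.util.Map;

// Hypothetical element that carries a weight plus an arbitrary property bag
// (external ID, timestamp, whatever metadata the job wants to pass through).
class MetaElement {
  double weight;
  final Map<String, Object> properties = new HashMap<String, Object>();
}

// Hypothetical row: a name, like NamedVector gives you, plus per-item metadata.
class MetaRow {
  final String rowName; // e.g. the external user ID instead of a Mahout int key
  final Map<Integer, MetaElement> elements = new HashMap<Integer, MetaElement>();

  MetaRow(String rowName) { this.rowName = rowName; }

  // Set a weight and attach one piece of metadata to the item at this index.
  void set(int index, double weight, String key, Object value) {
    MetaElement e = elements.get(index);
    if (e == null) {
      e = new MetaElement();
      elements.put(index, e);
    }
    e.weight = weight;
    e.properties.put(key, value);
  }
}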


On Aug 4, 2013, at 12:59 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

On Sun, Aug 4, 2013 at 9:35 AM, Pat Ferrel <pat.fer...@gmail.com> wrote:

> 2) This is not an ideal way to downsample, if I understand the code. It keeps
> the first items ingested, which has nothing to do with their timestamp.
> You'd ideally want to truncate based on the order the actions were taken by
> the user, keeping the newest.



There are at least three options for down-sampling.  All have arguments in
their favor and probably have good applications.  I don't think it actually
matters, however, since down-sampling should mostly be applied to
pathological cases like bots or QA teams.

The options that I know of include:

1) take the first events you see.  This is easy.  For content, it may be
best to do this because it gives you information about the context of the
content when it first appears.  For users, this may be the worst as a
characterization of the user now, but it may be near best for the off-line
item-item analysis because it preserves a densely sampled view of some past
moment in time.

2) take the last events you see.  This is also easy, but not quite as easy
as (1) since you can't stop early if you see the data in chronological
order (see the sketch after these options).  For content, this gives you the
latest view of the content and pushes all data for all items into the same
time frame, which might increase overlap in the offline analysis.  For users
at recommendation time, it is probably exactly what you want.

3) take some time-weighted sampling that is in between these two options.
You can do reservoir sampling to get a fair sample, or you can do random
replacement, which weights the recent past more heavily than the far past.
Both of these are attractive for various reasons.  The strongest argument
for recency-weighted sampling is probably that it is hard to decide between
(1) and (2).
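
For reference, here is a minimal sketch of what (2) and both flavors of (3) 
amount to in code: a bounded buffer of the most recent events, classic reservoir 
sampling for a fair sample, and the always-replace variant that weights recent 
events more heavily. Class and method names are illustrative only, not anything 
in the Mahout code:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Random;

// (2) Keep only the last maxEvents events seen for a user, evicting the oldest.
class LastNEvents {
  private final int maxEvents;
  private final Deque<Long> buffer = new ArrayDeque<Long>();

  LastNEvents(int maxEvents) { this.maxEvents = maxEvents; }

  void add(long itemId) {
    if (buffer.size() == maxEvents) {
      buffer.removeFirst();  // drop the oldest event
    }
    buffer.addLast(itemId);  // newest event goes on the tail
  }

  Deque<Long> events() { return buffer; }
}

// (3) Either a fair (uniform) sample of everything seen, or, with
// recencyWeighted = true, a sample that decays older events exponentially
// by always overwriting a random slot once the reservoir is full.
class EventSampler {
  private final List<Long> reservoir = new ArrayList<Long>();
  private final int capacity;
  private final boolean recencyWeighted;
  private final Random rng = new Random();
  private long seen = 0;

  EventSampler(int capacity, boolean recencyWeighted) {
    this.capacity = capacity;
    this.recencyWeighted = recencyWeighted;
  }

  void add(long itemId) {
    seen++;
    if (reservoir.size() < capacity) {
      reservoir.add(itemId);
    } else if (recencyWeighted) {
      // Random replacement: every new event displaces a random old one.
      reservoir.set(rng.nextInt(capacity), itemId);
    } else {
      // Classic reservoir sampling: keep a uniform sample over all events seen.
      long j = (long) (rng.nextDouble() * seen);
      if (j < capacity) {
        reservoir.set((int) j, itemId);
      }
    }
  }

  List<Long> sample() { return reservoir; }
}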

As stated above, however, this probably doesn't much matter since the
sampling being done in the off-line analysis is mostly only applied to
crazy users or stuff so popular that any sample will do.
