[google-appengine] Algorithm advice: Help design a twist on the digg scoring model

esilver Mon, 09 Mar 2009 17:17:49 -0700

Hi folks,

I have an algorithm question for the group that I've been toying with
for awhile and really want to get some fresh thoughts on. It's kind of
a twist on the digg scoring model, so I thought it might be
interesting for a lot of people on this list.


My site is http://www.wordds.com. It's basically like Google Reader +
Digg for feature-length articles from magazines. Content is not user-
contributed; instead we crawl about a dozen popular magazine sites.

Presently the home page lists recent articles, sorted into categories
based on word count. The algorithm that scores each article to
determine the order they appear on the home page (which is the same as
the recent page) sucks. I wrote it so I can say so. I'm constantly
dissatisfied and am manually tuning it too much.

The key requirement I am trying to satisfy is heterogeneity. I don't
want clumps of articles from the same source appearing next to each
other in any given word-count bucket.

Let's focus on one word count bucket, and say we are loading data from
three magazines, A, B, and C, represented by three arrays of articles:
[A1, A2, A3], [B1, B2], [C1, C2, C3, C4, C5]. If we are loading these
articles at the same time, the proper display order should be: [A1,
B1, C1, A2, B2, C2, A3, C3, C4, C5]. Any ordering of A1, B1, and C1 is
fine, any ordering of A2, B2, C2 is fine, etc, so we could reasonably
have [B1, C1, A1, A2, C2, B2, ...], but we should never have [A1, A2,
B1, C1, ...]

As we go further out in time, this restriction gets relaxed, such that
if we load a new source D a month later, where its array of articles
is [D1, D2, D3], it would be acceptable to have [D1, D2, A1, D3, B1,
C1, ...]

What I've realized is that some magazine sources publish daily/weekly
(duh), others publish monthly (duh), and the algorithm I have in place
is causing the daily/weekly magazines to rise too fast in my "recent"
results. Therefore, let's say that A publishes daily, B publishes
weekly, and C publishes monthly. On day 2, A publishes the following 2
new articles: [A4, A5]. The proper 'recent articles' display order
should now be:

[A4, B1, C1, A5, B2, C2, A1, C3, ...]

There are also source-specific recent pages. Thus, if you filter just
on source A, the proper display order is:
[A4, A5, A1, A2, A3]. Again, the order of article display for articles
loaded at the same time is arbitrary, so [A5, A4, A1, ...] is
acceptable too.

Back to our 'recent articles' home page: because source B publishes
weekly, it should hold its spot as the second article in the list for
up to a week. Let's say B gets lazy and doesn't publish (writers
strike). A continues its daily schedule, so now it's OK if B starts
losing some ground: one display order could be: [A10, A11, C1, A9, B1,
C2, ]

Notice that C is still holding its positions in the display order
because he publishes monthly. Eventually C would refresh and kick out
its older articles from the top of the list. And if C didn't refresh
it would be slowly replaced by A or other sources that did.

* * *

I read the article at http://code.google.com/appengine/articles/overheard.html
that talks about voting and decay. That works great for my voting
code, but not for my recent articles code. Here is my 'recent' scoring
algorithm:

# wc_bucket is the word count bucket this article is in
# wc_bucket_ct is a per-source dictionary

created = (now - datetime.datetime(2008, 10, 1)).days

if wc_bucket not in wc_bucket_ct:
   wc_bucket_ct[wc_bucket] = 1

strata = wc_bucket_ct[wc_bucket]
wc_bucket_ct[wc_bucket] = strata + 1

rnd = ((settings.DAY_SCALE * 14) / strata) + random.random() # up to a
2-wk boost
score = (created * settings.DAY_SCALE) + rnd


You can see in this code that I'm breaking up articles into various
strata and then adding a random amount to get that heterogeneity.
DAY_SCALE is set to 2, meaning each additional day is worth 2 more
points. But, this code totally doesn't capture the daily/weekly/
monthly aspect of sources, which I think is a piece of state I will
have to manually add.

Anyhow, if anyone on this list likes these sorts of algorithmic
puzzles and is inclined to propose an elegant solution, I'm all ears.
If I haven't explained anything well enough, let me know. And who
knows, maybe this would make a good interview question for someone :-)

Cheers folks,
-Eric


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to google-appengine@googlegroups.com
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en
-~----------~----~----~----~------~----~------~--~---

[google-appengine] Algorithm advice: Help design a twist on the digg scoring model

Reply via email to