Hi folks, I have an algorithm question for the group that I've been toying with for awhile and really want to get some fresh thoughts on. It's kind of a twist on the digg scoring model, so I thought it might be interesting for a lot of people on this list.
My site is http://www.wordds.com. It's basically like Google Reader + Digg for feature-length articles from magazines. Content is not user- contributed; instead we crawl about a dozen popular magazine sites. Presently the home page lists recent articles, sorted into categories based on word count. The algorithm that scores each article to determine the order they appear on the home page (which is the same as the recent page) sucks. I wrote it so I can say so. I'm constantly dissatisfied and am manually tuning it too much. The key requirement I am trying to satisfy is heterogeneity. I don't want clumps of articles from the same source appearing next to each other in any given word-count bucket. Let's focus on one word count bucket, and say we are loading data from three magazines, A, B, and C, represented by three arrays of articles: [A1, A2, A3], [B1, B2], [C1, C2, C3, C4, C5]. If we are loading these articles at the same time, the proper display order should be: [A1, B1, C1, A2, B2, C2, A3, C3, C4, C5]. Any ordering of A1, B1, and C1 is fine, any ordering of A2, B2, C2 is fine, etc, so we could reasonably have [B1, C1, A1, A2, C2, B2, ...], but we should never have [A1, A2, B1, C1, ...] As we go further out in time, this restriction gets relaxed, such that if we load a new source D a month later, where its array of articles is [D1, D2, D3], it would be acceptable to have [D1, D2, A1, D3, B1, C1, ...] What I've realized is that some magazine sources publish daily/weekly (duh), others publish monthly (duh), and the algorithm I have in place is causing the daily/weekly magazines to rise too fast in my "recent" results. Therefore, let's say that A publishes daily, B publishes weekly, and C publishes monthly. On day 2, A publishes the following 2 new articles: [A4, A5]. The proper 'recent articles' display order should now be: [A4, B1, C1, A5, B2, C2, A1, C3, ...] There are also source-specific recent pages. Thus, if you filter just on source A, the proper display order is: [A4, A5, A1, A2, A3]. Again, the order of article display for articles loaded at the same time is arbitrary, so [A5, A4, A1, ...] is acceptable too. Back to our 'recent articles' home page: because source B publishes weekly, it should hold its spot as the second article in the list for up to a week. Let's say B gets lazy and doesn't publish (writers strike). A continues its daily schedule, so now it's OK if B starts losing some ground: one display order could be: [A10, A11, C1, A9, B1, C2, ] Notice that C is still holding its positions in the display order because he publishes monthly. Eventually C would refresh and kick out its older articles from the top of the list. And if C didn't refresh it would be slowly replaced by A or other sources that did. * * * I read the article at http://code.google.com/appengine/articles/overheard.html that talks about voting and decay. That works great for my voting code, but not for my recent articles code. Here is my 'recent' scoring algorithm: # wc_bucket is the word count bucket this article is in # wc_bucket_ct is a per-source dictionary created = (now - datetime.datetime(2008, 10, 1)).days if wc_bucket not in wc_bucket_ct: wc_bucket_ct[wc_bucket] = 1 strata = wc_bucket_ct[wc_bucket] wc_bucket_ct[wc_bucket] = strata + 1 rnd = ((settings.DAY_SCALE * 14) / strata) + random.random() # up to a 2-wk boost score = (created * settings.DAY_SCALE) + rnd You can see in this code that I'm breaking up articles into various strata and then adding a random amount to get that heterogeneity. DAY_SCALE is set to 2, meaning each additional day is worth 2 more points. But, this code totally doesn't capture the daily/weekly/ monthly aspect of sources, which I think is a piece of state I will have to manually add. Anyhow, if anyone on this list likes these sorts of algorithmic puzzles and is inclined to propose an elegant solution, I'm all ears. If I haven't explained anything well enough, let me know. And who knows, maybe this would make a good interview question for someone :-) Cheers folks, -Eric --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Google App Engine" group. To post to this group, send email to google-appengine@googlegroups.com To unsubscribe from this group, send email to google-appengine+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en -~----------~----~----~----~------~----~------~--~---