Hi Grant,
On Jan 3, 2011, at 8:54am, Grant Ingersoll wrote:
Hi,
I wanted to pick people's brains a little bit on the subject of
determining importance. This isn't necessarily Mahout related,
although I think we have some tools that help in the area.
One of the emerging trends it seems these days with all our
connectivity and content is a notion of importance/priority. Some
examples:
1. Google now has "Priority Inbox" for instance and I think most
would agree that for things like Twitter and Facebook it would be
really nice if you could separate out the Important updates/people
from the less important.
2. Identifying important phrases, etc. in text across a corpus.
3. One of the things I think most researchers do when exploring a
new topic is to identify the one or two seminal papers in the field,
read them, and then read the ones that cite those papers and so on.
4. Take in all the day's news and figure out what the key articles
are to read (in some sense it's picking the most representative
document in a cluster), or that the article talking about raising
Federal income taxes is likely more important than the one talking
about raising local sales tax (or vice versa!).
5. PageRank, TextRank, and other approaches to calculating
authority.
What I'm looking for is help in researching this area. Is there a
name for this (sub-)field (importance theory? prioritization
theory?), particularly within machine learning and NLP? I realize
some (most) of these problems can be solved
with classifiers amongst other things like graph algorithms
(particularly ones that use the social graph), but it also seems
like the area is bigger than a particular implementation, so I
wanted to hear what others thought. How would you go about solving
these problems? Do you have any pointers to useful references on
the subject (theoretical or practical)? What other examples have
you run up against?
For what it's worth, we took a run at this issue last February...
1. Collect all of your tweets, and the tweets of people you follow,
where the tweet has a URL.
2. Assign importance by author: high for your own tweets, and for
the people you follow, a value that depends on their number of
followers.
3. Fetch and parse referenced pages.
4. Use Mahout's k-means to generate 50 clusters or so.
5. Take the top clusters (up to 5), where "top" means a tight grouping
and significant number of members.
6. Use these top clusters to filter all tweets in the firehose, to
generate a ranked list of "important" tweets.
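Steps 5 and 6 can be sketched roughly as follows. This is a minimal
illustration in plain Python, not the code we ran: it assumes each page
has already been turned into a dense term vector and assigned to a
cluster. The function names (top_clusters, rank_items), the use of mean
cosine similarity to the centroid as the measure of "tight", and the
min_members and threshold values are all assumptions made for the
example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def top_clusters(clusters, min_members=3, max_clusters=5):
    """Step 5 (assumed scoring): keep clusters with enough members,
    then prefer the tightest ones, where tightness is the mean cosine
    similarity of the members to their centroid.

    clusters: list of clusters, each a list of member vectors.
    Returns the centroids of the selected clusters.
    """
    scored = []
    for members in clusters:
        if len(members) < min_members:
            continue  # not a "significant number of members"
        c = centroid(members)
        tightness = sum(cosine(m, c) for m in members) / len(members)
        scored.append((tightness, len(members), c))
    # Tightest first; ties broken by cluster size.
    scored.sort(key=lambda t: (t[0], t[1]), reverse=True)
    return [c for _, _, c in scored[:max_clusters]]

def rank_items(items, centroids, threshold=0.8):
    """Step 6 (assumed scoring): score each item by its similarity to
    the nearest top centroid, drop items below the threshold, and
    return (item_id, score) pairs sorted best first.

    items: list of (item_id, vector) pairs.
    """
    ranked = []
    for item_id, vec in items:
        score = max((cosine(vec, c) for c in centroids), default=0.0)
        if score >= threshold:
            ranked.append((item_id, score))
    ranked.sort(key=lambda p: p[1], reverse=True)
    return ranked
```

In this sketch the author importance from step 2 is left out; one
option would be to multiply each item's score by its author's weight
before ranking.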
The main challenge here was getting good clustering results. We tried
a number of different sparsification techniques on the page data, but
by our target deadline we still weren't getting great
results in identifying nice, crisp "topics" that were likely to be of
interest. It feels like something that would have eventually gotten
good enough, if we'd spent a lot more time playing with all of the
combinations, but we stuck to our timeline and wound up putting that
on the shelf.
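As an illustration of what "sparsification" can mean here, one common
option is to prune terms by document frequency before clustering. The
sketch below is only an example of that idea and not a record of the
specific techniques we tried; the function name prune_terms and the
min_df and max_df_ratio defaults are assumptions.

```python
from collections import Counter

def prune_terms(docs, min_df=2, max_df_ratio=0.5):
    """Drop terms that are too rare or too common across the corpus.

    docs: list of token lists, one per fetched page.
    A term is kept if it appears in at least min_df documents and in
    at most max_df_ratio of all documents (both cutoffs are assumed
    values for illustration).
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    keep = {t for t, c in df.items()
            if c >= min_df and c / n <= max_df_ratio}
    return [[t for t in tokens if t in keep] for tokens in docs]
```

Very common terms add little to separating topics and very rare ones
mostly add noise, so both ends are trimmed before the vectors go to
the clusterer.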
-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g