Re: [slightly off topic] Determining Importance

Ken Krugler Mon, 03 Jan 2011 10:08:36 -0800

Hi Grant,

On Jan 3, 2011, at 8:54am, Grant Ingersoll wrote:

Hi,
I wanted to pick people's brains a little bit on the subject of determining importance. This isn't necessarily Mahout related, although I think we have some tools that help in the area.
One of the emerging trends it seems these days with all our connectivity and content is a notion of importance/priority. Some examples:
1. Google now has "Priority Inbox" for instance and I think most would agree that for things like Twitter and Facebook it would be really nice if you could separate out the Important updates/people from the less important.
2. Identifying important phrases, etc. in text across a corpus.
3. One of the things I think most researchers do when exploring a new topic is to identify the one or two seminal papers in the field, read them, and then read the ones that cite those papers and so on.
4. Take in all the day's news and figure out what the key articles are to read (in some sense it's picking the most representative document in a cluster) or that the article talking about raising Federal income taxes is likely more important than the one talking about raising local sales tax (or vice versa!)
5. PageRank, TextRank, etc. and other approaches to calculating authority
What I'm looking for is help in researching this area. Is there a name for this (sub-)field (importance theory? prioritization theory?), particularly in mach. learning and NLP that is geared towards this? I realize some (most) of these problems can be solved with classifiers amongst other things like graph algorithms (particularly ones that use the social graph), but it also seems like the area is bigger than a particular implementation, so I wanted to hear what others thought. How would you go about solving these problems? Do you have any pointers to useful references on the subject (theoretical or practical)? What other examples have you run up against?


For what it's worth, we took a run at this issue last February...

1. Collect all of your tweets, and the tweets of people you follow, where the tweet has a URL.

2. Assign importance based on you (high) and the people you follow (depends on # of followers)


3. Fetch and parse referenced pages.

4. Use Mahout's kmeans to generate 50 clusters or so.

5. Take the top clusters (up to 5), where "top" means a tight grouping and significant number of members.

6. Use these top clusters to filter all tweets in the firehose, to generate a ranked list of "important" tweets.
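
For illustration, steps 5 and 6 can be sketched in plain Python as below. This is not the original implementation, which used Mahout's kmeans. It assumes the clustering has already been done and that each page is a sparse term vector stored as a dict. The function names, the definition of "tight" as mean cosine similarity to the centroid, and the min_size and threshold cutoffs are all assumptions made for the example. Author importance from step 2 is left out.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def top_clusters(clusters, max_clusters=5, min_size=3):
    """Step 5: keep clusters that are large enough, ranked by tightness.

    `clusters` maps a cluster id to the list of page vectors assigned to it.
    Tightness here is the mean cosine similarity of members to their
    centroid (an assumed definition, as is the min_size cutoff).
    """
    scored = []
    for cid, members in clusters.items():
        if len(members) < min_size:
            continue
        c = centroid(members)
        tightness = sum(cosine(m, c) for m in members) / len(members)
        scored.append((tightness, cid, c))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(cid, c) for _, cid, c in scored[:max_clusters]]


def rank_tweets(tweets, centroids, threshold=0.5):
    """Step 6: keep tweets whose page is close to any top centroid.

    `tweets` is a list of (tweet_id, page_vector) pairs; the result is a
    list of (tweet_id, score) pairs sorted from most to least similar.
    """
    ranked = []
    for tweet_id, vec in tweets:
        score = max((cosine(vec, c) for _, c in centroids), default=0.0)
        if score >= threshold:
            ranked.append((score, tweet_id))
    ranked.sort(reverse=True)
    return [(tweet_id, score) for score, tweet_id in ranked]
```

A caller would pass the output of top_clusters straight into rank_tweets. Multiplying each score by a per-author weight would be one way to fold in step 2.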

The main challenge here was getting good clustering results. We tried a number of different sparsification techniques on the page data, but by the end of our target deadline we still weren't getting great results in identifying nice, crisp "topics" that were likely to be of interest. It feels like something that would have eventually gotten good enough, if we'd spent a lot more time playing with all of the combinations, but we stuck to our timeline and wound up putting that on the shelf.
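
The message does not say which sparsification techniques were tried. One common option, shown here only as an assumed example, is to drop terms that occur in very few or very many pages before clustering. The function name and the min_df and max_df_ratio parameters are hypothetical.

```python
def prune_by_document_frequency(docs, min_df=2, max_df_ratio=0.5):
    """Drop terms that appear in too few or too many documents.

    `docs` is a list of sparse vectors stored as {term: weight}.
    A term is kept if it occurs in at least `min_df` documents and in
    at most `max_df_ratio` of all documents.
    """
    df = {}
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    max_df = max_df_ratio * len(docs)
    keep = {t for t, n in df.items() if min_df <= n <= max_df}
    return [{t: w for t, w in doc.items() if t in keep} for doc in docs]
```

Rare terms add dimensions without helping pages group together, and near-universal terms make every page look similar, so pruning both ends is a cheap first thing to vary when cluster quality is poor.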

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
