Ken,

This sounds like (another) good idea.

What I would suggest is that you have three or four kinds of dates:

- the date the page itself happened (if you can get that from page headers)
- any dates that appear textually near a phrase or on the same page as a phrase
- the date that a page linking to a page was created
- any dates that appear textually near a link to a page

For clustering, what I would do first is to try to find significant association between dates and phrases. Since you want association due to proximity in time as well as proximity in text, you probably should encode the dates at several levels of resolution. My stab at that would be to use semi-overlapping intervals that vary in size from a single day to about a year.

The analysis I am suggesting would proceed roughly as follows:

- find interesting phrases by simple cooccurrence analysis or even just by frequency
- find interesting cooccurrence of words and phrases with any particular time period from any of the four date sources above

You could probably abuse the LDA program into giving you something interesting about topic spikes by rearranging all of your text into "documents" that are slices of time. The topic distribution for each period would vary when something of interest for a short period of time turns up.

My guess is that the cooccurrence based stuff would give you better results in the short-run because it would have an easier time finding specific examples of interest.

On Mon, Nov 16, 2009 at 11:14 AM, Ken Krugler <[email protected]> wrote:

> So I'm going to toss out one idea...
>
> - I'd like to automatically generate a timeline of events.
>
> - I can extract dates and 2-to-4 word "terms" from web pages.
>
> - Could I use LDA to create clusters of common terms for dates?
>
> I don't want to get into named entity extraction, or anything that involves
> more than simple data extraction and then the application of a scalable
> algorithm currently supported by Mahout.
>
> Looking for feedback on the above, in terms of feasibility, level of
> effort, interest by others to help, etc.
>
> Or if somebody else has a suggestion for something simple that could
> provide interesting/obvious results, I'm all ears.
>

--
Ted Dunning, CTO
DeepDyve
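
To make the multi-resolution date encoding suggested above a bit more concrete, here is a minimal Python sketch. The window sizes, the half-window step, and the names `interval_keys` and `count_cooccurrences` are all assumptions made for illustration; nothing here comes from Mahout or from the thread. Each date is mapped to every semi-overlapping window that contains it, and phrase/window pairs are then counted so that some association test can be applied to the counts afterwards.

```python
from collections import Counter
from datetime import date

# Window sizes in days, from a single day up to about a year.
# These particular values are illustrative assumptions.
WINDOW_SIZES = (1, 7, 30, 365)


def interval_keys(d, sizes=WINDOW_SIZES):
    """Return the (size, start_ordinal) windows that contain date d.

    Windows of a given size start every size // 2 days (every day for
    size 1), so neighbouring windows overlap by roughly half.
    """
    n = d.toordinal()
    keys = []
    for size in sizes:
        step = max(1, size // 2)
        # Latest window start at or before n on the step grid.
        start = (n // step) * step
        # Walk back over every aligned window that still covers n.
        while start + size > n:
            keys.append((size, start))
            start -= step
    return keys


def count_cooccurrences(observations):
    """Count (phrase, window) pairs from an iterable of (date, phrase)."""
    pair_counts = Counter()
    for d, phrase in observations:
        for key in interval_keys(d):
            pair_counts[(phrase, key)] += 1
    return pair_counts
```

Because the windows overlap, two mentions a day apart always share at least one 7-day window even though they never share a 1-day window, which is the "proximity in time" effect the encoding is meant to capture. The raw counts are only a starting point; they would still need to be compared against the overall phrase and window frequencies to decide which associations are significant.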
