Ken,

This sounds like (another) good idea.

What I would suggest is that you have three or four kinds of dates:

- the date the page itself was published or last modified (if you can get that from page headers)

- any dates that appear textually near a phrase or on the same page as a
phrase

- the date on which a page linking to the page in question was created.

- any dates that appear textually near a link to a page.

For clustering, what I would do first is try to find significant
associations between dates and phrases.  Since you want association due to
proximity in time as well as proximity in text, you probably should encode
the dates at several levels of resolution.  My stab at that would be to use
semi-overlapping intervals that vary in size from a single day to about a
year.
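
To make the multi-resolution encoding concrete, here is a rough sketch in
Python.  The window sizes (1, 7, 30 and 365 days), the half-window stride and
the names WINDOW_SIZES and interval_tokens are assumptions chosen only for
illustration; nothing here is tuned.

```python
from datetime import date

# Window lengths in days. These particular values are an assumption.
WINDOW_SIZES = (1, 7, 30, 365)

def interval_tokens(d, sizes=WINDOW_SIZES):
    """Return one token for every window, at every size, that contains d."""
    day = d.toordinal()
    tokens = []
    for size in sizes:
        # A new window starts every half window, so neighbours overlap.
        step = max(1, size // 2)
        k = day // step  # index of the last window starting on or before day
        # Walk backwards over earlier windows that still cover this day.
        while k >= 0 and k * step + size > day:
            tokens.append(f"w{size}_{k}")
            k -= 1
    return tokens

if __name__ == "__main__":
    print(interval_tokens(date(2009, 11, 16)))
```

Two dates that are close together share the coarse tokens and differ only in
the fine ones, which is what lets proximity in time show up as shared features.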

The analysis I am suggesting would proceed roughly as follows:

- find interesting phrases by simple cooccurrence analysis or even just by
frequency

- find interesting cooccurrence of words and phrases with any particular
time period from any of the four date sources above.
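
The second step above could be sketched as follows.  This sketch scores each
(phrase, period) pair with a log-likelihood ratio (G^2) test on a 2x2
contingency table; that choice of test is an assumption made for illustration,
and any reasonable association measure would slot in.  The function names and
the shape of the input (a list of (phrase, period) observations) are
hypothetical.

```python
from collections import Counter
from math import log

def xlogx(x):
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    """Unnormalized entropy of a set of counts."""
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """G^2 for a 2x2 table; near zero when rows and columns are independent."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def score_pairs(observations):
    """Score every observed (phrase, period) pair.

    Note that G^2 is also large when a pair occurs much less often than
    expected, so a real pipeline would want to check the direction as well.
    """
    pair = Counter(observations)
    phrase = Counter(p for p, _ in observations)
    period = Counter(t for _, t in observations)
    n = len(observations)
    scores = {}
    for (p, t), k11 in pair.items():
        k12 = phrase[p] - k11        # phrase seen in other periods
        k21 = period[t] - k11        # other phrases seen in this period
        k22 = n - k11 - k12 - k21    # everything else
        scores[(p, t)] = llr(k11, k12, k21, k22)
    return scores
```

The periods here can be the interval labels at any resolution, so the same
scoring runs unchanged over days, weeks or years.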

You could probably abuse the LDA program into giving you something
interesting about topic spikes by rearranging all of your text into
"documents" that are slices of time.  The topic distribution for each period
would vary when something of interest for a short period of time turns up.
My guess is that the cooccurrence-based approach would give you better results
in the short run because it would have an easier time finding specific
examples of interest.
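
As a minimal sketch of the time-slice rearrangement, assuming the input is a
list of (date, text) pairs: the 30-day slice width and the name
time_slice_documents are made up for illustration, and the step of feeding the
resulting pseudo-documents to an LDA implementation is left out.

```python
from collections import defaultdict
from datetime import date

def time_slice_documents(dated_texts, days_per_slice=30):
    """Concatenate all text falling in the same fixed-width slice of time."""
    slices = defaultdict(list)
    for d, text in dated_texts:
        slices[d.toordinal() // days_per_slice].append(text)
    # One pseudo-document per slice, in chronological order.
    return {k: " ".join(v) for k, v in sorted(slices.items())}

if __name__ == "__main__":
    pages = [
        (date(2009, 11, 16), "first page text"),
        (date(2009, 11, 16), "second page text"),
        (date(2011, 1, 1), "much later page text"),
    ]
    for slice_id, doc in time_slice_documents(pages).items():
        print(slice_id, doc)
```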

On Mon, Nov 16, 2009 at 11:14 AM, Ken Krugler
<[email protected]> wrote:

> So I'm going to toss out one idea...
>
>  - I'd like to automatically generate a timeline of events.
>
>  - I can extract dates and 2-to-4 word "terms" from web pages.
>
> - Could I use LDA to create clusters of common terms for dates?
>
> I don't want to get into named entity extraction, or anything that involves
> more than simple data extraction and then the application of a scalable
> algorithm currently supported by Mahout.
>
> Looking for feedback on the above, in terms of feasibility, level of
> effort, interest by others to help, etc.
>
> Or if somebody else has a suggestion for something simple that could
> provide interesting/obvious results, I'm all ears.
>



-- 
Ted Dunning, CTO
DeepDyve
