Check out Andrew McCallum's paper: http://www.cs.umass.edu/~mccallum/papers/acm-queue-ie.pdf
It mentions this very problem. There are also some more technical presentations around. He was part of the Whiz-Bang team that took on the problem. The fact that the company's out of business is a testament to how hard this problem is in general.

- Bob Carpenter
  Alias-i
I'm looking at a problem and I can't figure out how to "easily" solve it. Basically, I'm trying to figure out if there's a way to use Lucene/Nutch with some form of pattern matching to extract course information from a college registrar's course section.

Assume I can point to the Registrar's section of a college site. Assume I can then crawl through that section and capture all the underlying information, including the course information.

Is there a way to use pattern matching/recognition to interpret the DOM and pull out the class schedule information? I'm pretty sure there's no vanilla approach, so I'd even consider a solution where I have to initially evaluate/analyze the site to tell it which DOM elements are "important".

Has anyone done any work/projects like this? Are there any research papers or sample apps I could look at? Any thoughts or comments?

I could brute-force this by writing a bunch of Perl scripts, each tied to a given registrar site, but I'd like a more generalizable solution if one exists.

thanks

-bruce

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
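
For what it's worth, the idea in the question of telling the extractor which DOM elements are "important" for a given site can be sketched as one generic parser driven by a small per-site configuration, instead of one script per registrar. This is only a rough sketch under stated assumptions: the table id, the column names, and the sample HTML are all invented for illustration, and it uses Python's standard html.parser rather than anything from Lucene or Nutch. It does not address the harder general problem discussed in the paper.

```python
from html.parser import HTMLParser

# Hypothetical per-site configuration: which table holds the schedule and
# what each column means. Every value here is made up for illustration.
SITE_CONFIG = {
    "table_id": "schedule",
    "columns": ["course", "title", "days", "time"],
}


class ScheduleExtractor(HTMLParser):
    """Collects <td> rows of the configured table as dicts keyed by column name."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.in_table = False   # inside the configured table?
        self.in_cell = False    # inside a <td> of that table?
        self.cells = []         # text of the cells in the current row
        self.rows = []          # extracted records

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.config["table_id"]:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.cells = []
        elif self.in_table and tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.in_table:
            # Keep only rows whose cell count matches the configured columns;
            # header rows built from <th> end up with no cells and are skipped.
            if len(self.cells) == len(self.config["columns"]):
                cleaned = [c.strip() for c in self.cells]
                self.rows.append(dict(zip(self.config["columns"], cleaned)))
            self.cells = []
        elif tag == "table":
            # Simplification: nested tables are not handled.
            self.in_table = False


def extract_schedule(html, config):
    parser = ScheduleExtractor(config)
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented sample page: a navigation table to ignore and a schedule table.
SAMPLE_HTML = """
<table id="nav"><tr><td>Home</td></tr></table>
<table id="schedule">
  <tr><th>Course</th><th>Title</th><th>Days</th><th>Time</th></tr>
  <tr><td>CS 101</td><td>Intro to Programming</td><td>MWF</td><td>09:00</td></tr>
  <tr><td>MATH 200</td><td>Linear Algebra</td><td>TTh</td><td>13:30</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in extract_schedule(SAMPLE_HTML, SITE_CONFIG):
        print(row)
```

With this shape, supporting another registrar site would mean writing a new configuration, not a new script, though real sites will likely need richer selectors than a single table id.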