Re: lucene in combination with pattern recognition...

Bob Carpenter Thu, 22 Jun 2006 13:11:25 -0700

Check out Andrew McCallum's paper:

http://www.cs.umass.edu/~mccallum/papers/acm-queue-ie.pdf


It mentions this very problem.  There are
also some more technical presentations around.

He was part of the WhizBang! Labs team that took
on the problem.  The fact that the company's
out of business is a testament to how hard
this problem is in general.

- Bob Carpenter
  Alias-i


i'm looking at a problem and i can't figure out how to "easily" solve it...

basically, i'm trying to figure out if there's a way to use lucene/nutch
with some form of pattern matching to extract course information from a
College/Registrar's course section...

Assume I can point to a Registrar's section of a College site.
Assume I can then crawl through the section, and capture
 all the underlying information, including the Course
 information...
Is there a way to use pattern matching/recognition
 to interpret the DOM and pull out the class schedule
 information? I'm pretty sure there's no vanilla approach,
 so I'd even consider some kind of solution where I might
 have to initially evaluate/analyze the site, to tell it
 what DOM elements are "important"...
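
To make the "tell it what DOM elements are important" idea concrete, here is
a minimal sketch in Python, using only the standard library. Everything in it
is an assumption made for illustration: the SITE_RULES table, the
"example-college" key, the "schedule" class name and the column names are
hypothetical, and real registrar pages will need per-site rules of their own.
It is not a general solution, just the per-site analysis step written as data
instead of as a separate script per site.

```python
from html.parser import HTMLParser

# Hypothetical per-site rules: which table holds the schedule and what each
# column means. All names here are illustrative, not taken from a real site.
SITE_RULES = {
    "example-college": {
        "table_class": "schedule",
        "columns": ["course", "title", "days", "time"],
    },
}


class ScheduleParser(HTMLParser):
    """Collect the <td> text of every row inside the configured table.

    Nested tables are not handled: any </table> ends the capture.
    """

    def __init__(self, table_class):
        super().__init__()
        self.table_class = table_class
        self.in_table = False
        self.in_cell = False
        self.row = []
        self.rows = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and self.table_class in classes:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.row = []
        elif self.in_table and tag == "td":
            self.in_cell = True
            self.row.append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif self.in_table and tag == "td":
            self.in_cell = False
        elif self.in_table and tag == "tr" and self.row:
            # Rows with no <td> cells (e.g. header rows of <th>) are skipped.
            self.rows.append(self.row)
            self.row = []

    def handle_data(self, data):
        if self.in_cell:
            self.row[-1] += data


def extract_courses(html, site):
    """Return one dict per schedule row, keyed by the site's column names."""
    rules = SITE_RULES[site]
    parser = ScheduleParser(rules["table_class"])
    parser.feed(html)
    return [dict(zip(rules["columns"], (cell.strip() for cell in row)))
            for row in parser.rows]
```

With this shape, supporting another registrar site means adding an entry to
SITE_RULES rather than writing a new script, though each entry still has to
be worked out by inspecting that site's pages by hand.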

anyone done any work/projects like this...
any research/papers/sample apps i could look at...
any thoughts/comments/etc....

i could brute force this by writing a bunch of perl
scripts, with each script tied to a given registrar site,
but i'd like a more generalizable solution if one exists..

thanks

-bruce



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



