You might also check out an old paper by Kruger, Giles, Lawrence et al.
on a search engine called Deadliner:
http://clgiles.ist.psu.edu/papers/CIKM-2000-deadliner.pdf
Deadliner crawled for conference Calls for Papers, using Support Vector
Machines trained to recognise relevant pages, and then applied sets of
regular expressions to extract information from the CFP pages.
Lawrence is now with Google, I believe.
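To give a rough idea of what the regular-expression stage might look
like, here is a minimal Python sketch. It covers only the extraction
step, not the SVM page classifier, and the pattern and the names
DEADLINE_RE and extract_deadlines are my own illustration, not taken
from the Deadliner paper.

```python
import re

# Hypothetical pattern for illustration only; not from the Deadliner paper.
# Matches "deadline" (optionally preceded by "submission") followed by a
# date written as "15 March 2001" or "April 2, 2001".
DEADLINE_RE = re.compile(
    r"(?:submission\s+)?deadline\s*:?\s*"
    r"(?P<date>\d{1,2}\s+[A-Z][a-z]+\s+\d{4}"
    r"|[A-Z][a-z]+\s+\d{1,2},\s*\d{4})",
    re.IGNORECASE,
)


def extract_deadlines(text):
    """Return every date string that follows a 'deadline' label in text."""
    return [m.group("date") for m in DEADLINE_RE.finditer(text)]


if __name__ == "__main__":
    sample = (
        "Submission deadline: 15 March 2001\n"
        "Camera-ready Deadline: April 2, 2001\n"
    )
    print(extract_deadlines(sample))
```

In practice you would need one such pattern per field and several
variants per pattern, which is presumably why the paper talks about
sets of regular expressions.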
Hope this helps,
Simon
Bob Carpenter wrote:
Check out Andrew McCallum's paper:
http://www.cs.umass.edu/~mccallum/papers/acm-queue-ie.pdf
It mentions this very problem. There are
also some more technical presentations around.
He was part of the Whiz-Bang team that took
on the problem. The fact that the company's
out of business is a testament to how hard
this problem is in general.
- Bob Carpenter
Alias-i
i'm looking at a problem and i can't figure out how to "easily" solve
it...
basically, i'm trying to figure out if there's a way to use lucene/nutch
with some form of pattern matching to extract course information from a
College/Registrar's course section...
Assume I can point to a Registrar's section of a College site.
Assume I can then crawl through the section, and capture
all the underlying information, including the Course
information...
Is there a way to use pattern matching/recognition
to interpret the DOM and pull out the class schedule
information? I'm pretty sure there's no vanilla approach,
so I'd even consider some kind of solution where I might
have to initially evaluate/analyze the site, to tell it
what DOM elements are "important"...
anyone done any work/projects like this...
any research/papers/sample apps i could look at...
any thoughts/comments/etc....
i could brute force this by writing a bunch of perl
scripts, with each script tied to a given registrar site,
but i'd like a more generalizable solution if one exists..
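As a rough illustration of the "tell it what DOM elements are
important" idea, the sketch below keeps one generic extractor and moves
the site-specific knowledge into a small per-site setting. Everything
here is an assumption made for the example: the class name
TableRowExtractor, the function extract_schedule, the SITE_CONFIG
entries, and the premise that the schedule lives in an HTML table with
a known class attribute. Nested tables are not handled.

```python
from html.parser import HTMLParser

# Hypothetical per-site settings: which table class holds the schedule.
SITE_CONFIG = {
    "example-college": {"table_class": "schedule"},
}


class TableRowExtractor(HTMLParser):
    """Collect cell text from rows of tables carrying a given class."""

    def __init__(self, table_class):
        super().__init__()
        self.table_class = table_class
        self.depth = 0        # > 0 while inside a matching table
        self.in_cell = False  # True while inside a <td> or <th>
        self.rows = []
        self._row = None
        self._cell = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            if self.depth or self.table_class in classes:
                self.depth += 1
        elif self.depth and tag == "tr":
            self._row = []
        elif self.depth and tag in ("td", "th") and self._row is not None:
            self.in_cell = True
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif self.depth and tag in ("td", "th") and self.in_cell:
            # Collapse runs of whitespace inside the cell text.
            self._row.append(" ".join("".join(self._cell).split()))
            self.in_cell = False
        elif self.depth and tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self.in_cell:
            self._cell.append(data)


def extract_schedule(html, table_class):
    """Return the rows (lists of cell strings) of the matching tables."""
    parser = TableRowExtractor(table_class)
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    page = (
        '<table class="nav"><tr><td>Home</td></tr></table>'
        '<table class="schedule">'
        "<tr><th>Course</th><th>Time</th></tr>"
        "<tr><td>CS 101</td><td>MWF 9:00</td></tr>"
        "</table>"
    )
    cfg = SITE_CONFIG["example-college"]
    for row in extract_schedule(page, cfg["table_class"]):
        print(row)
```

This still needs a person to inspect each site once and fill in its
settings, so it is closer to the per-site scripts than to a truly
general solution; it only shrinks the per-site work to a few lines.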
thanks
-bruce
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--
Dr. Simon Courtenage
Software Systems Engineering Research Group
Dept. of Software Engineering, Cavendish School of Computer Science
University of Westminster, London, UK
Email: [EMAIL PROTECTED] Web: http://users.cscs.wmin.ac.uk/~courtes |
http://www.sse.wmin.ac.uk