Hello Bruce,
is there a way to set lucene so that it only parses/crawls through a given portion of a website...
Lucene does not crawl anything on its own. It is simply a search engine library. All indexing and crawling must be done by an application that you create.
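Since the crawling has to live in your own application, limiting it to one section of a site comes down to a URL check that the crawler applies before it follows a link. A minimal sketch in Python follows. None of this is part of Lucene: the names `in_scope` and `links_to_follow` and the example.edu URLs are made up for illustration, and fetching pages and pulling out the `href` values is assumed to happen elsewhere.

```python
from urllib.parse import urljoin, urlsplit


def in_scope(url, root):
    """Return True if url is on root's host and at or below root's path."""
    u, r = urlsplit(url), urlsplit(root)
    if u.scheme not in ("http", "https"):
        return False
    if u.netloc.lower() != r.netloc.lower():
        return False
    base = r.path.rstrip("/")
    # Accept the section's root page itself, or anything underneath it.
    # The trailing "/" keeps "/registrar-old/" from matching "/registrar".
    return u.path.rstrip("/") == base or u.path.startswith(base + "/")


def links_to_follow(page_url, hrefs, root):
    """Resolve raw href values against page_url and keep the in-scope ones."""
    keep = []
    for href in hrefs:
        # Make the link absolute and drop any "#fragment" part.
        url = urljoin(page_url, href).split("#", 1)[0]
        if in_scope(url, root) and url not in keep:
            keep.append(url)
    return keep
```

With `root` set to the top of the registrar section, the crawler would only queue the URLs that `links_to_follow` returns, and each fetched page can then be handed to whatever indexing or extraction step you write.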
also, when lucene returns data, can it be slammed/put into a large file/db structure so i can extract the requisite information from it.
I know that it is possible to set up Lucene to store its data in a SQL database.
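Separately from how Lucene stores its index, the application doing the crawling could also write each fetched page into a database of its own and run a regex over the stored pages later. This is a rough sketch of that idea in Python using SQLite, assuming the page text has already been fetched. The table layout and the names `open_store`, `save_page`, and `find_all` are hypothetical and only there to illustrate the approach.

```python
import re
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small page store; ':memory:' keeps it in RAM."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def save_page(conn, url, body):
    """Insert a page, replacing any earlier copy stored under the same URL."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)", (url, body)
    )
    conn.commit()


def find_all(conn, pattern):
    """Return (url, matched_text) for every regex match in every stored page."""
    rx = re.compile(pattern)
    hits = []
    for url, body in conn.execute("SELECT url, body FROM pages ORDER BY url"):
        hits.extend((url, m.group(0)) for m in rx.finditer(body))
    return hits
```

Passing a file name to `open_store` instead of the default keeps the pages on disk, so the extraction patterns can be refined without crawling the site again.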
btw, how does lucene/nutch compare to heritrix?
I'm not sure, I have never used heritrix.

Hope some of this was helpful.

--Mike

On 6/20/06, bruce <[EMAIL PROTECTED]> wrote:
hi..

is there a way to set lucene so that it only parses/crawls through a given portion of a website...

i have a college site. i'm looking at simply extracting all the information for a given section of the site, ie the registrar section... if i can determine that i want all the pages underneath a given url, can lucene be used as a possible solution...

also, when lucene returns data, can it be slammed/put into a large file/db structure so i can extract the requisite information from it.

i'm wondering if i already more or less know the DOM structure of the information i'm looking for, i could simply crawl the given section of the college site, if i could figure out a way to limit the amount of information that's returned.. i could then do a regex kind of search across the returned pages...

my overall goal is to extract certain pieces of information from the section of the college site that i crawl...

btw, how does lucene/nutch compare to heritrix?

thanks...

-bruce

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]