shrinath.m <shrinat...@webyog.com> wrote:

> Suppose we have offline HTML pages that were not parsed while crawling.
> Now what?  Has anyone built a tokenizer for this?

In UpLib, which uses PyLucene, I use BeautifulSoup to simplify Web pages
by selecting only text between certain tags, before indexing them.
These are offline Web pages, as in your application.  Take a look at 
<http://uplib.parc.com/hg/uplib/file/2a204fc2dd1a/extensions/FilterWebPage.py>.
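
A minimal sketch of that kind of tag-based filtering is below. It is not
the code in FilterWebPage.py. To keep it runnable without extra packages
it uses Python's standard-library html.parser rather than BeautifulSoup,
and the tag lists and names (KEEP_TAGS, SKIP_TAGS, TagTextExtractor,
extract_text) are assumptions made up for illustration. It also assumes
the kept tags are explicitly closed in the input.

```python
from html.parser import HTMLParser

# Hypothetical choice of tags whose text is considered worth indexing.
KEEP_TAGS = {"title", "h1", "h2", "h3", "p", "li"}
# Content of these tags is never indexed, even inside a kept tag.
SKIP_TAGS = {"script", "style"}


class TagTextExtractor(HTMLParser):
    """Collect text that appears inside any of KEEP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.keep_depth = 0   # how many kept tags are currently open
        self.skip_depth = 0   # how many skipped tags are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in KEEP_TAGS:
            self.keep_depth += 1

    def handle_endtag(self, tag):
        # Guard against stray closing tags driving a counter negative.
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in KEEP_TAGS and self.keep_depth > 0:
            self.keep_depth -= 1

    def handle_data(self, data):
        if self.keep_depth > 0 and self.skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_text(html):
    """Return the text found between kept tags, joined by single spaces."""
    parser = TagTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The string returned by extract_text() is what would then be handed to
the indexer as the document's text field, in place of the raw HTML.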

Bill
