Re: clean up html before indexing or add tags to ignore list

Andy Goodell Thu, 13 May 2004 09:05:38 -0700

If you are running Linux, I recommend running lynx with the -dump
option before indexing with Lucene. It dumps the formatted text without
the tags, and it runs very fast in most cases.


- andy g
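
For anyone who wants to script the lynx approach, here is a minimal
sketch. The helper names (lynx_dump_command, html_to_text) are made up
for illustration, and adding -nolist (which drops the numbered link
list lynx normally appends to a dump) is my own assumption about what
you would want in an index, not something suggested in the thread.

```python
import shutil
import subprocess


def lynx_dump_command(source):
    """Build the lynx argument list for a plain-text dump.

    `source` is a local HTML file path or a URL.
    """
    # -dump   : write the rendered text to stdout and exit
    # -nolist : omit the trailing list of link references
    return ["lynx", "-dump", "-nolist", source]


def html_to_text(source, timeout=30):
    """Return the tag-free text of `source` as rendered by lynx."""
    if shutil.which("lynx") is None:
        raise RuntimeError("lynx not found on PATH")
    result = subprocess.run(
        lynx_dump_command(source),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

The returned string can then be handed to whatever code builds your
index documents. Note that this starts one lynx process per page, so
for a large crawl the process start-up cost is worth measuring.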

On Thu, 13 May 2004 03:46:37 -0700 (PDT), Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
> 
> Clean up seems cleaner.  Just extract the textual information from HTML
> using NekoHTML or JTidy or HTMLParser (.sf.net) or some such.
> 
> You can also get fancy and preserve the 'structural' information (e.g.
> H1 text is more important than H2, which is more important than BODY,
> which is more important than DIV, etc.) and combine it with field
> boosting at index time.
> 
> Otis
> 
> 
> 
> --- Sebastian Ho <[EMAIL PROTECTED]> wrote:
> > Hi
> >
> > This is a typical web crawler, indexing, and search application.
> > I have written my crawler and plan to add Lucene next.
> > One question comes to mind: in terms of performance, should I clean
> > up the HTML by removing all tags before indexing, or add all tags to
> > the ignore list at the indexing/search stage?
> >
> > Which is better?
> >
> > Thanks
> >
> > Sebastian Ho
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
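
Otis's idea of keeping the structural information can be sketched
roughly as follows. This uses Python's standard html.parser rather than
NekoHTML, JTidy, or HTMLParser, purely to keep the example
self-contained. The boost numbers are invented; only the ordering
H1 > H2 > BODY > DIV comes from the message above. The class and
function names are hypothetical, and no Lucene API is called here.

```python
from html.parser import HTMLParser

# Hypothetical boost values; only their relative order is from the thread.
BOOSTS = {"h1": 4.0, "h2": 3.0, "body": 2.0, "div": 1.0}
# Elements whose content should never be indexed.
SKIP = {"script", "style"}


class StructuredTextExtractor(HTMLParser):
    """Collect text per structural element instead of one flat string."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []      # currently open tracked tags, innermost last
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.fields = {tag: [] for tag in BOOSTS}

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in BOOSTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.stack:
            # Pop back to the matching open tag, tolerating unclosed tags.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if not text or self.skip_depth or not self.stack:
            return
        # Attribute the text to the innermost tracked element.
        self.fields[self.stack[-1]].append(text)


def extract_fields(html):
    """Return {tag: (text, boost)} for every tracked tag that had text."""
    parser = StructuredTextExtractor()
    parser.feed(html)
    parser.close()
    return {
        tag: (" ".join(chunks), BOOSTS[tag])
        for tag, chunks in parser.fields.items()
        if chunks
    }
```

Each (text, boost) pair would then become one field of the indexed
document, with the boost applied through whatever mechanism your
indexing library offers at index time.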
