If you are running Linux, I recommend running the HTML through lynx with the -dump option before indexing with Lucene. It dumps the formatted text without the tags, and it runs very fast in most cases.
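
For what it's worth, here is a minimal sketch of how the lynx step might be called from a Java crawler before the text is handed to Lucene. The class and method names (LynxDump, command, dump) are made up for illustration. It assumes lynx is installed and on the PATH, Java 9 or later, and that the dump comes back as UTF-8, which depends on your lynx configuration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of Lucene or lynx.
class LynxDump {

    // Builds the command line "lynx -dump <source>".
    // The source can be a URL or a path to a local HTML file.
    static List<String> command(String source) {
        return List.of("lynx", "-dump", source);
    }

    // Runs lynx and returns the formatted plain text it prints.
    static String dump(String source) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(source))
                // Drop lynx's error output so it cannot end up in the index.
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        // Read everything before waiting, so a full pipe cannot stall lynx.
        // Assumption: the output is UTF-8.
        String text = new String(process.getInputStream().readAllBytes(),
                StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("lynx exited with status " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java LynxDump <url-or-html-file>");
            return;
        }
        System.out.print(dump(args[0]));
    }
}
```

The string returned by dump() is what you would put into the document's text field. Starting one lynx process per page has a cost of its own, so it is worth timing this against an in-process parser on your own data.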
- andy g

On Thu, 13 May 2004 03:46:37 -0700 (PDT), Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Clean-up seems cleaner. Just extract the textual information from the HTML
> using NekoHTML or JTidy or HTMLParser (.sf.net) or some such.
>
> You can also get fancy and preserve the 'structural' information (e.g.
> H1 text is more important than H2, which is more important than BODY,
> which is more important than DIV, etc.) and combine it with field
> boosting at index time.
>
> Otis
>
> --- Sebastian Ho <[EMAIL PROTECTED]> wrote:
> > Hi
> >
> > This is a typical web crawler, indexing, and search application
> > development. I have written my crawler and am planning to add Lucene
> > next.
> > One question comes to mind: in terms of performance, do I clean up
> > the HTML, removing all tags before indexing, or do I add all tags to
> > the ignore list during the indexing/search stage?
> >
> > Which is better?
> >
> > Thanks
> >
> > Sebastian Ho

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]