Re: HTMLParser

Robert Goene Fri, 13 May 2005 12:16:39 -0700

[EMAIL PROTECTED] wrote:

I am trying to extend the current HTMLParser of Lenya 1.2.1 to support
keywords.

Lucene can index data (removing all tags) into several fields which
can be used by search.  The default is to crawl a website for all HTML
pages, then index the entire page into a "content" field.  My version
of search indexes the XML documents in {pub}/content/live, keeps the
"content" field, and adds fields for "language", "title", and
"description".  Each field is configured using an XPATH expression.
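To make the "one XPATH expression per field" idea concrete, here is a minimal sketch using only the JDK's javax.xml.xpath API. It is not Lenya's indexer or its configuration format: the class name FieldExtractor, the field-to-XPath map, and the sample paths are all made up for illustration. It also assumes the document has no XML namespace; namespaced XHTML would additionally need a NamespaceContext.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Lenya or Lucene.
public class FieldExtractor {

    /**
     * Evaluates one XPath expression per field against the given XML
     * and returns a map of field name to extracted text.
     */
    public static Map<String, String> extract(String xml, Map<String, String> fieldXPaths) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : fieldXPaths.entrySet()) {
            try {
                // A StringReader can only be read once, so build a fresh
                // InputSource for every expression.
                InputSource source = new InputSource(new StringReader(xml));
                String value = xpath.evaluate(entry.getValue(), source);
                fields.put(entry.getKey(), value.trim());
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException(
                        "Bad XPath or XML for field " + entry.getKey(), e);
            }
        }
        return fields;
    }
}
```

An expression that matches nothing yields an empty string here, so a field that shows up in the index without content may simply mean the XPath did not match.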

So the easy answer should be:
1. Decide to index Lenya's XML or HTML.  If HTML, make certain the
keywords are displayed in the header so they can be accessed using
XPATH.
2. Configure Lucene to add keywords to a new field.  Create the index.
3. Change the Search page to allow selection by keywords.

This leaves me with only one question: how should I add the keywords?
Right now they are a single string, with \n as the separator between
keywords. I would also like to add a boost factor to each individual
keyword.
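One way to carry a boost inside the existing newline-separated string is sketched below. The "keyword^boost" line format is an assumption made for this example (it borrows the caret that Lucene's query syntax uses for boosts); it is not an existing Lenya format. The class name KeywordParser is also hypothetical. Lines without a caret get a boost of 1.0.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lenya or Lucene.
public class KeywordParser {

    /**
     * Parses a newline-separated keyword string into keyword -> boost.
     * Each line is either "keyword" or "keyword^boost" (assumed format).
     */
    public static Map<String, Float> parse(String raw) {
        Map<String, Float> result = new LinkedHashMap<>();
        for (String line : raw.split("\n")) {
            String entry = line.trim();
            if (entry.isEmpty()) {
                continue; // skip blank lines
            }
            String keyword = entry;
            float boost = 1.0f; // neutral boost when none is given
            int caret = entry.lastIndexOf('^');
            if (caret > 0) {
                keyword = entry.substring(0, caret).trim();
                // Throws NumberFormatException if the boost is not a number.
                boost = Float.parseFloat(entry.substring(caret + 1).trim());
            }
            result.put(keyword, boost);
        }
        return result;
    }
}
```

The resulting map could then be handed to whatever code builds the index fields; how the boost is applied there depends on the indexer and is left open in this sketch.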

The alternative would be a nice extension of the Lenya GUI to edit an
XML list of keywords and boost factors. To a Lenya newbie like me, this
sounds more Lenya-like. Any suggestions?

Thanks Michi (see his post): Lucene's default is for HTML, but any
configuration requires XML, so you'll be working with XML.

You can create a new "keywords" field for use by the Search front-end.
Lucene indexes on words, so separating with a space works well.  It
does not do well separating using tags, because they are removed
without adding a whitespace separator.  (I think that is called a
bug.)
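The effect described above is easy to reproduce outside Lenya. The sketch below is not the actual HTMLParser code: it uses a plain regular expression as a stand-in for tag removal, and the class name TagStripper is hypothetical. It only shows why dropping tags without inserting whitespace glues adjacent keywords into one word.

```java
// Hypothetical illustration, not the Lenya HTMLParser.
public class TagStripper {

    /** Removes tags and inserts nothing, so neighbouring words merge. */
    public static String stripNaive(String markup) {
        return markup.replaceAll("<[^>]*>", "");
    }

    /** Replaces each tag with a space, then collapses runs of whitespace. */
    public static String stripWithSpaces(String markup) {
        return markup.replaceAll("<[^>]*>", " ")
                     .replaceAll("\\s+", " ")
                     .trim();
    }
}
```

For the input <keyword>lenya</keyword><keyword>lucene</keyword>, stripNaive returns "lenyalucene", which would be indexed as a single word, while stripWithSpaces returns "lenya lucene".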

What business purpose would "boost" help?  Lucene would probably need
to be completely rewritten to support something like it.  Can you
design an interface that adds enough value to compensate for the extra
confusion?

The boost is a very nice Lucene feature for fine-tuning search results. I need it because my indexed documents will have very similar keywords, so I need a more sophisticated mechanism to control the search results. I think I'll take a look at the ConfigurableIndexer and maybe add a field type to parse the content.

Does anyone have a configuration file for indexing an XHTML file? I seem to be able to add fields to the index, but they end up without any content...

Regards, Robert

