[EMAIL PROTECTED] wrote:
On 5/12/05, Robert Goene <[EMAIL PROTECTED]> wrote:

I am trying to extend the current HTMLParser of Lenya 1.2.1 to support
keywords.

Is there an XML parser for Lucene somewhere? It should be fairly easy. The documents that I am indexing are XHTML, so there is no need for a parser that can handle malformed HTML files.


I am trying to understand the purpose of this, so let me know if this
answer is completely off-topic.  I believe your issue can be solved
without touching Java.

Completely on-topic.

I do not think Lucene cares whether data is HTML or XML; it treats it all as XML. I have not tried it with poorly written HTML, since Lenya always closes tags in the correct order, and I have only used Lucene with Lenya.

Lucene can index data (removing all tags) into several fields which
can be used by search.  The default is to crawl a website for all HTML
pages, then index the entire page into a "content" field.  My version
of search indexes the XML documents in {pub}/content/live, keeps the
"content" field, and adds fields for "language", "title", and
"description".  Each field is configured using an XPATH expression.
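As a rough illustration of the field-per-XPath idea, here is a minimal sketch that uses only the JDK's XML and XPath classes; it does not touch Lucene or Lenya's own indexer configuration. The class name FieldExtractor, the field names, and the XPath expressions are assumptions chosen for the example, and the input is assumed to be well-formed XHTML without a namespace declaration or DOCTYPE.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class FieldExtractor {
    // Hypothetical mapping of index field name to XPath expression.
    static final Map<String, String> FIELD_XPATHS = new LinkedHashMap<>();
    static {
        FIELD_XPATHS.put("title", "/html/head/title");
        FIELD_XPATHS.put("description", "/html/head/meta[@name='description']/@content");
        FIELD_XPATHS.put("keywords", "/html/head/meta[@name='keywords']/@content");
        FIELD_XPATHS.put("content", "/html/body");
    }

    private FieldExtractor() {}

    /** Returns one tag-free string per configured field; missing nodes yield "". */
    static Map<String, String> extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> fields = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : FIELD_XPATHS.entrySet()) {
                // The string value of a node is its text content with all tags removed.
                fields.put(e.getKey(), xpath.evaluate(e.getValue(), doc).trim());
            }
            return fields;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Input is not well-formed XML", ex);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Search</title>"
                + "<meta name=\"keywords\" content=\"lenya, lucene\"/></head>"
                + "<body><p>Hello <b>world</b></p></body></html>";
        System.out.println(extract(page));
    }
}
```

Each extracted string would then be added to the index as its own field. One caveat: an XML parser normalizes a literal newline inside an attribute value to a space, so newline-separated keywords would not survive in a meta content attribute unless they are written as character references.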

So the easy answer should be:
1. Decide whether to index Lenya's XML or HTML.  If HTML, make certain the
keywords are displayed in the header so they can be accessed using
XPATH.
2. Configure Lucene to add keywords to a new field.  Create the index.
3. Change the Search page to allow selection by keywords.


This only leaves me with the question of how I should add the keywords. Right now, it is just one string with a \n separator for the different keywords. I would also like to add a boost factor to the individual keywords.
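One possible way to carry a boost in the existing newline-separated string is to append it to each keyword after a caret, as Lucene's query syntax does for term boosts. The sketch below parses such a string with plain Java; the "keyword^boost" storage format, the class name KeywordParser, and the default boost of 1.0 are assumptions for illustration, not something Lenya provides.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class KeywordParser {
    private KeywordParser() {}

    /**
     * Parses newline-separated keywords, each optionally followed by "^boost".
     * Keywords without a boost get 1.0. A malformed boost value causes a
     * NumberFormatException.
     */
    static Map<String, Float> parse(String raw) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        for (String line : raw.split("\\r?\\n")) {
            String entry = line.trim();
            if (entry.isEmpty()) {
                continue; // skip blank lines
            }
            String keyword = entry;
            float boost = 1.0f;
            int caret = entry.lastIndexOf('^');
            if (caret > 0) {
                keyword = entry.substring(0, caret).trim();
                boost = Float.parseFloat(entry.substring(caret + 1).trim());
            }
            boosts.put(keyword, boost);
        }
        return boosts;
    }

    public static void main(String[] args) {
        System.out.println(parse("lenya^2.0\nlucene\nsearch^0.5"));
    }
}
```

How the parsed boost is then handed to the index depends on the Lucene version in use, so that step is left out here.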


The alternative would be a nice extension of the Lenya GUI to edit an XML list of keywords and boost factors. This sounds more Lenya-like to a Lenya newbie like me. Any suggestions?

