> >>>>I am trying to extend the current HTMLParser of Lenya 1.2.1 to support
> >>>>keywords.

> > Lucene can index data (removing all tags) into several fields which
> > can be used by search.  The default is to crawl a website for all HTML
> > pages, then index the entire page into a "content" field.  My version
> > of search indexes the XML documents in {pub}/content/live, keeps the
> > "content" field, and adds fields for "language", "title", and
> > "description".  Each field is configured using an XPATH expression.

> > So the easy answer should be:
> > 1. Decide whether to index Lenya's XML or HTML.  If HTML, make certain
> > the keywords are displayed in the header so they can be accessed using
> > XPATH.
> > 2. Configure Lucene to add keywords to a new field.  Create the index.
> > 3. Change the Search page to allow selection by keywords.
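
As a rough illustration of step 2, here is a minimal sketch of pulling field
values out of a document with XPath, using only the JDK. It does not use
Lenya's or Lucene's own configuration; the class name, the field names, and
the XPath expressions are assumptions made up for the example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class FieldExtractor {
    // Hypothetical field-to-XPath mapping; adjust to the real document structure.
    static final Map<String, String> FIELD_XPATHS = new LinkedHashMap<>();
    static {
        FIELD_XPATHS.put("title", "/html/head/title");
        FIELD_XPATHS.put("keywords", "/html/head/meta[@name='keywords']/@content");
    }

    // Returns one string per configured field (empty if nothing matches).
    static Map<String, String> extract(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : FIELD_XPATHS.entrySet()) {
            try {
                // evaluate() yields the string value of the first matching node.
                String value = xpath.evaluate(
                        e.getValue(), new InputSource(new StringReader(xml)));
                fields.put(e.getKey(), value.trim());
            } catch (XPathExpressionException ex) {
                throw new IllegalArgumentException("Cannot evaluate " + e.getValue(), ex);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String xml = "<html><head><title>News</title>"
                + "<meta name=\"keywords\" content=\"lenya lucene search\"/>"
                + "</head><body/></html>";
        System.out.println(extract(xml));
    }
}
```

Each extracted string would then be stored in the index field of the same
name, so the Search page can query "keywords" separately from "content".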

> This only leaves me with the question of how I should add the keywords.
> Right now, it is just one string with a \n separator for the different
> keywords. I would also like to add a boost factor to the individual
> keywords.

> The alternative would be a nice extension of the Lenya GUI to edit an
> XML list of keywords and boost factors. This sounds more Lenya-like to a
> Lenya newbie like me. Any suggestions?

Thanks to Michi (see his post): Lucene's default is for HTML, but any
configuration requires XML, so you'll be working with XML.

You can create a new "keywords" field for use by the Search front-end.
Lucene indexes on words, so separating the keywords with spaces works
well.  Separating them with tags does not work well, because the tags
are removed without adding a whitespace separator, so adjacent words
run together.  (I think that is called a bug.)
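
To make the separator problem concrete, here is a small sketch in plain
Java. It is not Lenya or Lucene code; the class and method names are
invented for the example, and the regex-based tag stripping is a naive
assumption that ignores comments and CDATA sections.

```java
class KeywordText {
    // Replace every tag with a space so that text from adjacent elements
    // does not fuse into one word, then collapse runs of whitespace.
    static String stripTags(String xml) {
        return xml.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Turn a newline-separated keyword string into a space-separated one.
    static String joinKeywords(String raw) {
        return raw.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Removing tags with no separator would give "lenyalucene".
        System.out.println(stripTags("<k>lenya</k><k>lucene</k>"));
        System.out.println(joinKeywords("lenya\nlucene\nsearch"));
    }
}
```

Note that a multi-word keyword such as "content management" ends up as two
separate words in the field once everything is space-separated.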

What business purpose would "boost" serve?  Lucene would probably need
to be completely rewritten to support something like it.  Can you
design an interface that adds enough value to compensate for the extra
confusion?

solprovider
