On 19 Jan 2004, at 17:26, Erik Hatcher wrote:
On Jan 19, 2004, at 10:31 AM, Stefano Mazzocchi wrote:
A while ago, thinking about this, I proposed the addition of a numerical namespace on the Lucene mailing list, but the suggestion didn't catch on. [I also have the impression they didn't get my point, but it was low priority so I dropped the subject.]
I don't recall that thread, but Lucene can index numerical information as long as you take care to make sure it is lexicographically ordered. For example, I often index dates as YYYYMMDD format so that they are ordered "alphabetically".
Lucene does not provide any built-in mechanisms for turning a number into such a lexicographically ordered string, but it does for timestamps. It would only be a few lines of custom code to convert back and forth, but Lucene deals with text only, not numbers directly.
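The "few lines of custom code" could look something like the sketch below. It is only an illustration, not part of Lucene: the class name PaddedNumber, the fixed width of 10 digits, and the restriction to non-negative ints are all assumptions made for the example. Zero-padding to a fixed width makes the string order of the terms match their numeric order.

```java
import java.util.Locale;

// Hypothetical helper, not a Lucene class: turns non-negative ints into
// fixed-width strings whose lexicographic order matches numeric order.
final class PaddedNumber {
    // Assumption: 10 digits is enough, since Integer.MAX_VALUE has 10 digits.
    private static final int WIDTH = 10;

    private PaddedNumber() {}

    // Convert a number into a zero-padded term suitable for text indexing.
    static String encode(int value) {
        if (value < 0) {
            // Negative numbers would need a different scheme (e.g. an offset).
            throw new IllegalArgumentException("only non-negative values are supported");
        }
        // Locale.ROOT keeps the output in plain ASCII digits.
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", value);
    }

    // Convert an indexed term back into the original number.
    static int decode(String term) {
        return Integer.parseInt(term);
    }
}
```

With this, PaddedNumber.encode(9) sorts before PaddedNumber.encode(10), whereas the raw strings "9" and "10" sort the other way round.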
My goal was to have an XML indexer that was able to index stuff like
<p xmlns="...">This is an <b>important message</b></p>
and have a way to associate a 'semantic relevance rating' with the various schemas and element names.
So, if you looked for "message" and message was text inside a <b> tag inside a <p> tag in a particular namespace, you would index "message" with a "relevance rate" associated with the location in the XML tree.
Note: this relevance rating is sort of an "importance sheet": think of CSS, but for how meaningful information is in your text. This could be a per-namespace thing.
This allows an indexer to index XML more meaningfully (the semantic relevance rating could even be zero, which could be useful for some SVG text, for example), and with a complexity that depends on the number of schemas/namespaces used, not on the number of documents (just like stylesheets).
These "importance-sheets" are the first step toward semantic-sheets, but this is another story.
Anyway, Lucene has to support the ability to say:
index the token "message" and give this particular token in this particular context a rating of "0.34"
then, at query time, it should have been able to use that rating info to aggregate the importance of that page in that query in that context.
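To make the idea concrete, here is a rough sketch in plain Java. It does not use any Lucene API, and every name in it (ImportanceSheet, RatedIndex, the "p/b" path notation, the placeholder namespace string) is made up for the example. Summing the ratings per document is also just one assumed way of aggregating them at query time.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical "importance sheet": maps (namespace, element path) to a rating.
final class ImportanceSheet {
    private final Map<String, Double> ratings = new HashMap<>();
    private final double defaultRating;

    ImportanceSheet(double defaultRating) {
        this.defaultRating = defaultRating;
    }

    // Declare how relevant text is at a given location in a given namespace.
    ImportanceSheet rate(String namespace, String path, double rating) {
        ratings.put(namespace + "|" + path, rating);
        return this;
    }

    // Look up the rating for a location, falling back to the default.
    double ratingFor(String namespace, String path) {
        return ratings.getOrDefault(namespace + "|" + path, defaultRating);
    }
}

// Hypothetical index that stores a rating with every token occurrence.
final class RatedIndex {
    // token -> (document id -> sum of the ratings of all occurrences)
    private final Map<String, Map<String, Double>> postings = new HashMap<>();

    // Index one occurrence of a token together with its context rating.
    void add(String docId, String token, double rating) {
        postings.computeIfAbsent(token, t -> new HashMap<>())
                .merge(docId, rating, Double::sum);
    }

    // Query time: aggregated importance of a document for a token
    // (assumption: aggregation is a plain sum).
    double score(String docId, String token) {
        Map<String, Double> perDoc =
                postings.getOrDefault(token, Collections.<String, Double>emptyMap());
        return perDoc.getOrDefault(docId, 0.0);
    }
}
```

The indexer would consult the sheet once per element location, so a sheet built with rate("urn:example:ns", "p/b", 0.34) lets it call add("doc1", "message", 0.34) for the example above, and score("doc1", "message") then returns the aggregated rating. The size of the sheet grows with the number of namespaces, not with the number of documents.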
But this didn't catch on.
-- Stefano.
