I am trying to think through the approach I should take to build my blog using
Lucene (see the thread above, "Searching Textile Documents").
As you can see from that thread, I am thinking of making the body of my
"Document" a coded form of text which ultimately gets translated into HTML.
So I am currently thinking through what a parser might have to do to
translate from my form of text (probably not Textile, as mentioned in that
thread, but something I can customise to my own needs).
I am planning that the lexical tokenizer will produce a series of tokens
that include (amongst others) <WORDS> and <URL> types (URLs will
NOT be broken into WORDS).
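To make the idea concrete, here is a rough sketch of the kind of token stream I have in mind, in plain Java with a regex standing in for the real lexer (all names here are hypothetical, not from ANTLR or JavaCC):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: classify raw input into URL and WORD tokens,
// keeping each URL as a single token rather than splitting it into words.
public class SketchTokenizer {

    public enum Type { WORD, URL }

    public static class Token {
        public final Type type;
        public final String text;
        public Token(Type type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // The URL alternative comes first, so a URL is consumed whole
    // and never broken into WORD tokens.
    private static final Pattern TOKEN =
        Pattern.compile("(https?://\\S+)|([A-Za-z]+)");

    public static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<Token>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                tokens.add(new Token(Type.URL, m.group(1)));
            } else {
                tokens.add(new Token(Type.WORD, m.group(2)));
            }
        }
        return tokens;
    }
}
```

A real grammar would of course carry much more (markup codes, punctuation, positions), but the WORD/URL split above is the essential behaviour.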
I think I can then make the parser, depending on which non-terminal you
invoke, parse the article bodies in such a way that it will either produce
the HTML, or a list of embedded URLs, or a list of embedded WORDS.
(I am currently considering using ANTLR rather than JavaCC - but that is not
set in stone.)
I would like to use Lucene in two ways (actually several more, but two
relate to this question):
i) Search for words (suitably stopped and filtered) based on the <WORDS>
tokens passed from the lexical tokenizer.
ii) List the <URL> tokens referred to in a given document.
I have been trying to understand the demo which parses and loads HTML
documents.
So, some questions:
a) The HTML document demo doesn't seem to use the Analyzer to determine what
gets indexed, but rather controls the process by determining what happens
inside the Document.add(Field) calls. Is this the better way of doing things,
rather than somehow interfacing the Analyzer into my grammar parser? If I
need to interface into the parser, how?
b) I assume I will want to create two separate Document Fields, with the
original content (a String named content) being the first, of type
Text("content", String content), and the second, for the URLs, being of type
UnStored("url", String urls), where the URLs are those from the article body.
The HTML document demo seems to do something equivalent via a complex
mechanism of multithreading. I think one way I could do it would be simply
to re-parse twice - is there any reason why the demo took this complex
route rather than the simple one of parsing twice?
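The "simple route" of parsing twice could look something like this (plain Java, no Lucene or threads involved; the method names and the whitespace-split stand-in for the real lexer are purely illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "parse twice" route: run over the same
// article body twice, once collecting the plain words destined for the
// "content" field and once collecting the URLs for the "url" field.
public class TwoPassExtractor {

    // First pass: everything that is not a URL.
    public static List<String> extractWords(String body) {
        List<String> words = new ArrayList<String>();
        for (String t : body.split("\\s+")) {
            if (!t.isEmpty()
                    && !t.startsWith("http://")
                    && !t.startsWith("https://")) {
                words.add(t);
            }
        }
        return words;
    }

    // Second pass: only the URLs, each kept whole.
    public static List<String> extractUrls(String body) {
        List<String> urls = new ArrayList<String>();
        for (String t : body.split("\\s+")) {
            if (t.startsWith("http://") || t.startsWith("https://")) {
                urls.add(t);
            }
        }
        return urls;
    }
}
```

For blog-sized articles, running the tokenizer twice like this should be cheap, which is what makes me wonder about the demo's multithreaded approach.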
c) An alternative way of doing the whole thing might be to make the
Document.add calls inside the lexical tokenizer. Since I can add a field
with the same name multiple times (the javadoc says the new content is just
appended), are there any downsides to this approach? Can I still store the
whole document this way - by adding all lexical tokens to one field name
(presumably of type Keyword - field name "content") and using a field name
"urls" of type
(? I can't see how to set the field type to indexed, not stored, not tokenized)
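To picture option (c), the tokenizer could push each token straight into per-field buffers as it goes. This is modelled below with a plain Map rather than a real Lucene Document, just to illustrate the append-on-repeated-add behaviour the javadoc describes; the class and field names are mine, not Lucene's:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative model only: each add() with an existing field name
// appends to what is already stored under that name, which is the
// behaviour the javadoc describes for repeated Document.add calls.
public class FieldBuffer {

    private final Map<String, StringBuilder> fields =
        new LinkedHashMap<String, StringBuilder>();

    public void add(String name, String value) {
        StringBuilder sb = fields.get(name);
        if (sb == null) {
            sb = new StringBuilder();
            fields.put(name, sb);
        } else {
            sb.append(' ');
        }
        sb.append(value);
    }

    public String get(String name) {
        StringBuilder sb = fields.get(name);
        return sb == null ? "" : sb.toString();
    }
}
```

The tokenizer would then call add("content", token) for every token and add("urls", token) only for URL tokens, so the whole document accumulates under "content" while the URLs accumulate separately.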
--
Alan Chandler
http://www.chandlerfamily.org.uk
Open Source. It's the difference between trust and antitrust.