I am trying to think through the approach I should take to build my blog using
Lucene (see the thread above, "Searching Textile Documents").
As you can see from that thread, I am thinking of making the body of my
"Document" a coded form of text which ultimately gets translated into HTML.
So I am currently thinking through what a parser might have to do to
translate from my form of text (probably not Textile, as mentioned in that
thread, but something I can customise to my own needs).
I am planning that the lexical tokenizer will produce a series of tokens
that include (amongst others) <WORDS> and <URL> types (URLs will
NOT be broken into WORDS).
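To make the idea concrete, here is a rough sketch of the kind of token stream I have in mind, in plain Java with a regex standing in for the real lexer (all names here are hypothetical, not from ANTLR or JavaCC):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: classify raw input into URL and WORD tokens,
// keeping each URL as a single token rather than splitting it into words.
public class SketchTokenizer {

    public enum Type { WORD, URL }

    public static class Token {
        public final Type type;
        public final String text;
        public Token(Type type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // The URL alternative comes first, so a URL is consumed whole
    // and never broken into WORD tokens.
    private static final Pattern TOKEN =
        Pattern.compile("(https?://\\S+)|([A-Za-z]+)");

    public static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<Token>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                tokens.add(new Token(Type.URL, m.group(1)));
            } else {
                tokens.add(new Token(Type.WORD, m.group(2)));
            }
        }
        return tokens;
    }
}
```

A real grammar would of course carry much more (markup codes, punctuation, positions), but the WORD/URL split above is the essential behaviour.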
I think I can then make the parser, depending on which non-terminal you
invoke, parse the article bodies in such a way that it will either produce
the HTML, or a list of embedded URLs, or a list of embedded WORDS.
(I am currently considering using ANTLR rather than JavaCC - but that is not
set in stone.)
I would like to use Lucene in two ways (actually several more, but two
relate to this question):
i) Search for words (suitably stopped and filtered) based on the <WORDS>
tokens passed from the lexical tokenizer.
ii) List the <URL> tokens referred to in a given document.
I have been trying to understand the demo which parses and loads HTML
documents.
So, some questions:
a) The HTML document demo doesn't seem to use the Analyzer to determine what
gets indexed, but rather controls the process by determining what happens
inside the Document.add(Field) calls. Is this the better way of doing things,
rather than somehow interfacing the Analyzer into my grammar parser? If I
need to interface into the parser, how?
b) I assume I will want to create two separate Document Fields, with the
original content (a String named content) being the first, of type
Text("content", String content), and the second, for the URLs, being of type
UnStored("url", String urls), where the URLs are those from the article body.
The HTML document demo seems to do something equivalent via a complex
mechanism of multithreading. I think one way I could do it would be simply
to re-parse twice - is there any reason why the demo took this complex
route rather than the simple one of parsing twice?
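The "simple route" of parsing twice could look something like this (plain Java, no Lucene or threads involved; the method names and the whitespace-split stand-in for the real lexer are purely illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "parse twice" route: run over the same
// article body twice, once collecting the plain words destined for the
// "content" field and once collecting the URLs for the "url" field.
public class TwoPassExtractor {

    // First pass: everything that is not a URL.
    public static List<String> extractWords(String body) {
        List<String> words = new ArrayList<String>();
        for (String t : body.split("\\s+")) {
            if (!t.isEmpty()
                    && !t.startsWith("http://")
                    && !t.startsWith("https://")) {
                words.add(t);
            }
        }
        return words;
    }

    // Second pass: only the URLs, each kept whole.
    public static List<String> extractUrls(String body) {
        List<String> urls = new ArrayList<String>();
        for (String t : body.split("\\s+")) {
            if (t.startsWith("http://") || t.startsWith("https://")) {
                urls.add(t);
            }
        }
        return urls;
    }
}
```

For blog-sized articles, running the tokenizer twice like this should be cheap, which is what makes me wonder about the demo's multithreaded approach.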
c) An alternative way of doing the whole thing might be to make the
Document.add calls inside the lexical tokenizer. Since I can add a field
with the same name multiple times (the javadoc says the new content is just
appended), are there any downsides to this approach? Can I still store the
whole document this way - by adding all lexical tokens to one field name
(presumably of type Keyword - field name "content") and using a field name
"urls" of type
(? I can't see how to set the field type to indexed, not stored, not tokenized)
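To picture option (c), the tokenizer could push each token straight into per-field buffers as it goes. This is modelled below with a plain Map rather than a real Lucene Document, just to illustrate the append-on-repeated-add behaviour the javadoc describes; the class and field names are mine, not Lucene's:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative model only: each add() with an existing field name
// appends to what is already stored under that name, which is the
// behaviour the javadoc describes for repeated Document.add calls.
public class FieldBuffer {

    private final Map<String, StringBuilder> fields =
        new LinkedHashMap<String, StringBuilder>();

    public void add(String name, String value) {
        StringBuilder sb = fields.get(name);
        if (sb == null) {
            sb = new StringBuilder();
            fields.put(name, sb);
        } else {
            sb.append(' ');
        }
        sb.append(value);
    }

    public String get(String name) {
        StringBuilder sb = fields.get(name);
        return sb == null ? "" : sb.toString();
    }
}
```

The tokenizer would then call add("content", token) for every token and add("urls", token) only for URL tokens, so the whole document accumulates under "content" while the URLs accumulate separately.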
--
Alan Chandler
http://www.chandlerfamily.org.uk
Open Source. It's the difference between trust and antitrust.