On Monday 19 April 2004 14:01, Mario Ivankovits wrote:
> Stephane James Vaucher wrote:
> > Anyone try what Joerg suggested here?
> > http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]pache.org&msgNo=6231
>
> Don't know what you would like to do, but if you simply would like to
> extract text, you could simply try this snippet:
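(The snippet Mario refers to did not survive in this archive. As a rough illustration only, and not Mario's code: OpenOffice 1.x documents are ZIP archives whose body text lives in content.xml, so a minimal extraction pass could read that entry with java.util.zip and pull character data out with the JDK's SAX parser. The class and element names below are my own sketch.)

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

/*
 * Hypothetical sketch: extract plain text from an OpenOffice 1.x
 * content.xml stream. In a real indexer you would obtain the stream
 * via ZipFile.getInputStream(zip.getEntry("content.xml")) and hand
 * the resulting string to Lucene as a field value.
 */
public class OOTextExtractor extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    // Collect all character data, regardless of the enclosing element.
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Add whitespace at paragraph/heading boundaries so words
    // from adjacent paragraphs do not run together.
    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("text:p") || qName.equals("text:h")) {
            text.append(' ');
        }
    }

    public String getText() {
        return text.toString().trim();
    }

    public static String extract(String xml) throws Exception {
        OOTextExtractor handler = new OOTextExtractor();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<office:body>"
            + "<text:p>Hello</text:p><text:p>world</text:p>"
            + "</office:body>";
        System.out.println(extract(xml));
    }
}
```

Metadata could be handled the same way by parsing meta.xml from the same ZIP archive into a separate string, which maps naturally onto separate Lucene fields.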
This leads to a question I was thinking about: it seems this thread
originally started with someone pointing out that OO can be used as a
converter from other formats... but how about a tokenizer for native OO
documents? I have written full-featured converters from OO to
(simplified) DocBook and HTML, and creating one that just tokenizes for
use by Lucene would be much easier. Even if it were to tokenize into
separate fields (document metadata, content, maybe bibliography
separately, etc.), it'd be easy to do.

Would anyone find a full-featured, customizable OpenOffice document
tokenizer useful?

-+ Tatu +-

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]