> I suspect that my issue is getting the Field constructor to use a
> different tokenizer. Can anyone help?
You basically need to write your own Tokenizer. (You can always write a
corresponding JavaCC grammar; compiling it generates the Tokenizer for
you.)
Then you need to extend org.apache.lu
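As a rough illustration of the hand-written route, here is a minimal sketch of the tokenizing logic only. It is plain Java and deliberately does not extend any Lucene class, because the Lucene base classes and their method signatures differ between versions. The class name PartNumberTokenizer and the rule that '-' and '_' are part of a token are assumptions made up for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone class: it does NOT extend any Lucene type.
class PartNumberTokenizer {
    private final Reader input;

    PartNumberTokenizer(Reader input) {
        this.input = input;
    }

    // Assumed rule for this sketch: letters, digits, '-' and '_' form a token.
    private static boolean isTokenChar(int c) {
        return Character.isLetterOrDigit(c) || c == '-' || c == '_';
    }

    // Returns the next lower-cased token, or null at end of input.
    String next() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            if (isTokenChar(c)) {
                sb.append(Character.toLowerCase((char) c));
            } else if (sb.length() > 0) {
                break; // a delimiter after token characters ends the token
            }
            // a delimiter before any token character is simply skipped
        }
        return sb.length() > 0 ? sb.toString() : null;
    }

    // Convenience helper: collects all tokens of a string into a list.
    static List<String> tokenize(String text) {
        PartNumberTokenizer tokenizer = new PartNumberTokenizer(new StringReader(text));
        List<String> tokens = new ArrayList<>();
        try {
            for (String t = tokenizer.next(); t != null; t = tokenizer.next()) {
                tokens.add(t);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }
}
```

To use something like this with Lucene, the same loop would go inside a subclass of Lucene's own Tokenizer, written against whichever Lucene version you are running.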
You can use text extractors for the document formats you mentioned.
Lucene itself does not handle text extraction; you extract the text
first and then hand the plain text to Lucene for indexing.
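One way to keep extraction separate from indexing is a small lookup from file extension to extractor, sketched below. This is only an assumed design, not something Lucene provides: ExtractorRegistry, register and extract are hypothetical names, and each extractor is modelled as a function from raw bytes to plain text. Real PDF or Word extractors would be wrapped and registered the same way; that wrapping is not shown here.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: picks a text extractor based on the file extension.
class ExtractorRegistry {
    // Lower-case extension -> function turning raw file bytes into plain text.
    private final Map<String, Function<byte[], String>> extractors = new HashMap<>();

    void register(String extension, Function<byte[], String> extractor) {
        extractors.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    // Returns the plain text that would then be handed to the indexer.
    String extract(String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        Function<byte[], String> extractor = extractors.get(ext);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for: " + fileName);
        }
        return extractor.apply(content);
    }
}
```

The string returned by extract is what you would put into the indexed field of your Lucene Document.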
These are the extractors we generally use:
PDF  -> PDFBox (Java API to read PDF documents): http://www.pdfbox.org
Word -> Antiword: http://www