Not sure about WordExtractor, does it also take a Reader?  I would try:

Reader input = new InputStreamReader(new FileInputStream(file), "ENCODING");
WordExtractor extractor = new WordExtractor(input);
content = extractor.getText();

Note: ENCODING is whatever encoding the file is in, as in "UTF-8", if that is what your files are in. If you don't know the encoding, you will need to add in some type of character encoding detection tool. Search the web for that, as I know there are some out there (I don't know any names off hand).

Bottom line, it sounds like you need to figure out how to load your files based on their encoding. That problem is not really core to Lucene, but you should be able to search the archives here to find others with similar questions.

-Grant

On Feb 18, 2008, at 8:13 AM, kratoras wrote:


No problem about the misunderstanding.
I am using

InputStream input =new URL (  "file:///"+file.getAbsolutePath()
).openStream ();
WordExtractor  extractor = new WordExtractor(input);
content=extractor.getText();

where the wordextractor is org.apache.poi.hwpf.extractor.WordExtractor;

The wordextractor takes an inputstream as an arguement. Should i determine
the encoding of the inputstream and how?
--
View this message in context: 
http://www.nabble.com/Problem-using-Lucene-on-Ubuntu-tp15543843p15545082.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to