Not sure about WordExtractor, does it also take a Reader? I would try:
Reader input = new InputStreamReader(new FileInputStream(file),
"ENCODING");
WordExtractor extractor = new WordExtractor(input);
content = extractor.getText();
Note: ENCODING is whatever encoding the file is in, as in "UTF-8", if
that is what your files are in. If you don't know the encoding, you
will need to add in some type of character encoding detection tool.
Search the web for that, as I know there are some out there (I don't
know any names off hand).
Bottom line, it sounds like you need to figure out how to load your
files based on their encoding. That problem is not really core to
Lucene, but you should be able to search the archives here to find
others with similar questions.
-Grant
On Feb 18, 2008, at 8:13 AM, kratoras wrote:
No problem about the misunderstanding.
I am using
InputStream input =new URL ( "file:///"+file.getAbsolutePath()
).openStream ();
WordExtractor extractor = new WordExtractor(input);
content=extractor.getText();
where the wordextractor is
org.apache.poi.hwpf.extractor.WordExtractor;
The wordextractor takes an inputstream as an arguement. Should i
determine
the encoding of the inputstream and how?
--
View this message in context:
http://www.nabble.com/Problem-using-Lucene-on-Ubuntu-tp15543843p15545082.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]