Good point, Jan!
On Feb 18, 2008, at 9:00 AM, Jan Peter Stotz wrote:
Grant Ingersoll wrote:
Note: ENCODING is whatever encoding the file is in, as in "UTF-8",
if that is what your files are in.
I think there is a misunderstanding: the WordExtractor extracts text
from MS Word (.doc) files. Those files are binary and therefore do
not have a text encoding.
I would write the extracted text to a plain text file on each system
and compare the file generated on Windows with the one generated on
Linux/Ubuntu. If the files differ, the problem is in the
WordExtractor step; if they are identical, it is a Lucene problem.
Jan
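A minimal sketch of the check Jan describes, in plain Java. It assumes you already have the extracted text as a String from your existing WordExtractor call. The class and method names (ExtractedTextDump, dump, firstDifference) are made up for illustration and are not part of POI or Lucene. The text is written with an explicit UTF-8 charset so that the platform default encoding on Windows or Ubuntu cannot influence the comparison.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of POI or Lucene.
final class ExtractedTextDump {

    // Write the extracted text using an explicit charset, so the
    // platform default encoding plays no role in what lands on disk.
    static void dump(String extractedText, Path out) throws IOException {
        Files.write(out, extractedText.getBytes(StandardCharsets.UTF_8));
    }

    // Return the index of the first differing character between the two
    // dumps, or -1 if their contents are identical.
    static int firstDifference(Path a, Path b) throws IOException {
        String left = new String(Files.readAllBytes(a), StandardCharsets.UTF_8);
        String right = new String(Files.readAllBytes(b), StandardCharsets.UTF_8);
        int common = Math.min(left.length(), right.length());
        for (int i = 0; i < common; i++) {
            if (left.charAt(i) != right.charAt(i)) {
                return i;
            }
        }
        // Same prefix: identical if lengths match, otherwise one file is longer.
        return left.length() == right.length() ? -1 : common;
    }
}
```

Run dump on both machines against the same .doc file, copy one output file over, and call firstDifference on the pair. A result of -1 suggests the extraction step behaves the same on both systems, which would point towards the indexing or search side instead.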
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]