Hi all, Martin was right. I just adapted the HTML demo as Wallen recommended and it worked. Now I only have to deal with some crazy documents that are UTF-8 encoded and mixed with entities. Does anyone know of a class that can translate entities into UTF-8 or any other encoding?
Peter MH

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]

Here is an example method in org.apache.lucene.demo.html HTMLParser that uses a different buffered reader for a different encoding.

    public Reader getReader() throws IOException {
        if (pipeIn == null) {
            pipeInStream = new MyPipedInputStream();
            pipeOutStream = new PipedOutputStream(pipeInStream);
            pipeIn = new InputStreamReader(pipeInStream);
            pipeOut = new OutputStreamWriter(pipeOutStream);
            // check the first 4 bytes for the FFFE marker; if it's there, we know it's UTF-16 encoding
            if (useUTF16) {
                try {
                    pipeIn = new BufferedReader(new InputStreamReader(pipeInStream, "UTF-16"));
                } catch (Exception e) {
                }
            }
            Thread thread = new ParserThread(this);
            thread.start(); // start parsing
        }
        return pipeIn;
    }

-----Original Message-----
From: Martin Remy [mailto:[EMAIL PROTECTED]]

The tokenizers deal with Unicode characters (CharStream, char), so the problem is not there. This problem must be solved at the point where the bytes from your source files are turned into CharSequences/Strings, i.e. by connecting an InputStreamReader to your FileInputStream (or whatever you're using) and specifying "UTF-8" (or whatever encoding is appropriate) in the InputStreamReader constructor. You must either detect the encoding from HTTP headers or XML declarations or, if you know that it's the same for all of your source files, just hardcode UTF-8, for example.

Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
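[Editor's note] For the entity question at the top of the thread, a minimal hand-rolled decoder is one option. This is only a sketch: the class name EntityDecoder and the small named-entity table are illustrative (a complete decoder would need the full HTML 4 entity set), but the numeric character references (&#228; and &#xE4;) are handled generically. Since Java strings are already Unicode, the result can then be written out in UTF-8 or any other encoding via an OutputStreamWriter.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoder {
    // A few common named entities for illustration; a real decoder
    // would carry the complete HTML 4 entity table.
    private static final Map<String, String> NAMED = new HashMap<String, String>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("auml", "\u00E4");
        NAMED.put("ouml", "\u00F6");
        NAMED.put("uuml", "\u00FC");
        NAMED.put("szlig", "\u00DF");
    }

    // Matches &#xE4; (hex), &#228; (decimal), or &auml; (named).
    private static final Pattern ENTITY =
        Pattern.compile("&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([a-zA-Z]+));");

    public static String decode(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            if (m.group(1) != null) {
                // hexadecimal character reference
                replacement = String.valueOf((char) Integer.parseInt(m.group(1), 16));
            } else if (m.group(2) != null) {
                // decimal character reference
                replacement = String.valueOf((char) Integer.parseInt(m.group(2)));
            } else {
                // named entity; leave unknown ones untouched
                String named = NAMED.get(m.group(3));
                replacement = (named != null) ? named : m.group(0);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Müller & Söhne
        System.out.println(decode("M\u0026uuml;ller \u0026amp; S\u0026#246;hne"));
    }
}
```

Note that the (char) cast only covers the Basic Multilingual Plane; references above U+FFFF would need Character.toChars instead.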
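[Editor's note] Combining Martin's advice with the byte-order-mark check from the HTMLParser snippet, a file-based version might look like the sketch below (the class name EncodingAwareReader is hypothetical). It sniffs the first two bytes for a UTF-16 BOM (FE FF or FF FE), pushes them back, and otherwise falls back to a hardcoded UTF-8, as Martin suggests when all source files share one encoding.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class EncodingAwareReader {
    // Returns a Reader for the file, choosing UTF-16 if a byte-order mark
    // is present and falling back to UTF-8 otherwise.
    public static Reader open(File file) throws IOException {
        PushbackInputStream in = new PushbackInputStream(new FileInputStream(file), 2);
        byte[] bom = new byte[2];
        int n = in.read(bom, 0, 2);
        String encoding = "UTF-8";
        if (n == 2 && ((bom[0] == (byte) 0xFE && bom[1] == (byte) 0xFF)
                    || (bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE))) {
            encoding = "UTF-16"; // Java's UTF-16 decoder consumes the BOM itself
        }
        if (n > 0) {
            in.unread(bom, 0, n); // push the sniffed bytes back so nothing is lost
        }
        return new BufferedReader(new InputStreamReader(in, encoding));
    }
}
```

The BOM is pushed back rather than skipped because the "UTF-16" charset decoder uses it to pick big- or little-endian; for HTTP sources the Content-Type charset parameter, when present, should take precedence over sniffing.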