Hi all, Martin was right. I just adapted the HTML demo as Wallen recommended and it worked. Now I only have to deal with some crazy documents that are UTF-8 encoded and mixed with entities. Does anyone know of a class that can translate entities into UTF-8 or any other encoding?
Peter MH

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]

Here is an example method in org.apache.lucene.demo.html HTMLParser that uses a different buffered reader for a different encoding.

    public Reader getReader() throws IOException {
        if (pipeIn == null) {
            pipeInStream = new MyPipedInputStream();
            pipeOutStream = new PipedOutputStream(pipeInStream);
            pipeIn = new InputStreamReader(pipeInStream);
            pipeOut = new OutputStreamWriter(pipeOutStream);
            // check the first 4 bytes for the FFFE marker; if it's there, we know it's UTF-16 encoding
            if (useUTF16) {
                try {
                    pipeIn = new BufferedReader(new InputStreamReader(pipeInStream, "UTF-16"));
                } catch (Exception e) {
                }
            }
            Thread thread = new ParserThread(this);
            thread.start(); // start parsing
        }
        return pipeIn;
    }

-----Original Message-----
From: Martin Remy [mailto:[EMAIL PROTECTED]]

The tokenizers deal with Unicode characters (CharStream, char), so the problem is not there. This problem must be solved at the point where the bytes from your source files are turned into CharSequences/Strings, i.e. by connecting an InputStreamReader to your FileInputStream (or whatever you're using) and specifying "UTF-8" (or whatever encoding is appropriate) in the InputStreamReader constructor. You must either detect the encoding from HTTP headers or XML declarations or, if you know that it's the same for all of your source files, just hardcode UTF-8, for example.

Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
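[Editor's note] For the entity question at the top of the thread, a minimal hand-rolled decoder is one option. This is only a sketch: the class name EntityDecoder and the small named-entity table are illustrative (a complete decoder would need the full HTML 4 entity set), but the numeric character references (&#228; and &#xE4;) are handled generically. Since Java strings are already Unicode, the result can then be written out in UTF-8 or any other encoding via an OutputStreamWriter.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoder {
    // A few common named entities for illustration; a real decoder
    // would carry the complete HTML 4 entity table.
    private static final Map<String, String> NAMED = new HashMap<String, String>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("auml", "\u00E4");
        NAMED.put("ouml", "\u00F6");
        NAMED.put("uuml", "\u00FC");
        NAMED.put("szlig", "\u00DF");
    }

    // Matches &#xE4; (hex), &#228; (decimal), or &auml; (named).
    private static final Pattern ENTITY =
        Pattern.compile("&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([a-zA-Z]+));");

    public static String decode(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            if (m.group(1) != null) {
                // hexadecimal character reference
                replacement = String.valueOf((char) Integer.parseInt(m.group(1), 16));
            } else if (m.group(2) != null) {
                // decimal character reference
                replacement = String.valueOf((char) Integer.parseInt(m.group(2)));
            } else {
                // named entity; leave unknown ones untouched
                String named = NAMED.get(m.group(3));
                replacement = (named != null) ? named : m.group(0);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Müller & Söhne
        System.out.println(decode("M\u0026uuml;ller \u0026amp; S\u0026#246;hne"));
    }
}
```

Note that the (char) cast only covers the Basic Multilingual Plane; references above U+FFFF would need Character.toChars instead.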
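[Editor's note] Combining Martin's advice with the byte-order-mark check from the HTMLParser snippet, a file-based version might look like the sketch below (the class name EncodingAwareReader is hypothetical). It sniffs the first two bytes for a UTF-16 BOM (FE FF or FF FE), pushes them back, and otherwise falls back to a hardcoded UTF-8, as Martin suggests when all source files share one encoding.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class EncodingAwareReader {
    // Returns a Reader for the file, choosing UTF-16 if a byte-order mark
    // is present and falling back to UTF-8 otherwise.
    public static Reader open(File file) throws IOException {
        PushbackInputStream in = new PushbackInputStream(new FileInputStream(file), 2);
        byte[] bom = new byte[2];
        int n = in.read(bom, 0, 2);
        String encoding = "UTF-8";
        if (n == 2 && ((bom[0] == (byte) 0xFE && bom[1] == (byte) 0xFF)
                    || (bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE))) {
            encoding = "UTF-16"; // Java's UTF-16 decoder consumes the BOM itself
        }
        if (n > 0) {
            in.unread(bom, 0, n); // push the sniffed bytes back so nothing is lost
        }
        return new BufferedReader(new InputStreamReader(in, encoding));
    }
}
```

The BOM is pushed back rather than skipped because the "UTF-16" charset decoder uses it to pick big- or little-endian; for HTTP sources the Content-Type charset parameter, when present, should take precedence over sniffing.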