Re: RFR: 8043592: The basic XML parser based on UKit fails to read XML files encoded in UTF-16BE or LE

huizhe wang Tue, 27 May 2014 15:37:18 -0700

Thanks Sherman!

On 5/27/2014 1:46 PM, Xueming Shen wrote:

One more nit,


ln#2876-2879

Do we really need to create a new ReaderUTF8 if the encoding is indeed UTF-8, which I would assume is true for most use scenarios? Maybe the following would be better?

    // Encoding is defined by the xml text decl.
    reader = enc("UTF-8", is.getByteStream());
    expenc = xml(reader);
    if (!expenc.equals("UTF-8")) {
        if (expenc.startsWith("UTF-16")) {
            panic(FAULT);  // UTF-16 must have BOM [#4.3.3]
        }
        reader = enc(expenc, is.getByteStream());
    }

Updated to reflect the above suggestion: http://cr.openjdk.java.net/~joehw/jdk9/8043592/webrev/
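For readers without the webrev at hand, the selection logic in the suggested snippet can be sketched in isolation roughly as follows. This is not the UKit code: the class DeclaredEncoding and its methods are hypothetical, the regular expression is a simplified stand-in for the parser's own xml() step, and the sketch works on a byte-array prefix so it does not address how the real parser re-reads the stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeclaredEncoding {
    // Simplified pattern for the encoding pseudo-attribute of an XML declaration.
    private static final Pattern ENC = Pattern.compile(
            "^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Returns the declared encoding in upper case, or "UTF-8" if none is declared. */
    static String declared(byte[] head) {
        // The declaration itself is ASCII, so a one-byte charset is enough to scan it.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = ENC.matcher(text);
        return m.find() ? m.group(1).toUpperCase(Locale.ROOT) : "UTF-8";
    }

    /** Mirrors the branch structure of the suggestion: keep UTF-8, reject BOM-less UTF-16. */
    static String select(byte[] head) {
        String expenc = declared(head);
        if (!expenc.equals("UTF-8")) {
            if (expenc.startsWith("UTF-16")) {
                // Stand-in for panic(FAULT): UTF-16 without a BOM is rejected.
                throw new IllegalStateException("UTF-16 input must start with a BOM");
            }
            return expenc;  // the caller would create a reader for this encoding
        }
        return "UTF-8";     // the reader that read the declaration can be kept
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(select(doc));
    }
}
```

The point of the suggestion survives in the sketch: a second reader is only needed on the non-UTF-8 path.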

For the performance improvement, I'll create a new bug to track it. Reading is buffered in the regular jaxp parser, unlike in the UKit one. It would be nice if the benchmark had a separate measurement of parsing performance. We currently have only indirect measurements through validation and transform.

Buffer size will affect performance. UKit sets the default to 512 (but actually reads byte by byte from the underlying stream, as you noted), while the jaxp parser defaults to 8k. For a small parser such as UKit, it may make sense to use a smaller buffer.
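As a rough illustration of the buffer-size point (not UKit or jaxp code; the ReadCallCount class and its CountingStream wrapper are hypothetical), the sketch below counts how many times the underlying stream is asked for data when 64 KB are read byte by byte, with a 512-byte buffer, and with an 8k buffer. It measures call counts only, not wall-clock time.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadCallCount {
    /** Hypothetical wrapper that counts every request made to the underlying stream. */
    static final class CountingStream extends InputStream {
        private final InputStream in;
        int calls;

        CountingStream(byte[] data) {
            this.in = new ByteArrayInputStream(data);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return in.read(b, off, len);
        }
    }

    /** Drains the stream one byte per call, as an unbuffered reader would. */
    static int callsByteByByte(byte[] data) throws IOException {
        CountingStream s = new CountingStream(data);
        while (s.read() != -1) {
            // discard
        }
        return s.calls;
    }

    /** Drains the stream in chunks of bufSize bytes. */
    static int callsBuffered(byte[] data, int bufSize) throws IOException {
        CountingStream s = new CountingStream(data);
        byte[] buf = new byte[bufSize];
        while (s.read(buf, 0, buf.length) != -1) {
            // discard
        }
        return s.calls;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[64 * 1024];
        System.out.println(callsByteByByte(data));      // 65537 (includes the EOF call)
        System.out.println(callsBuffered(data, 512));   // 129
        System.out.println(callsBuffered(data, 8192));  // 9
    }
}
```

Most of the reduction in calls is already achieved at 512 bytes, which is consistent with a smaller buffer being reasonable for a small parser; how that translates into parsing time would still need the separate measurement mentioned above.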


-Joe


-Sherman

On 05/27/2014 12:54 PM, Xueming Shen wrote:

On 05/27/2014 10:46 AM, huizhe wang wrote:

Hi,

Are you okay with the updated patch?

Thanks,
Joe


Looks fine to me.

Btw, I took a quick look at the UTF8 reader, and my observation suggests that reading byte by byte from the underlying stream is probably the bottleneck of the overall "parsing". Attached is a buffered version; my simple test (just the parsing, using the default handler, which does nothing) indicates it might double the parsing speed. Sure, the overall performance will depend on the individual handler, but it might be worth considering, every second counts :-) The code is not fully tested though, just for your reference.

-Sherman

package jdk.internal.util.xml.impl;

import java.io.Reader;
import java.io.InputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;

/**
 * UTF-8 transformed UCS-2 character stream reader.
 *
 * This reader converts UTF-8 transformed UCS-2 characters to Java characters.
 *
 * The UCS-2 subset of UTF-8 transformation is described in RFC-2279 #2
 * "UTF-8 definition":
 *  0000 0000-0000 007F   0xxxxxxx
 *  0000 0080-0000 07FF   110xxxxx 10xxxxxx
 *  0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
 *
 * This reader will return an incorrect last character on a broken UTF-8 stream.
 */
public class ReaderUTF8 extends Reader {

    private InputStream is;

    private static final int DEFAULT_BUFFER_SIZE = 8192;
    private byte buf[];
    private int pos, limit;

    /**
     * Constructor.
     *
     * @param is A byte input stream.
     */
    public ReaderUTF8(InputStream is) {
        this.is = is;
        this.buf = new byte[DEFAULT_BUFFER_SIZE];
        this.pos = limit = 0;
    }

    private void fill() throws IOException {
        if (pos >= buf.length) {  // no room left in buffer
            pos = limit = 0;
        }
        int n = is.read(buf, pos, buf.length - pos);
        if (n > 0) {
            limit = n + pos;
        }
    }

    /**
     * Reads characters into a portion of an array.
     *
     * @param cbuf Destination buffer.
     * @param off Offset at which to start storing characters.
     * @param len Maximum number of characters to read.
     * @exception IOException If any IO errors occur.
     * @exception UnsupportedEncodingException If a UCS-4 character occurs
     *  in the stream.
     */
    public int read(char[] cbuf, int off, int len) throws IOException {
        int off0 = off;
        int end = off + len;
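        // Fast path: copy ASCII bytes until the first multi-byte lead byte.
        // The bound must be end (off + len), not len, so that a non-zero
        // starting offset is handled correctly.
        while (off < end) {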
            if (pos >= limit) {
                fill();
                if (pos >= limit) {
                    return (off != off0) ? off - off0 : -1;
                }
            }
            int val = buf[pos] & 0xff;
            if (val >= 0x80) {
                break;
            }
            cbuf[off++] = (char) val;
            pos++;
        }
        while (off < end) {
            if (pos >= limit) {
                fill();
                if (pos >= limit) {
                    return (off != off0) ? off - off0 : -1;
                }
            }
            int val = buf[pos++] & 0xff;
            switch (val & 0xf0) {
                case 0xc0:
                case 0xd0:
                    if (pos >= limit) {
                        fill();
                    }
                    if (pos >= limit) {
                        cbuf[off++] = (char) (((val & 0x1f) << 6) | (is.read() & 0x3f));
                    } else {
                        cbuf[off++] = (char) (((val & 0x1f) << 6) | (buf[pos++] & 0x3f));
                    }
                    break;
                case 0xe0:
                    if (pos >= limit) {
                        fill();
                    }
                    val = (val & 0x0f) << 12;
                    if (pos >= limit) {
                        val |= ((is.read() & 0x3f) << 6);
                    } else {
                        val |= ((buf[pos++] & 0x3f) << 6);
                    }
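                    // Third byte: use the stream only when the buffer is
                    // exhausted (the two branches were swapped).
                    if (pos >= limit) {
                        val |= (is.read() & 0x3f);
                    } else {
                        val |= (buf[pos++] & 0x3f);
                    }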
                    cbuf[off++] = (char) val;
                    break;
                case 0xf0:      // UCS-4 character
                    throw new UnsupportedEncodingException(
                            "UTF-32 (or UCS-4) encoding not supported.");
                default:
                    cbuf[off++] = (char) val;
                    break;
            }
        }
        return off - off0;

    }

    /**
     * Reads a single character.
     *
     * @return The character read, as an integer in the range 0 to 65535
     *  (0x00-0xffff), or -1 if the end of the stream has been reached.
     * @exception IOException If any IO errors occur.
     * @exception UnsupportedEncodingException If a UCS-4 character occurs
     *  in the stream.
     */
    public int read() throws IOException {
        int val;
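        if (pos >= limit) {
            val = is.read();
            if (val < 0) {
                // End of stream: without this check, -1 & 0xf0 == 0xf0 and
                // the switch below would throw instead of returning -1.
                return -1;
            }
        } else {
            val = buf[pos++] & 0xff;
        }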
        switch (val & 0xf0) {
            case 0xc0:
            case 0xd0:
                if (pos >= limit) {
                    val = ((val & 0x1f) << 6) | (is.read() & 0x3f);
                } else {
                    val = ((val & 0x1f) << 6) | (buf[pos++] & 0x3f);
                }
                break;
            case 0xe0:
                val = (val & 0x0f) << 12;
                if (pos >= limit) {
                    val |= ((is.read() & 0x3f) << 6);
                } else {
                    val |= ((buf[pos++] & 0x3f) << 6);
                }
                if (pos >= limit) {
                    val |= (is.read() & 0x3f);
                } else {
                    val |= (buf[pos++] & 0x3f);
                }
                break;
            case 0xf0:  // UCS-4 character
                throw new UnsupportedEncodingException();
            default:
                break;
        }
        return val;
    }

    /**
     * Closes the stream.
     *
     * @exception IOException If any IO errors occur.
     */
    public void close() throws IOException {
        is.close();
    }
}
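
Since the attached reader is described as not fully tested, one way to sanity-check the bit arithmetic from the class comment is to run the same masks and shifts over a byte array and compare the result with the JDK's own decoder. The sketch below is an assumption-laden illustration, not part of the patch: the Utf8BmpDecode class is hypothetical, it expects well-formed, untruncated input, and like the reader it does not validate continuation bytes.

```java
import java.nio.charset.StandardCharsets;

public class Utf8BmpDecode {
    /** Decodes 1- to 3-byte UTF-8 sequences (U+0000..U+FFFF) from a byte array. */
    static String decode(byte[] in) {
        StringBuilder out = new StringBuilder(in.length);
        int i = 0;
        while (i < in.length) {
            int val = in[i++] & 0xff;
            switch (val & 0xf0) {
                case 0xc0:
                case 0xd0:  // 110xxxxx 10xxxxxx
                    val = ((val & 0x1f) << 6) | (in[i++] & 0x3f);
                    break;
                case 0xe0:  // 1110xxxx 10xxxxxx 10xxxxxx
                    val = ((val & 0x0f) << 12)
                            | ((in[i++] & 0x3f) << 6)
                            | (in[i++] & 0x3f);
                    break;
                case 0xf0:  // 4-byte sequences fall outside the 16-bit range
                    throw new IllegalArgumentException(
                            "4-byte UTF-8 sequence not supported");
                default:    // 0xxxxxxx; stray continuation bytes pass through unchanged
                    break;
            }
            out.append((char) val);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "a\u00e9\u20ac";  // one 1-byte, one 2-byte, one 3-byte sequence
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(utf8).equals(text));  // true
    }
}
```

The same comparison could be applied to ReaderUTF8 itself by feeding it a ByteArrayInputStream, ideally with input longer than the 8192-byte buffer so that multi-byte sequences straddle a refill.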
