Hi David,

First, Xerces 1.4.1 does support UTF-16.  If you have a UTF-16 document
that it isn't recognizing, I'd love to see it.

Second, the problem you're having with characters is probably caused by the
fact that Xerces may make multiple characters() callbacks for a single piece
of character content.  The SAX spec explicitly allows implementations to do
this (to avoid buffering problems), and Xerces takes advantage of this
flexibility.
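
In case it helps, here is a minimal sketch of a handler that copes with split
callbacks by appending every chunk to a buffer and only using the text once
the element ends.  The class name BufferingHandler and the "name: value"
output format are made up for illustration; it extends the SAX2
DefaultHandler rather than HandlerBase, and it assumes each element holds
either text or child elements, not both.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the complete text of each element even
// when the parser delivers it in several characters() calls.
class BufferingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final List<String> values = new ArrayList<String>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Start collecting afresh for each element (assumes no mixed content).
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the chunk immediately; the parser may reuse the array
        // once this method returns.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // All chunks for this element have arrived, so the text is complete.
        String value = text.toString().trim();
        if (value.length() > 0) {
            values.add(qName + ": " + value);
        }
        text.setLength(0);
    }

    public List<String> getValues() {
        return values;
    }
}
```

The point of the design is that characters() never treats a single call as a
whole value; only endElement() does.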

I have no idea about the locator problem.  I think Xerces generally does a
fair job of reporting the locations of problems in instance documents,
though it certainly doesn't always do so for grammars.  This is only my
perception, however, and I'd be interested to hear your experiences.

Cheers,
Neil

Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  416-448-3519, T/L 778-3519
E-mail:  [EMAIL PROTECTED]



David Sedlock <[EMAIL PROTECTED]> on 07/13/2001 03:04:17 AM

Please respond to [EMAIL PROTECTED]

To:   [EMAIL PROTECTED]
cc:
Subject:  xml4j_1_1_14 -> xerces-1_4_1


I'm trying to upgrade from the old IBM SAX parser xml4j_1_1_14 to
xerces-1_4_1 and having a little trouble. Maybe someone can help.

-xerces-1_4_1 doesn't support UTF-16. Why not? And when might Xerces
support it?

-I used to get line numbers with Locator. Now I get nothing.

-xerces-1_4_1 seems to be dropping characters. For example, here is a
diff between my test output for xml4j_1_1_14 and xerces-1_4_1:

***************
*** 632,638 ****
                SLMA_OBJ_DTD_ID: 17
                SLMA_DTD_ID: 2
                SLMA_OBJTYP_ID: 10
!               MA_DTD_TOPLEVEL_ELM: DATE
        Tuple: SLMA_OBJ_DTD
                SLMA_OBJ_DTD_ID: 18
                SLMA_DTD_ID: 2
--- 633,639 ----
                SLMA_OBJ_DTD_ID: 17
                SLMA_DTD_ID: 2
                SLMA_OBJTYP_ID: 10
!               SLMA_DTD_TOPLEVEL_ELM: DATE
        Tuple: SLMA_OBJ_DTD
                SLMA_OBJ_DTD_ID: 18
                SLMA_DTD_ID: 2

"SLMA_DTD_TOPLEVEL_ELM" has become "MA_DTD_TOPLEVEL_ELM". There are
scores of such diffs in my test data. I did a little debugging and
came across something interesting:

SLMA_OBJTYP_ID 16295 14
10 16328 2
SL 16382 2
MA_DTD_TOPLEVEL_ELM 0 19
DATE 38 4

This is the result of this line in my HandlerBase.characters method
implementation:

    System.err.println(new String(ch, start, length) + " " + start + " " + length);

16384 happens to be 2^14. So it looks like the characters array is
getting cleaned out at this point and the string that straddles the
boundary is getting cut in two. Here is the next occurrence of the
problem:

GetVersions 16312 11
SLMA_REQ_ 16375 9
DEFAULT_CLASS 0 13

Same effect. The correct value here is "SLMA_REQ_DEFAULT_CLASS".

The same place in the test output with xml4j_1_1_14 looks like this:

SLMA_OBJTYP_ID 0 14
10 0 2
SLMA_DTD_TOPLEVEL_ELM 0 21
DATE 0 4

It looks as if the implementation was changed from using a new
characters array for each stretch of character data to using one array.

Thanks!
David

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




