[ http://nagoya.apache.org/jira/browse/XERCESC-770?page=history ]
Alberto Massari updated XERCESC-770:
------------------------------------
Priority: Major
> IANA charset names list inefficient; useful?
> --------------------------------------------
>
> Key: XERCESC-770
> URL: http://nagoya.apache.org/jira/browse/XERCESC-770
> Project: Xerces-C++
> Type: Bug
> Components: Utilities
> Versions: 2.1.0
> Environment: Operating System: All
> Platform: All
> Reporter: Markus Scherer
> Assignee: Xerces-C Developers Mailing List
>
> The IANA charset names list is stored inefficiently. It alone takes up 200 kB
> in the Xerces library.
> internal/IANAEncodings.hpp contains const XMLCh gEncodingArray[791][128]. This
> uses sizeof(XMLCh)*791*128 bytes, which is 202,496 bytes (about 200 kB) when
> XMLCh is 2 bytes wide. Most of the names are shorter than 15 or so characters,
> and only ASCII characters are ever used in IANA charset names. The names
> should therefore be stored as ASCII bytes, using only as many bytes per name
> as necessary.
> A simpler way to make this array smaller is to rely on the IANA charset
> registration rules, which impose an upper limit of 40 characters for charset
> names. Only two registered names violate this limit (I think), and they could
> be safely omitted. Allowing one extra character for the NUL gives 41 per name;
> 128 characters per name is far more than needed.
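
The arithmetic behind this can be checked at compile time. The sketch below uses only the figures from the report (791 names, 128 versus 40 + 1 characters) and assumes a 2-byte XMLCh; the constant and function names are made up for the example.

```cpp
#include <cstddef>

constexpr std::size_t kSketchNameCount    = 791;     // rows in gEncodingArray
constexpr std::size_t kSketchCurrentWidth = 128;     // current units per name
constexpr std::size_t kSketchCappedWidth  = 40 + 1;  // IANA limit plus NUL

// Size of a fixed-width table: rows * columns * bytes per code unit.
constexpr std::size_t sketchTableBytes(std::size_t count,
                                       std::size_t width,
                                       std::size_t unitSize) {
    return count * width * unitSize;
}

// Current layout, assuming sizeof(XMLCh) == 2.
static_assert(sketchTableBytes(kSketchNameCount, kSketchCurrentWidth, 2) == 202496,
              "current table size");
// Same element type, width capped at 41.
static_assert(sketchTableBytes(kSketchNameCount, kSketchCappedWidth, 2) == 64862,
              "capped table, 2-byte units");
// Width capped at 41 and stored as ASCII bytes.
static_assert(sketchTableBytes(kSketchNameCount, kSketchCappedWidth, 1) == 32431,
              "capped table, 1-byte units");
```

Under these assumptions, capping the width alone cuts the table to roughly a third of its size, and switching to bytes halves it again.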
> I also wonder whether this list is useful at all. Xerces only verifies that a
> name exists in the list. It does not verify that it has a converter for it
> (other than failing to open it, which does not use this list). It cannot verify
> that the charset the XML document declares matches the converter that
> Xerces is going to open for this name (e.g., mismatches between Shift-JIS etc.
> among Windows/Unix/mainframe, see W3C Japanese profile for XML).
> I suggest adding a compile-time option (#ifdef) to remove the IANA charset name
> list (#ifdef out the use of EncodingValidator in util/TransService.cpp).
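
A rough sketch of what such a switch could look like follows. The macro SKETCH_NO_IANA_NAME_LIST and all function names are hypothetical; this is not the actual code in util/TransService.cpp, only a stand-in for the point where a name is checked before a transcoder is opened.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical build switch (not an existing Xerces-C++ macro). Defining it
// compiles out both the name table and the check that uses it.
// #define SKETCH_NO_IANA_NAME_LIST

#ifndef SKETCH_NO_IANA_NAME_LIST
// Small illustrative excerpt standing in for the full IANA list.
static const char* const kSketchRegisteredNames[] = {
    "UTF-8", "ISO-8859-1", "US-ASCII"
};

static bool sketchIsRegisteredName(const char* name) {
    const std::size_t count =
        sizeof(kSketchRegisteredNames) / sizeof(kSketchRegisteredNames[0]);
    for (std::size_t i = 0; i < count; ++i) {
        if (std::strcmp(name, kSketchRegisteredNames[i]) == 0) {
            return true;
        }
    }
    return false;
}
#endif

// Stand-in for the check made before a transcoder is created.
// With the switch defined, only the basic sanity check remains and an
// unknown name is caught later, when opening the converter fails.
bool sketchAcceptEncodingName(const char* name) {
    if (name == 0 || *name == '\0') {
        return false;
    }
#ifndef SKETCH_NO_IANA_NAME_LIST
    if (!sketchIsRegisteredName(name)) {
        return false;
    }
#endif
    return true;
}
```

With the switch left undefined, as above, behaviour is unchanged; a build that defines it drops the table from the library entirely.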
> Note that ICU4C 2.2+ has data structures and APIs for dealing with charset
> names associated with various standards (like IANA) and platforms. ICU4C does
> not have a complete list of IANA names, but this is a matter of adding them to
> its convrtrs.txt, not a real implementation issue.
> Best regards,
> markus