PLEASE DO NOT REPLY TO THIS MESSAGE. TO FURTHER COMMENT ON
THE STATUS OF THIS BUG PLEASE FOLLOW THE LINK BELOW AND USE THE ON-LINE
APPLICATION. REPLYING TO THIS MESSAGE DOES NOT UPDATE THE DATABASE, AND SO
YOUR COMMENT WILL BE LOST SOMEWHERE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=2793

*** shadow/2793 Wed Jul 25 12:50:09 2001
--- shadow/2793.tmp.6221 Wed Jul 25 12:50:09 2001
***************
*** 0 ****
--- 1,87 ----
+ +============================================================================+
+ | xml:lang should support RFC3066 and ISO639-2.                              |
+ +----------------------------------------------------------------------------+
+ |       Bug #: 2793                         Product: Xerces-J                |
+ |      Status: NEW                          Version: 1.4                     |
+ |  Resolution:                             Platform: Other                   |
+ |    Severity: Normal                    OS/Version: Other                   |
+ |    Priority: Other                      Component: Core                    |
+ +----------------------------------------------------------------------------+
+ | Assigned To: [EMAIL PROTECTED]                                             |
+ | Reported By: [EMAIL PROTECTED]                                             |
+ |     CC list: Cc:                                                           |
+ +----------------------------------------------------------------------------+
+ |         URL:                                                               |
+ +============================================================================+
+ |                                DESCRIPTION                                 |
+ The XML spec (Second Edition, section 2.12) states that xml:lang should
+ conform to RFC1766 or its successor. RFC3066 is the successor to RFC1766.
+
+ RFC3066 allows three-letter language codes, as defined by ISO639-2,
+ in addition to the two-letter codes.
+
+ RFC3066 also allows digits in the second and subsequent tags.
+
+ Hence each of the following XML elements is legal but rejected by Xerces-J:
+
+ <x xml:lang="ale"/>
+ <x xml:lang="x-33"/>
+ <x xml:lang="en-US-f5"/>
+
+
+ Moreover, Xerces-J accepts the following two syntactically
+ illegal language tags:
+ <x xml:lang="en-s"/>
+ <x xml:lang="en-abcdefghij"/>
+
+ Both are illegal because, after an ISO-639 code, the second subtag may consist
+ of:
+
+ a two-letter country code from ISO3166
+ or
+
+ between 3 and 8 letters or digits.
+
+ Of these defects, the most important is the rejection of three-letter
+ language codes.
+
+ ===
+
+
+ I take the defect to be in:
+
+ org.apache.xerces.framework.XMLDocumentScanner.checkXMLLangAttributeValue(int)
+
+ That file is unchanged since version 1.4.0, which is the version I used.
+
+ ===
+
+
+ The syntactic constraints are:
+ case-insensitive
+
+ First tag:
+ either "I" or "X" or [A-Z][A-Z] or [A-Z][A-Z][A-Z]
+
+ Second tag:
+ when first tag is "I" or "X" then second tag is
+ [0-9A-Z]{1,8}
+
+ when first tag is [A-Z][A-Z] or [A-Z][A-Z][A-Z] then second tag is
+ [A-Z][A-Z] or [0-9A-Z]{3,8}
+
+ subsequent tags
+ [0-9A-Z]{1,8}
+
+ Other rules depend on having lookup tables from IANA, ISO639 and ISO3166.
+
+
+ ===
+
+ I have Java code that checks values against the tables, which I can
+ forward if you want.
+ The URLs for the tables are:
+
+ http://lcweb.loc.gov/standards/iso639-2/englangn.html
+ http://www.din.de/gremien/nas/nabd/iso3166ma/codlstp1/db_en.html
+ http://www.iana.org/assignments/language-tags
+
+
+ Thanks.
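The syntactic constraints listed in this report can be folded into a single case-insensitive regular expression. What follows is a minimal sketch, not the Xerces-J code and not a proposed patch: the class name LangTagSyntax and the method isSyntacticallyValid are hypothetical, and the sketch assumes that a first tag of "I" or "X" must be followed by at least one more tag, which the list of constraints does not state explicitly.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Xerces-J. */
final class LangTagSyntax {

    // Any tag after the second one: 1 to 8 letters or digits.
    private static final String SUBTAG = "[0-9a-z]{1,8}";

    // First tag "I" or "X": the second tag is 1 to 8 letters or digits.
    // Assumption: at least one tag must follow "I" or "X".
    private static final String IANA_OR_PRIVATE =
            "[ix]-" + SUBTAG + "(?:-" + SUBTAG + ")*";

    // First tag of 2 or 3 letters: the optional second tag is either
    // exactly 2 letters or 3 to 8 letters or digits.
    private static final String ISO_LANGUAGE =
            "[a-z]{2,3}(?:-(?:[a-z]{2}|[0-9a-z]{3,8})(?:-" + SUBTAG + ")*)?";

    private static final Pattern TAG = Pattern.compile(
            IANA_OR_PRIVATE + "|" + ISO_LANGUAGE, Pattern.CASE_INSENSITIVE);

    /** Checks syntax only; no ISO639, ISO3166 or IANA table is consulted. */
    static boolean isSyntacticallyValid(String value) {
        return value != null && TAG.matcher(value).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"ale", "x-33", "en-US-f5", "en-s", "en-abcdefghij"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + isSyntacticallyValid(sample));
        }
    }
}
```

Under these rules the three values that the report calls legal (ale, x-33, en-US-f5) are accepted, while en-s and en-abcdefghij are rejected.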

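The rules that depend on lookup tables could be checked along the following lines once a value has passed the syntactic constraints. This is only a sketch under stated assumptions and is not the Java code offered in the report: LangTagTables and isRegistered are hypothetical names, the three sets hold a handful of entries for illustration where a real check would load the complete ISO639, ISO3166 and IANA lists, and the input is assumed to be non-null and already syntactically valid.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Xerces-J. */
final class LangTagTables {

    // Tiny illustrative subsets, stored in lower case. A real check would
    // load the complete tables instead.
    private static final Set<String> ISO639 =
            new HashSet<>(Arrays.asList("en", "fr", "de", "ale"));
    private static final Set<String> ISO3166 =
            new HashSet<>(Arrays.asList("us", "gb", "fr", "de"));
    private static final Set<String> IANA_TAGS =
            new HashSet<>(Arrays.asList("i-klingon", "i-navajo"));

    /** Assumes the tag is non-null and has already passed the syntax check. */
    static boolean isRegistered(String tag) {
        String lower = tag.toLowerCase(Locale.ROOT);
        String[] subtags = lower.split("-");
        String first = subtags[0];

        if (first.equals("x")) {
            return true; // private use: there is nothing to look up
        }
        if (first.equals("i")) {
            return IANA_TAGS.contains(lower); // IANA registers whole tags
        }
        if (!ISO639.contains(first)) {
            return false; // unknown language code
        }
        if (subtags.length > 1 && subtags[1].length() == 2) {
            return ISO3166.contains(subtags[1]); // two letters: country code
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {"ale", "en-US-f5", "en-ZZ", "i-klingon", "x-33"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + isRegistered(sample));
        }
    }
}
```

Keeping the table lookup separate from the syntax check means the parser can still reject malformed values when the tables are not available.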