Re: [Development] [Question] Implementation of XML character validation

2013-09-08 Thread Giuseppe D'Angelo
Hi,

please, let's keep the discussion on the ML.

On 8 September 2013 23:10, Kurt Pattyn  wrote:
> Hi Giuseppe,
>
> this is not mentioned in the documentation, and, if QChar is following the 
> Unicode v6.2 standard, cannot be correct, as the method unicode() returns a 
> 16-bit value, which even in UTF-16 is too short (UTF-16 encoding can have 2 
> 16-bit values to represent a unicode character).

Yes, unicode() returns the UTF-16 code unit held inside the QChar
(which is just a 16-bit number...). If you want to UTF-16 encode code
points above 0xFFFF, you need surrogate pairs, i.e. pairs of QChars.
QChar itself offers methods such as isHighSurrogate/isLowSurrogate,
and various statics (surrogateToUcs4(QChar, QChar), isSurrogate(uint),
lowSurrogate(uint), etc.)
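
To make the pairing concrete, here is a minimal sketch of the surrogate
arithmetic in plain standard C++, deliberately not using Qt. The helper
names (decodeSurrogatePair, highSurrogateOf, lowSurrogateOf) are made up
for this illustration; in real Qt code the QChar statics are the thing
to use.

```cpp
#include <cassert>

// Hypothetical helpers (not Qt API) showing the UTF-16 surrogate arithmetic.
constexpr bool isHighSurrogateUnit(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool isLowSurrogateUnit(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a high/low surrogate pair into one code point in U+10000..U+10FFFF.
inline char32_t decodeSurrogatePair(char16_t high, char16_t low)
{
    assert(isHighSurrogateUnit(high) && isLowSurrogateUnit(low));
    return 0x10000 + ((char32_t(high) - 0xD800) << 10) + (char32_t(low) - 0xDC00);
}

// Split a code point above 0xFFFF into its two UTF-16 code units.
inline char16_t highSurrogateOf(char32_t ucs4)
{
    assert(ucs4 >= 0x10000 && ucs4 <= 0x10FFFF);
    return char16_t(0xD800 + ((ucs4 - 0x10000) >> 10));
}

inline char16_t lowSurrogateOf(char32_t ucs4)
{
    assert(ucs4 >= 0x10000 && ucs4 <= 0x10FFFF);
    return char16_t(0xDC00 + ((ucs4 - 0x10000) & 0x3FF));
}
```

For example, U+1F600 splits into the code units 0xD83D and 0xDE00, and
decoding that pair gives U+1F600 back.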

HTH,

-- 
Giuseppe D'Angelo
___
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development


Re: [Development] [Question] Implementation of XML character validation

2013-09-08 Thread Thiago Macieira
On Sunday, 8 September 2013 23:42:57, Konstantin Ritt wrote:
> No. QString operates on UCS-2, not UCS-4.

QString operates on UTF-16, not UCS-2.
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center




Re: [Development] [Question] Implementation of XML character validation

2013-09-08 Thread Kurt Pattyn




On 08 Sep 2013, at 22:42, Konstantin Ritt  wrote:

> 2013/9/8 Kurt Pattyn 
>> Couldn't it be a solution to expand QChar to contain 32-bit code points
>> instead of 16-bit, and have the unicode() function return a UCS-4 value?
>> 
>>  At least, I think it would be nice that the checks for valid XML characters 
>> would be concentrated in one place.
> 
> No. QString operates on UCS-2, not UCS-4.
> 
Shouldn't QString be adapted then as well? UCS-2 has been obsolete since
version 2.0 of the Unicode standard.

Regards,
Kurt


Re: [Development] [Question] Implementation of XML character validation

2013-09-08 Thread Thiago Macieira
On Sunday, 8 September 2013 23:05:49, Kurt Pattyn wrote:
> So, either the documentation of QChar should at least indicate that it is
> encoded in UTF-16, which I doubt (because then QChar should have place for
> 2 16-bit values), or QChar should be adapted to conform to the statement
> that Qt5.0 is Unicode 6.2 compliant, or it should be indicated that QChar
> conforms to an older standard.
> 
> So, I was just wondering how QChar fits within the Unicode 6.2 standard?
> Reading the docs doesn't clarify much.

It fits very well: QChar contains one 16-bit unit; it does not actually
enforce UTF-16.

QString is UTF-16. That's where it matters.
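
A small illustration of that difference, as a sketch in plain standard
C++ with std::u16string standing in for QString (an assumption made only
so the example has no Qt dependency; codePointCount is a hypothetical
helper, not an existing API):

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-16 string: each well-formed surrogate pair
// counts once, every other code unit (including a lone surrogate) counts once.
inline std::size_t codePointCount(const std::u16string &s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i, ++n) {
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        if (high && i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            ++i; // skip the low half of the pair
    }
    return n;
}
```

For the string u"a\U0001F600", size() reports 3 code units while
codePointCount reports 2 code points: the single 16-bit unit is the
storage element, and the string as a whole carries the UTF-16 encoding.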

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center




Re: [Development] [Question] Implementation of XML character validation

2013-09-08 Thread Konstantin Ritt
Neither of these relates to some specific Unicode version.
QChar is a 16-bit Unicode character part. QString is UCS-2-encoded. Read
about UTF-16 for more info.
Also !qdoc: QChar::highSurrogate(), QChar::lowSurrogate()

Regards,
Konstantin


2013/9/9 Kurt Pattyn 

> Hi Konstantin,
>
> that is exactly what I did.
> From the Qt5 documentation:
>
> The QChar class provides a 16-bit Unicode character.
>
> In Qt, Unicode characters are 16-bit entities without any markup or
> structure. This class represents such an entity. It is lightweight, so it
> can be used everywhere. Most compilers treat it like a unsigned short.
>
>
> There is no mention of UTF-16. Furthermore, I would expect the method
> unicode() to return the UCS-4 representation:
>
> ushort QChar::unicode() const
>
> Returns the numeric Unicode value of the QChar.
>
>
> But, it doesn't return the unicode v6.2 value, as that can be 32-bit; I
> know. It gives me the impression that QChar is based on an old Unicode
> standard.
>
> Furthermore (again from the documentation):
>
> Qt 5.0 uses and fully supports version 6.2 of the Unicode standard.
>
>
> At least, I think this is confusing.
>
> So, either the documentation of QChar should at least indicate that it is
> encoded in UTF-16, which I doubt (because then QChar should have place for
> 2 16-bit values), or QChar should be adapted to conform to the statement
> that Qt5.0 is Unicode 6.2 compliant, or it should be indicated that QChar
> conforms to an older standard.
>
> So, I was just wondering how QChar fits within the Unicode 6.2 standard?
> Reading the docs doesn't clarify much.
>
> Kurt
>
> On 08 Sep 2013, at 22:43, Konstantin Ritt  wrote:
>
>
> Plz read the docs.
>
> Regards,
> Konstantin
>
>
> 2013/9/8 Kurt Pattyn 
>
>> On 08 Sep 2013, at 20:43, Thiago Macieira 
>> wrote:
>>
>> > On Sunday, 8 September 2013 20:36:39, Kurt Pattyn wrote:
>> >
>> > It's limited by the size of QChar. It cannot contain 0x10000.
>> >
>> Isn't it supposed that a QChar contains a Unicode character (which is
>> 32-bit in size)?
>>
>> Regards,
>> Kurt


Re: [Development] [Question] Implementation of XML character validation

2013-09-08 Thread Thiago Macieira
On Sunday, 8 September 2013 22:39:07, Kurt Pattyn wrote:
> Couldn't it be a solution to expand QChar to contain 32-bit code points
> instead of 16-bit, and have the unicode() function return a UCS-4 value?

Maybe in Qt 7.

(No, this is not a typo)
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center




Re: [Development] [Question] Implementation of XML character validation

2013-09-08 Thread Kurt Pattyn
Hi Konstantin,

that is exactly what I did.
From the Qt5 documentation:
> The QChar class provides a 16-bit Unicode character.
> In Qt, Unicode characters are 16-bit entities without any markup or 
> structure. This class represents such an entity. It is lightweight, so it can 
> be used everywhere. Most compilers treat it like a unsigned short.


There is no mention of UTF-16. Furthermore, I would expect the method
unicode() to return the UCS-4 representation:
> ushort QChar::unicode() const
> Returns the numeric Unicode value of the QChar.

But, it doesn't return the unicode v6.2 value, as that can be 32-bit; I know. 
It gives me the impression that QChar is based on an old Unicode standard.

Furthermore (again from the documentation):

> Qt 5.0 uses and fully supports version 6.2 of the Unicode standard.


At least, I think this is confusing.

So, either the documentation of QChar should at least indicate that it is 
encoded in UTF-16, which I doubt (because then QChar should have place for 2 
16-bit values), or QChar should be adapted to conform to the statement that 
Qt5.0 is Unicode 6.2 compliant, or it should be indicated that QChar conforms 
to an older standard.

So, I was just wondering how QChar fits within the Unicode 6.2 standard? 
Reading the docs doesn't clarify much.

Kurt

On 08 Sep 2013, at 22:43, Konstantin Ritt  wrote:

> 
> Plz read the docs.
> 
> Regards,
> Konstantin
> 
> 
> 2013/9/8 Kurt Pattyn 
>> On 08 Sep 2013, at 20:43, Thiago Macieira  wrote:
>> 
>> > On Sunday, 8 September 2013 20:36:39, Kurt Pattyn wrote:
>> >
>> > It's limited by the size of QChar. It cannot contain 0x10000.
>> >
>> Isn't it supposed that a QChar contains a Unicode character (which is 32-bit 
>> in size)?
>> 
>> Regards,
>> Kurt


Re: [Development] [Question] Implementation of XML character validation

2013-09-08 Thread Giuseppe D'Angelo
On 8 September 2013 22:42, Kurt Pattyn  wrote:
> Isn't it supposed that a QChar contains a Unicode character (which is 32-bit 
> in size)?

No, a QChar is exactly an UTF-16 code unit.

-- 
Giuseppe D'Angelo


Re: [Development] [Question] Implementation of XML character validation

2013-09-08 Thread Konstantin Ritt
Plz read the docs.

Regards,
Konstantin


2013/9/8 Kurt Pattyn 

> On 08 Sep 2013, at 20:43, Thiago Macieira 
> wrote:
>
> > On Sunday, 8 September 2013 20:36:39, Kurt Pattyn wrote:
> >
> > It's limited by the size of QChar. It cannot contain 0x10000.
> >
> Isn't it supposed that a QChar contains a Unicode character (which is
> 32-bit in size)?
>
> Regards,
> Kurt


Re: [Development] [Question] Implementation of XML character validation

2013-09-08 Thread Konstantin Ritt
2013/9/8 Kurt Pattyn 

> Couldn't it be a solution to expand QChar to contain 32-bit code points
> instead of 16-bit, and have the unicode() function return a UCS-4 value?
>
>  At least, I think it would be nice that the checks for valid XML
> characters would be concentrated in one place.
>

No. QString operates on UCS-2, not UCS-4.



> > The code calling this API needs to do the surrogate decoding. This class
> may
> > be interesting for them:
> >
> > https://codereview.qt-project.org/669
> >
> > --
> > Thiago Macieira - thiago.macieira (AT) intel.com
> >  Software Architect - Intel Open Source Technology Center
>
> Kurt


Re: [Development] [Question] Implementation of XML character validation

2013-09-08 Thread Kurt Pattyn
On 08 Sep 2013, at 20:43, Thiago Macieira  wrote:

> On Sunday, 8 September 2013 20:36:39, Kurt Pattyn wrote:
> 
> It's limited by the size of QChar. It cannot contain 0x10000.
> 
Isn't it supposed that a QChar contains a Unicode character (which is 32-bit in 
size)?

Regards,
Kurt


Re: [Development] [Question] Implementation of XML character validation

2013-09-08 Thread Kurt Pattyn
On 08 Sep 2013, at 20:43, Thiago Macieira  wrote:

> On Sunday, 8 September 2013 20:36:39, Kurt Pattyn wrote:
>> bool QXmlUtils::isChar(const QChar c)
>> {
>>     return (c.unicode() >= 0x0020 && c.unicode() <= 0xD7FF)
>>         || c.unicode() == 0x0009
>>         || c.unicode() == 0x000A
>>         || c.unicode() == 0x000D
>>         || (c.unicode() >= 0xE000 && c.unicode() <= 0xFFFD);
>> }
>> Isn't this code missing the check
>> c >= 0x10000 && c <= QChar::LastValidCodePoint ?
> 
> No.
> 
> It's limited by the size of QChar. It cannot contain 0x10000.
> 
> No, the entire API is flawed. It should work in terms of UCS-4, not of QChar.

Couldn't it be a solution to expand QChar to contain 32-bit code points
instead of 16-bit, and have the unicode() function return a UCS-4 value?

At least, I think it would be nice that the checks for valid XML characters 
would be concentrated in one place.

> The code calling this API needs to do the surrogate decoding. This class may 
> be interesting for them:
> 
> https://codereview.qt-project.org/669
> 
> -- 
> Thiago Macieira - thiago.macieira (AT) intel.com
>  Software Architect - Intel Open Source Technology Center

Kurt


Re: [Development] [Question] Implementation of XML character validation

2013-09-08 Thread Thiago Macieira
On Sunday, 8 September 2013 20:36:39, Kurt Pattyn wrote:
> bool QXmlUtils::isChar(const QChar c)
> {
>     return (c.unicode() >= 0x0020 && c.unicode() <= 0xD7FF)
>         || c.unicode() == 0x0009
>         || c.unicode() == 0x000A
>         || c.unicode() == 0x000D
>         || (c.unicode() >= 0xE000 && c.unicode() <= 0xFFFD);
> }
> Isn't this code missing the check
> c >= 0x10000 && c <= QChar::LastValidCodePoint ?

No.

It's limited by the size of QChar. It cannot contain 0x10000.

No, the entire API is flawed. It should work in terms of UCS-4, not of QChar.
The code calling this API needs to do the surrogate decoding. This class may 
be interesting for them:

https://codereview.qt-project.org/669
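
To sketch what "the caller does the surrogate decoding" could look like,
independent of the class linked above: the following is plain standard
C++ with std::u16string standing in for QString, and both function names
(isXmlChar, isValidXmlText) are hypothetical, not Qt API. The check works
on full code points, so the range above 0xFFFF becomes expressible.

```cpp
#include <cstddef>
#include <string>

// XML 1.0 production [2] Char, expressed on full code points.
inline bool isXmlChar(char32_t c)
{
    return c == 0x9 || c == 0xA || c == 0xD
        || (c >= 0x20 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD)
        || (c >= 0x10000 && c <= 0x10FFFF);
}

// Walk a UTF-16 string, decoding surrogate pairs before validating.
// Returns false on an unpaired surrogate or any code point outside Char.
inline bool isValidXmlText(const std::u16string &text)
{
    for (std::size_t i = 0; i < text.size(); ++i) {
        char32_t c = text[i];
        if (c >= 0xD800 && c <= 0xDBFF) {       // high surrogate
            if (i + 1 >= text.size())
                return false;                   // truncated pair
            const char32_t low = text[i + 1];
            if (low < 0xDC00 || low > 0xDFFF)
                return false;                   // not followed by a low surrogate
            c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
            ++i;                                // consume the low surrogate
        }
        // A lone low surrogate falls through and is rejected by isXmlChar.
        if (!isXmlChar(c))
            return false;
    }
    return true;
}
```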

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center




Re: [Development] [Question] Implementation of XML character validation

2013-09-08 Thread Kurt Pattyn

On 08 Sep 2013, at 20:01, development-requ...@qt-project.org wrote:

> From: Konstantin Ritt 
> Subject: Re: [Development] [Question] Implementation of XML character 
> validation
> Date: 8 Sep 2013 20:00:45 GMT+02:00
> To: "development@qt-project.org" 
> 
> 
> [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | 
> [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate 
> blocks, FFFE, and FFFF. */
> in XML 1.0 is quite the same as
> [2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
> /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
> [2a] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | 
> [#x86-#x9F]
> in XML 1.1, except for [U+007F..U+0084] and [U+0086..U+009F], which are 
> prohibited now.

Is it prohibited or just "highly discouraged"? Whereas in XML 1.0 characters 
below 0x0020 were simply not allowed (except for 0x0009, 0x000A and 0x000D), in 
XML 1.1 they are called "highly discouraged" (it seems the constraints have been 
loosened a bit). How should "highly discouraged" be interpreted: should Qt 
allow them, or mark them as prohibited?

> 
> The code looks correct for XML 1.0; however, I didn't find any surrogate 
> validation code either in qxml* or in the QUtfCodec-s. I'll probably write 
> some additional tests once I have time for that.

Correct, I didn't even notice this.

And what about?

bool QXmlUtils::isChar(const QChar c)
{
    return (c.unicode() >= 0x0020 && c.unicode() <= 0xD7FF)
        || c.unicode() == 0x0009
        || c.unicode() == 0x000A
        || c.unicode() == 0x000D
        || (c.unicode() >= 0xE000 && c.unicode() <= 0xFFFD);
}
Isn't this code missing the check 
c >= 0x10000 && c <= QChar::LastValidCodePoint ?


> You may want to raise a suggestion/feature request via 
> http://bugreports.qt-project.org/ about upgrading XML support in Qt up to 1.1
> 
> 
> Regards,
> Konstantin
> 
Regards,

Kurt


Re: [Development] [Question] Implementation of XML character validation

2013-09-08 Thread Konstantin Ritt
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
| [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate
blocks, FFFE, and FFFF. */
in XML 1.0 is quite the same as
[2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any
Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
[2a] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] |
[#x86-#x9F]
in XML 1.1, except for [U+007F..U+0084] and [U+0086..U+009F], which are
prohibited now.

The code looks correct for XML 1.0; however, I didn't find any surrogate
validation code either in qxml* or in the QUtfCodec-s. I'll probably
write some additional tests once I have time for that.

You may want to raise a suggestion/feature request via
http://bugreports.qt-project.org/ about upgrading XML support in Qt up to
1.1


Regards,
Konstantin


2013/9/8 Kurt Pattyn 

> All XML validation in Qt is based on XML 1.0 (and not the newer 1.1
> standard).
> I found at least 3 places where validity is checked:
>
> *1. in qxmlstream.cpp:*
>
> Method resolveCharRef:
>
>
>   //checks for validity
>
>   ok &= (s == 0x9 || s == 0xa || s == 0xd || (s >= 0x20 && s <= 0xd7ff)
>       || (s >= 0xe000 && s <= 0xfffd)
>       || (s >= 0x10000 && s <= QChar::LastValidCodePoint));
>
>
> Method scanUntil:
>
>   //checks for invalidity
>
>   if (c < 0x20 || (c > 0xFFFD && c < 0x10000)
>       || c > QChar::LastValidCodePoint )
>
>
>
> 2. *In qxmlutils.cpp*:
>
> bool QXmlUtils::isChar(const QChar c)
> {
>     return (c.unicode() >= 0x0020 && c.unicode() <= 0xD7FF)
>         || c.unicode() == 0x0009
>         || c.unicode() == 0x000A
>         || c.unicode() == 0x000D
>         || (c.unicode() >= 0xE000 && c.unicode() <= 0xFFFD);
> }
>
>
> It is pretty much the same as the above checks, except that it doesn't
> check for characters in the range 0x10000 - 0x10FFFF.
> I think this is a bug, especially because the source is referring to the
> standard at http://www.w3.org/TR/REC-xml/#NT-Char, which says:
>
>
> [2]   Char   ::=  #x9 | #xA | #xD | [#x20-#xD7FF] | 
> [#xE000-#xFFFD] | *[#x10000-#x10FFFF]* /* any Unicode character, 
> excluding the surrogate blocks, FFFE, and FFFF. */
>
>
>
> 
> Now, I have three questions:
>
> 1. Can someone confirm if the check in QXmlUtils is actually a bug?
> 2. Wouldn't it be better to move these checks to QChar, so that at least
> there is only one implementation?
> 3. Is there a reason to stick to XML1.0, or should Qt also implement the
> XML1.1 standard?
> According to the XML 1.1 standard (http://www.w3.org/TR/xml11/#charsets),
> allowed characters are:
>
>
> [2]   Char ::=   [#x1-#xD7FF] | [#xE000-#xFFFD] | 
> [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate 
> blocks, FFFE, and FFFF. */
>
> [2a]  RestrictedChar   ::=   [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] 
> | [#x86-#x9F]
>
>
> So the allowed character range is a little bit extended (now includes all
> characters between 0x0001 and 0x0020). In addition, XML1.1 has defined some
> characters to be highly discouraged, but still valid.
>
>
>


[Development] [Question] Implementation of XML character validation

2013-09-08 Thread Kurt Pattyn
All XML validation in Qt is based on XML 1.0 (and not the newer 1.1 standard).
I found at least 3 places where validity is checked:

1. in qxmlstream.cpp:

Method resolveCharRef:

//checks for validity
ok &= (s == 0x9 || s == 0xa || s == 0xd || (s >= 0x20 && s <= 0xd7ff)
    || (s >= 0xe000 && s <= 0xfffd)
    || (s >= 0x10000 && s <= QChar::LastValidCodePoint));


Method scanUntil:

//checks for invalidity
if (c < 0x20 || (c > 0xFFFD && c < 0x10000)
    || c > QChar::LastValidCodePoint )


2. In qxmlutils.cpp:

bool QXmlUtils::isChar(const QChar c)
{
    return (c.unicode() >= 0x0020 && c.unicode() <= 0xD7FF)
        || c.unicode() == 0x0009
        || c.unicode() == 0x000A
        || c.unicode() == 0x000D
        || (c.unicode() >= 0xE000 && c.unicode() <= 0xFFFD);
}

It is pretty much the same as the above checks, except that it doesn't check 
for characters in the range 0x10000 - 0x10FFFF.
I think this is a bug, especially because the source is referring to the 
standard at http://www.w3.org/TR/REC-xml/#NT-Char, which says:

[2]   Char ::=  #x9 | #xA | #xD | [#x20-#xD7FF] | 
[#xE000-#xFFFD] | [#x10000-#x10FFFF]  /* any Unicode character, excluding the 
surrogate blocks, FFFE, and FFFF. */



Now, I have three questions:

1. Can someone confirm if the check in QXmlUtils is actually a bug?
2. Wouldn't it be better to move these checks to QChar, so that at least there 
is only one implementation?
3. Is there a reason to stick to XML1.0, or should Qt also implement the XML1.1 
standard?
According to the XML 1.1 standard (http://www.w3.org/TR/xml11/#charsets), 
allowed characters are:

[2]   Char ::=   [#x1-#xD7FF] | [#xE000-#xFFFD] | 
[#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate 
blocks, FFFE, and FFFF. */
[2a]  RestrictedChar   ::=   [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | 
[#x86-#x9F]

So the allowed character range is a little bit extended (now includes all 
characters between 0x0001 and 0x0020). In addition, XML1.1 has defined some 
characters to be highly discouraged, but still valid.
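
Transcribed directly into code, the two XML 1.1 productions quoted above
would look roughly like this. It is only a sketch in plain standard C++
working on full code points (char32_t); the function names are
hypothetical and are not existing Qt API.

```cpp
// XML 1.1 production [2] Char, on full code points.
inline bool isXml11Char(char32_t c)
{
    return (c >= 0x1 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD)
        || (c >= 0x10000 && c <= 0x10FFFF);
}

// XML 1.1 production [2a] RestrictedChar: a subset of Char that
// excludes #x9, #xA, #xD and #x85.
inline bool isXml11RestrictedChar(char32_t c)
{
    return (c >= 0x1 && c <= 0x8)
        || (c >= 0xB && c <= 0xC)
        || (c >= 0xE && c <= 0x1F)
        || (c >= 0x7F && c <= 0x84)
        || (c >= 0x86 && c <= 0x9F);
}
```

Keeping the two predicates separate leaves the policy question (reject
restricted characters, or merely flag them) to the caller.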

