Dear David,
NaviServer is less strict than the W3C document, since it does not
automatically send an error back.
Such invalid characters can show up during the decode operations of
ns_urldecode and ns_getform, so a custom application can catch the
exceptions and try alternative encodings if necessary.
Since a large refactoring of Unicode handling is currently underway in
the Tcl community (with potentially different handling in Tcl 8.6, 8.7
and 9.0, ... hopefully there will be full support for Unicode already in
Tcl 8.7; the voting is happening right now), it is not a good idea for
NaviServer to come up with special handling. These byte sequences
have to be processed sooner or later by Tcl in its various versions...
I do not think it is a good idea to swallow incorrect incoming data by
transforming it on the fly; this will sooner or later cause user
concerns (e.g. "why is this funny character in the user name?", ...).
When a legacy application sends e.g. iso8859-encoded data, it should
set the appropriate charset, and the data will be properly converted by
NaviServer.
If for whatever reason it is not feasible to get a proper charset, the
NaviServer approach allows a second attempt at decoding the data with a
different charset.
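As a sketch of that fallback idea (in Python rather than Tcl, purely
for illustration; this is not NaviServer code): decode strictly as
UTF-8 first, and retry with a legacy charset only when that fails.

```python
def decode_with_fallback(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode form bytes as UTF-8; fall back to a legacy charset."""
    try:
        # strict decoding raises UnicodeDecodeError on invalid UTF-8
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # every byte value is valid in iso-8859-1, so this cannot fail
        return raw.decode(fallback)

print(decode_with_fallback(b"a\xc5\x93b"))  # valid UTF-8: aœb
print(decode_with_fallback(b"a\xe6b"))      # invalid UTF-8, Latin-1: aæb
```

The equivalent in a Tcl/NaviServer application would be a catch around
the decode operation followed by a second decode with the other charset.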
all the best
-gn
On 12.05.22 11:05, David Osborne wrote:
Thanks again Gustaf,
I can see the W3C spec you reference seems quite unequivocal in saying
an error message should be sent back when decoding invalid UTF-8 form
data.
But I was curious why other implementations appear to use the UTF-8
replacement character (U+FFFD) instead, and found a bit of discussion
in the unicode standard itself [1] & [2].
[1] specifically refers to the WHATWG(W3C) spec for encoding/decoding
[3], which defines an "error" condition when decoding UTF-8 with one of
two possible error modes:
* fatal - "return the error"
* replacement - "Push U+FFFD (�) to output."
This aligns with the behaviour of, say, Python's bytes.decode(), where
the default is to raise an error for encoding errors ("strict" error
handling), but you can optionally specify "replace" error handling,
which substitutes the U+FFFD character instead. I can see this working
in cases where we're told the data should be UTF-8, or where we're
assuming by default that it's UTF-8.
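For reference, the two WHATWG error modes map directly onto the errors
argument of Python's bytes.decode():

```python
raw = b"a\xe6b"  # 0xE6 is not a valid UTF-8 sequence here

# fatal mode: "strict" is the default and raises an exception
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("strict:", err.reason)

# replacement mode: substitute U+FFFD for the offending byte
print("replace:", raw.decode("utf-8", errors="replace"))  # a�b
```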
But I'm not sure how much work this would be to implement, and whether
it would be seen as worthwhile by others?
As it stands, we have legacy applications which POST data to us and
regularly (and, by now, expectedly) send invalid characters despite
best efforts to fix them.
I guess we could redirect the POSTs to another non-NaviServer system,
sanitise the data there, then send it on to NaviServer, but it would
be nice to be able to deal with it within NaviServer itself.
[1] https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf (Section
3.9 "U+FFFD Substitution of Maximal Subparts")
[2] https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf (Section
5.22 "U+FFFD Substitution in Conversion")
[3] https://encoding.spec.whatwg.org/#decoder
[4] https://docs.python.org/3/library/stdtypes.html#bytes.decode
On Mon, 2 May 2022 at 13:30, Gustaf Neumann <neum...@wu.ac.at> wrote:
Dear David and all,
I looked into this issue, and I do not like the current situation
either.
In the current snapshot, a GET request with invalidly encoded query
variables is rejected, while a POST request just leads to a warning,
and the invalid entry is omitted.
W3C [1] says in the reference for Multilingual form encoding:
> If non-UTF-8 data is received, an error message should be sent back.
This means that the only defensible logic is to reject the request as
invalid in both cases. One can certainly send single-byte funny-character
data in URLs which is invalid UTF-8 (e.g. "%9C" or "%E6" etc.),
but for these requests the charset has to be specified, either
via the content type or via the default URL encoding in the NaviServer
configuration... see example (2) below.
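To illustrate the difference the declared charset makes (using Python's
urllib.parse.unquote here as a stand-in for URL decoding in general,
not NaviServer's actual implementation):

```python
from urllib.parse import unquote

# "%C5%93" is a valid UTF-8 sequence (the two bytes of œ)
print(unquote("a%C5%93b"))                      # aœb

# "%E6" alone is invalid UTF-8; unquote() substitutes U+FFFD by default
print(unquote("a%E6b"))                         # a�b

# with the charset declared, the same byte decodes cleanly
print(unquote("a%E6b", encoding="iso-8859-1"))  # aæb
```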
As mentioned earlier, there are increasingly many attacks with invalid
UTF-8 data (also by vulnerability scanners), so we need to be strict here.
I will try to address the outstanding issues ASAP and then provide
another RC.
All the best
-gn
[1] https://www.w3.org/International/questions/qa-forms-utf-8
# POST request with already encoded form data (x-www-form-urlencoded)
$ curl -X POST -d "p1=a%C5%93Cb&p2=a%E6b" localhost:8100/upload.tcl
# POST request with already encoded form data, but proper encoding
$ curl -X POST -H "Content-Type: application/x-www-form-urlencoded;
charset=iso-8859-1" -d "p2=a%E6b" localhost:8100/upload.tcl
# POST + x-www-form-urlencoded, but let curl do the encoding
$ curl -X POST -d "p1=aœb" -d $(echo -e 'p2=a\xE6b')
localhost:8100/upload.tcl
# POST + multipart/form-data, let curl do the encoding
$ curl -X POST -F "p1=aœb" -F $(echo -e 'p2=a\xE6b')
localhost:8100/upload.tcl
# GET request with already encoded query data
$ curl -X GET "localhost:8100/upload.tcl?p1=a%C5%93Cb&p2=a%E6b"
_______________________________________________
naviserver-devel mailing list
naviserver-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/naviserver-devel
--
Univ.Prof. Dr. Gustaf Neumann
Head of the Institute of Information Systems and New Media
of Vienna University of Economics and Business
Program Director of MSc "Information Systems"