Dear David,

NaviServer is less strict than the W3C document, since it does not automatically send an error back. Such invalid characters can show up during decode operations of ns_urldecode and ns_getform, so a custom application can catch the exception and try alternative encodings if necessary.

Since there is currently a large refactoring of Unicode handling going on in the Tcl community (with potentially different handling in Tcl 8.6, 8.7 and 9.0 ... hopefully there will be full Unicode support already in Tcl 8.7; the voting is happening right now), it is not a good idea for NaviServer to come up with special handling. These byte sequences
have to be processed sooner or later by Tcl in its various versions...

I do not think it is a good idea to swallow incorrect incoming data by transforming it on the fly; this will sooner or later cause user concerns (e.g., "why is this funny character in the user name?", ...). When a legacy application sends e.g. ISO-8859-1 encoded data, it should set the appropriate charset, and the data will be properly converted by NaviServer.

If for whatever reason it is not feasible to obtain a proper charset, the NaviServer approach allows a second attempt at decoding the data with a different charset.
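Sketched in Python for illustration only (NaviServer's Tcl API differs, and the fallback charset below is an assumption), such a two-step decode looks like:

```python
def decode_form_value(raw: bytes) -> str:
    """Try strict UTF-8 first; on failure, fall back to a legacy
    charset (iso-8859-1 is just an example choice here)."""
    try:
        return raw.decode("utf-8")        # strict: raises on invalid bytes
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")   # every single byte is valid here

print(decode_form_value(b"a\xc5\x93b"))   # valid UTF-8
print(decode_form_value(b"a\xe6b"))       # falls back to iso-8859-1
```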

all the best

-gn

On 12.05.22 11:05, David Osborne wrote:

Thanks again Gustaf,

I can see the W3C spec you reference seems quite unequivocal in saying an error message should be sent back when decoding invalid UTF-8 form data.

But I was curious why other implementations appear to use the UTF-8 replacement character (U+FFFD) instead, and found a bit of discussion in the Unicode standard itself [1] & [2].

[1] specifically refers to the WHATWG (W3C) spec for encoding/decoding [3], which defines an "error" condition when decoding UTF-8 as being one of two possible error modes:

  * fatal - "return the error"
  * replacement - "Push U+FFFD (�) to output."

This aligns with the behaviour of, say, Python's bytes.decode() where the default is to raise an error for encoding errors ("strict" error handling), but optionally, you can specify "replace" error handling which will utilise the U+FFFD character instead. I can see this working in cases where we're told the data should be UTF-8, or where we're assuming by default it's UTF-8.
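To make the two error modes concrete with Python's bytes.decode() [4] (a lone 0xE6 byte is used as the invalid input, matching the examples elsewhere in this thread):

```python
data = b"a\xe6b"  # 0xE6 starts a 3-byte UTF-8 sequence, but none follows

# Default "strict" error handling raises, matching the "fatal" mode:
try:
    data.decode("utf-8")
except UnicodeDecodeError as err:
    print("strict:", err.reason)

# "replace" substitutes U+FFFD per maximal invalid subpart ("replacement" mode):
print("replace:", data.decode("utf-8", errors="replace"))  # 'a\ufffdb'
```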

But I'm not sure how much work this would be to implement, and whether others would see it as worthwhile?

As it stands, we have legacy applications which POST data to us and regularly (and, by now, expectedly) send invalid characters despite best efforts to fix them. I guess we could redirect the POSTs to another non-NaviServer system, sanitise the data there, then send it on to NaviServer, but it would be nice to be able to deal with it within NaviServer itself.

[1] https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf (Section 3.9 "U+FFFD Substitution of Maximal Subparts")
[2] https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf (Section 5.22 "U+FFFD Substitution in Conversion")
[3] https://encoding.spec.whatwg.org/#decoder
[4] https://docs.python.org/3/library/stdtypes.html#bytes.decode


On Mon, 2 May 2022 at 13:30, Gustaf Neumann <neum...@wu.ac.at> wrote:

    Dear David and all,

    I looked into this issue, and I do not like the current situation
    either.
    In the current snapshot, a GET request with invalidly encoded
    query variables is rejected, while a POST request leads just
    to a warning, and the invalid entry is omitted.

    W3C [1] says in the reference for Multilingual form encoding:
    > If non-UTF-8 data is received, an error message should be sent back.

    This means that the only defensible logic is to reject the request
    as invalid in both cases. One can certainly send single-byte funny
    character data in URLs, which is invalid UTF-8 (e.g. "%9C" or "%E6"),
    but for these requests, the charset has to be specified, either
    via the content type, or via the default URL encoding in the NaviServer
    configuration... see example (2) below.
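    (A byte-level view of why "%E6" needs an explicit charset, shown in
    Python purely for illustration; the percent-decoding mirrors what a
    server does with the query string:)

```python
from urllib.parse import unquote_to_bytes

raw = unquote_to_bytes("a%E6b")        # -> b'a\xe6b'

# 0xE6 opens a 3-byte UTF-8 sequence, but no continuation bytes follow:
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")

# With the charset declared as iso-8859-1, the very same byte is fine:
print(raw.decode("iso-8859-1"))
```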

    As mentioned earlier, there are increasingly many attacks with invalid
    UTF-8 data (also from vulnerability scanners), so we have to be strict here.

    I will try to address the outstanding issues ASAP and then provide
    another RC.

    All the best

    -gn

    [1] https://www.w3.org/International/questions/qa-forms-utf-8


      # POST request with already encoded form data (x-www-form-urlencoded)
      $ curl -X POST -d "p1=a%C5%93Cb&p2=a%E6b" localhost:8100/upload.tcl

      # POST request with already encoded form data, but proper charset
      $ curl -X POST -H "Content-Type: application/x-www-form-urlencoded; charset=iso-8859-1" -d "p2=a%E6b" localhost:8100/upload.tcl

      # POST + x-www-form-urlencoded, but let curl do the encoding
      $ curl -X POST -d "p1=aœb" -d $(echo -e 'p2=a\xE6b') localhost:8100/upload.tcl

      # POST + multipart/form-data, let curl do the encoding
      $ curl -X POST -F "p1=aœb" -F $(echo -e 'p2=a\xE6b') localhost:8100/upload.tcl

      # GET request with already encoded query data
      $ curl -X GET  "localhost:8100/upload.tcl?p1=a%C5%93Cb&p2=a%E6b"




_______________________________________________
naviserver-devel mailing list
naviserver-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/naviserver-devel

--
Univ.Prof. Dr. Gustaf Neumann
Head of the Institute of Information Systems and New Media
of Vienna University of Economics and Business
Program Director of MSc "Information Systems"