Dear David,

NaviServer is less strict than the W3C document, since it does not automatically send an error back. Such invalid characters can show up during decode operations of ns_urldecode and ns_getform, so a custom application can catch the exception and try alternative encodings if necessary.

Since there is currently a large refactoring of Unicode handling going on in the Tcl community (with potentially different handling in Tcl 8.6, 8.7 and 9.0 ... hopefully there will be full Unicode support already in Tcl 8.7; the voting is happening right now), it is not a good idea for NaviServer to come up with special handling. These byte sequences
have to be processed sooner or later by Tcl in its various versions...

I do not think it is a good idea to swallow incorrect incoming data by transforming it on the fly; this will sooner or later cause user concerns (e.g., "why is this funny character in the user name?", ...). When a legacy application sends e.g. ISO-8859-1 encoded data, it should set the appropriate charset, and the data will be properly converted by NaviServer.

If for whatever reason it is not feasible to obtain a proper charset, the NaviServer approach allows a second attempt at decoding the data with a different charset.
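Sketched in Python for illustration only (NaviServer's Tcl API differs, and the fallback charset below is an assumption), such a two-step decode looks like:

```python
def decode_form_value(raw: bytes) -> str:
    """Try strict UTF-8 first; on failure, fall back to a legacy
    charset (iso-8859-1 is just an example choice here)."""
    try:
        return raw.decode("utf-8")        # strict: raises on invalid bytes
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")   # every single byte is valid here

print(decode_form_value(b"a\xc5\x93b"))   # valid UTF-8
print(decode_form_value(b"a\xe6b"))       # falls back to iso-8859-1
```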

all the best

-gn

On 12.05.22 11:05, David Osborne wrote:

Thanks again Gustaf,

I can see the W3C spec you reference seems quite unequivocal in saying an error message should be sent back when decoding invalid UTF-8 form data.

But I was curious why other implementations appear to use the UTF-8 replacement character (U+FFFD) instead, and found a bit of discussion in the Unicode standard itself [1] & [2].

[1] specifically refers to the WHATWG (W3C) spec for encoding/decoding [3], which defines an "error" condition when decoding UTF-8 as being one of two possible error modes:

  * fatal - "return the error"
  * replacement - "Push U+FFFD (�) to output."

This aligns with the behaviour of, say, Python's bytes.decode() where the default is to raise an error for encoding errors ("strict" error handling), but optionally, you can specify "replace" error handling which will utilise the U+FFFD character instead. I can see this working in cases where we're told the data should be UTF-8, or where we're assuming by default it's UTF-8.
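To make the two error modes concrete with Python's bytes.decode() [4] (a lone 0xE6 byte is used as the invalid input, matching the examples elsewhere in this thread):

```python
data = b"a\xe6b"  # 0xE6 starts a 3-byte UTF-8 sequence, but none follows

# Default "strict" error handling raises, matching the "fatal" mode:
try:
    data.decode("utf-8")
except UnicodeDecodeError as err:
    print("strict:", err.reason)

# "replace" substitutes U+FFFD per maximal invalid subpart ("replacement" mode):
print("replace:", data.decode("utf-8", errors="replace"))  # 'a\ufffdb'
```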

But I'm not sure how much work this would be to implement, and whether others would see it as worthwhile?

As it stands, we have legacy applications which POST data to us and regularly (and, by now, expectedly) send invalid characters despite best efforts to fix them. I guess we could redirect the POSTs to another non-NaviServer system, sanitise the data there, then send it on to NaviServer, but it would be nice to be able to deal with it within NaviServer itself.

[1] https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf (Section 3.9 "U+FFFD Substitution of Maximal Subparts")
[2] https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf (Section 5.22 "U+FFFD Substitution in Conversion")
[3] https://encoding.spec.whatwg.org/#decoder
[4] https://docs.python.org/3/library/stdtypes.html#bytes.decode


On Mon, 2 May 2022 at 13:30, Gustaf Neumann <neum...@wu.ac.at> wrote:

    Dear David and all,

    I looked into this issue, and I do not like the current situation
    either.
    In the current snapshot, a GET request with invalidly encoded
    query variables is rejected, while a POST request leads just
    to a warning, and the invalid entry is omitted.

    W3C [1] says in the reference for Multilingual form encoding:
    > If non-UTF-8 data is received, an error message should be sent back.

    This means that the only defensible logic is to reject the request
    as invalid in both cases. One can certainly send single-byte funny
    character data in URLs, which is invalid UTF-8 (e.g. "%9C" or "%E6"),
    but for these requests, the charset has to be specified, either
    via the content type, or via the default URL encoding in the NaviServer
    configuration... see example (2) below.
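    (A byte-level view of why "%E6" needs an explicit charset, shown in
    Python purely for illustration; the percent-decoding mirrors what a
    server does with the query string:)

```python
from urllib.parse import unquote_to_bytes

raw = unquote_to_bytes("a%E6b")        # -> b'a\xe6b'

# 0xE6 opens a 3-byte UTF-8 sequence, but no continuation bytes follow:
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")

# With the charset declared as iso-8859-1, the very same byte is fine:
print(raw.decode("iso-8859-1"))
```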

    As mentioned earlier, there are increasingly many attacks with invalid
    UTF-8 data (also from vulnerability scanners), so we have to be strict here.

    I will try to address the outstanding issues ASAP and then provide
    another RC.

    All the best

    -gn

    [1] https://www.w3.org/International/questions/qa-forms-utf-8


      # POST request with already encoded form data (x-www-form-urlencoded)
      $ curl -X POST -d "p1=a%C5%93Cb&p2=a%E6b" localhost:8100/upload.tcl

      # POST request with already encoded form data, but proper charset
      $ curl -X POST -H "Content-Type: application/x-www-form-urlencoded; charset=iso-8859-1" -d "p2=a%E6b" localhost:8100/upload.tcl

      # POST + x-www-form-urlencoded, but let curl do the encoding
      $ curl -X POST -d "p1=aœb" -d $(echo -e 'p2=a\xE6b') localhost:8100/upload.tcl

      # POST + multipart/form-data, let curl do the encoding
      $ curl -X POST -F "p1=aœb" -F $(echo -e 'p2=a\xE6b') localhost:8100/upload.tcl

      # GET request with already encoded query data
      $ curl -X GET  "localhost:8100/upload.tcl?p1=a%C5%93Cb&p2=a%E6b"




_______________________________________________
naviserver-devel mailing list
naviserver-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/naviserver-devel

--
Univ.Prof. Dr. Gustaf Neumann
Head of the Institute of Information Systems and New Media
of Vienna University of Economics and Business
Program Director of MSc "Information Systems"