Hello,

We've been looking into validating UTF-8 to prevent a few issues we've
been encountering lately. I noted that UTF-8 validation was introduced in
5.0.0 and have been investigating ns_valid_utf8 as well as the updated
ns_getform, as they appear to be very helpful to us!

I put together a number of test cases of invalid UTF-8 and verified that
these were invalid sequences using iconv. I repeated these tests with
ns_valid_utf8 and it disagreed with iconv in 6 cases.
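
A minimal sketch of this kind of iconv check, assuming an iconv that exits
non-zero when it meets an invalid input sequence (as glibc's does). The
helper name is_valid_utf8 is hypothetical and this is not necessarily the
exact invocation used for the tests in the pull request:

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) only if stdin converts cleanly
# from UTF-8 to UTF-8, i.e. iconv found no invalid sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# \200 is the octal escape for the byte 0x80, a lone continuation byte.
# Octal is used because not every printf implementation understands \x80.
if printf 'test\200test' | is_valid_utf8; then
    echo "valid"
else
    echo "invalid"
fi
```

Each test sequence can be piped through the helper in the same way and the
result compared with what ns_valid_utf8 returns for the same bytes.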

To demonstrate these cases, I have added comprehensive unit tests and noted
my findings in a pull request:

https://github.com/naviserver-project/naviserver/pull/14/files

Furthermore, I cannot get ns_getform to throw the NS_INVALID_UTF8 error
when the form data contains invalid UTF-8.

I have tried many different combinations of configurations and arguments.
It is my understanding that if there is no fallbackcharset specified, and
the config param formfallbackcharset has not been set, then an error should
be thrown when the form data contains invalid UTF-8.

I have used the nsd-config.tcl provided in the GitHub repo on a fresh
install. The encoding params under ns/parameters have been set up so that
only URLCharset is specified. OutputCharset and formfallbackcharset are
both commented out.

# Encoding settings
#
# ns_param  OutputCharset   utf-8
ns_param    URLCharset      utf-8
# ns_param formfallbackcharset iso8859-1


A simple proc will get the form data and return it to the client after
logging the result of ns_valid_utf8:

proc /test {} {
    set set_id [ns_getform]
    set test [ns_set get $set_id test]
    ns_log Notice "Valid UTF-8: [ns_valid_utf8 $test]"
    ns_return 200 text/html "<p>Received: $test</p>"
}

ns_register_proc POST /test /test
ns_register_proc GET /test /test


POST and GET requests with invalid UTF-8 sequences in the form data are made
using curl:

curl -X POST "http://127.0.0.1:8080/test" \
     -H "Content-Type: application/x-www-form-urlencoded" \
     --data-binary "$(printf 'test=test\x80test')" --output -
<p>Received: testtest</p>

curl "http://127.0.0.1:8080/test?test=$(printf 'test\x80test')" --output -
<p>Received: testtest</p>
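
A variant of the GET test that percent-encodes the bytes, so that the
request line itself stays 7-bit, could be sketched as below. The helper
percent_encode is hypothetical (it is not part of curl or NaviServer), and
how ns_getform treats the decoded %80 is exactly the open question here, so
no expected server response is given:

```shell
#!/bin/sh
# Hypothetical helper: percent-encode every byte read from stdin.
percent_encode() {
    # od prints each byte as two hex digits; strip the whitespace,
    # then prefix every pair of digits with '%'.
    od -An -v -tx1 | tr -d ' \t\n' | sed 's/../%&/g'
}

# \200 is the octal escape for the byte 0x80 (portable across printf
# implementations).
encoded=$(printf 'test\200test' | percent_encode)
echo "http://127.0.0.1:8080/test?test=${encoded}"
```

The printed URL can then be passed to curl in place of the raw one, with
--output - as before.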


No errors have been thrown. In both cases, ns_valid_utf8 reports that the
value of the form variable "test" is invalid. On the GET request, the
warning logged by Ns_ParseRequest also appears to acknowledge the invalid
sequence in the query string:

POST request:
[14/Aug/2025:11:56:33][2408.7f1b52ffd6c0][-conn:default:default:4:4-] Notice: Valid UTF-8: 0

GET request:
[14/Aug/2025:11:58:03][2408.7f1b52ffd6c0][-driver:http:0-] Warning: Ns_ParseRequest: line <GET /test?test=test\x80test HTTP/1.1> contains 8-bit character data. Future versions might reject it.
[14/Aug/2025:11:58:03][2408.7f1b52ffd6c0][-conn:default:default:4:5-] Notice: Valid UTF-8: 0


These tests have been repeated with different configurations and arguments.
For example, passing the encoding and/or the fallbackcharset to ns_getform:

ns_getform utf-8
ns_getform -fallbackcharset ""
ns_getform -fallbackcharset "" utf-8
ns_getform -fallbackcharset "utf-8" utf-8

Configuration changes included uncommenting the formfallbackcharset param
and setting it to the empty string or to "utf-8".

Is this a misunderstanding on my part of how ns_getform should work? If
there is any more information that I can provide then please let me know.

Kind regards,

Nicky

--
Qcode Software
*Nicky Johnstone | Software Engineer*
*Email:* [email protected] | *Phone:* 01463 896 487
www.qcode.co.uk