Hi all,

   I realized that I replied only to Sungjin and neglected to include the
mailing list, so let me repeat.

Section 3.1 of RFC 3987 defines a mapping from IRIs to URIs in which
non-ASCII characters are converted to UTF-8 and each resulting octet is
percent-encoded.
So I implemented a procedure iri->uri, which percent-encodes a UTF-8 string
and passes it to the usual URI constructor in uri-generic.
It is intended to work as follows:

(iri->uri "http://example.com/삼계탕") =>
#(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/
"%EC%82%BC%EA%B3%84%ED%83%95") query=#f fragment=#f)

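For readers who want to check the expected escapes by hand, here is a minimal
sketch of the mapping step in Python. It is not the egg's Scheme code, and the
name iri_to_uri is only an illustration. It assumes the simplest reading of the
mapping: every non-ASCII character is replaced by the percent-encoded octets of
its UTF-8 form, and ASCII characters are passed through untouched.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode the UTF-8 octets of every non-ASCII character.

    ASCII characters (including any existing %XX escapes) are left as-is.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            # One %XX triplet per UTF-8 octet, with uppercase hex digits.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


print(iri_to_uri("http://example.com/삼계탕"))
# prints http://example.com/%EC%82%BC%EA%B3%84%ED%83%95
```

The nine escapes match the path segment shown in the intended result above.
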
However, the uri-generic constructor tries to normalize all URIs by
percent-decoding them, so currently the URL above results in this:

#(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/
"�%82%BC�%B3%84�%83%95") query=#f fragment=#f)


  In other words, some octets of the percent-encoded UTF-8 sequences are
decoded back to raw non-ASCII bytes, which are not printable on their own.
So a better solution might indeed be to change iri->uri to pass the
percent-encoded sequences directly to make-uri without attempts at
percent-decoding normalization.
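
For comparison, RFC 3986 (section 6.2.2.2) only calls for decoding escapes
that stand for unreserved characters (ASCII letters, digits, "-", ".", "_"
and "~"); every other escape stays encoded. The Python sketch below shows a
normalization step restricted in that way. It is an illustration under that
assumption, not a description of what uri-generic does, and the name
normalize_escapes is made up for the example.

```python
import re

# Unreserved characters per RFC 3986 section 2.3 (ASCII only).
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_escapes(s: str) -> str:
    """Decode %XX only when it encodes an unreserved ASCII character.

    All other escapes are kept, with their hex digits uppercased.
    """
    def fix(match):
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            return ch
        return "%" + match.group(1).upper()

    return _ESCAPE.sub(fix, s)
```

With this rule, "%7Euser" becomes "~user", while an escape such as "%ec" is
only uppercased to "%EC" and the UTF-8 sequence it belongs to stays intact.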

  Sungjin's modification to the definition of 'unstructured' is in line
with the IRI RFC (except of course we will need to add all other character
sets besides Hangul).
However, Peter and Alex already pointed out that URIs containing
native UTF-8 sequences might result in invalid URLs being sent to systems
that do not understand IRIs or UTF-8.

I will modify iri->uri to avoid normalization and see if this would produce
ok results.

  Ivan

On Tue, Jan 15, 2013 at 12:20 PM, Alex Shinn <alexsh...@gmail.com> wrote:

> =삼계탕&start=0&rows=10<http://127.0.0.1:8983/solr/select?q=%EC%82%BC%EA%B3%84%ED%83%95&start=0&rows=10>