[Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-13 Thread Sungjin Chun
To test a Lucene-based Solr client, I have to create a URL that contains
UTF-8 encoded (Korean) characters. But uri-common cannot create a URI from
a string with this encoding.

Can anyone help me with this? Thanks.

___
Chicken-users mailing list
Chicken-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-users


Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-13 Thread Peter Bex
On Mon, Jan 14, 2013 at 07:04:05AM +0900, Sungjin Chun wrote:
> For testing solr, lucene based client, I have to create url which contains 
> utf-8 encoding(for Korean). But having this encoding uri-common cannot create 
> uri. 
> 
> Can any one help me on this? Thanks.

Hello Sungjin,

As far as I recall, there's no special facility for IRIs
(internationalized URIs, a separate RFC from 3986) in uri-generic.
uri-generic is the underlying egg which actually handles all the
parsing, uri-common just adds some convenience procedures for
HTTP and URI-encoded forms.  Maybe you can take a look at the
uri-generic library, and verify it really is going wrong there already?

If it doesn't work, some test cases would be appreciated.  We can have
a look at them and see where it's failing.  I'm unsure whether this
really should be supported by the uri-generic egg or whether it would
be better to create an "iri-generic" egg, or some such.

Perhaps Ivan can chime in?  He is the one who originally ported the
code from a Haskell library, and might know whether that library had
any known problems with IRIs (if it's even IRIs we're talking about!)

Cheers,
Peter
-- 
http://sjamaan.ath.cx



Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-13 Thread Sungjin Chun
Though I'm not that fluent in Scheme, I'll try to make a test case for
uri-generic with a UTF-8 string.

Thanks.




Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-13 Thread Sungjin Chun
First, I might be looking in the wrong place, but...

It seems that the main source of my problem is this part of
uri-generic.scm:

(define char-set:uri-unreserved
  (char-set-union char-set:letter+digit (string->char-set "-_.~")))

If I change this part to:

(define char-set:uri-unreserved
  (char-set-union char-set:letter+digit (string->char-set "-_.~")
                  char-set:hangul))

then URIs/URLs with Korean characters work. How can I make this part more
generic?

Thank you in advance, and sorry for my poor English.





Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-13 Thread Alex Shinn
Hi,

On Mon, Jan 14, 2013 at 12:52 PM, Sungjin Chun  wrote:

> First, I might have found wrong place but...
>
> It seems that the main source of the my problem is related to the part of
> uri-generic.scm, especially;
>
> (define char-set:uri-unreserved
>   (char-set-union char-set:letter+digit (string->char-set "-_.~")))
>
> If I change this part as;
>
> (define char-set:uri-unreserved
>   (char-set-union char-set:letter+digit (string->char-set "-_.~")
>                   char-set:hangul))
>
> then, uri/url with korean characters work. How can I set those part more
> generic one?
>

I believe the ASCII definition is correct even for Unicode URLs.
You need to represent the URL in utf8 and then use percent
escapes on the utf8 bytes, which is what would happen naturally
here.
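
To make the suggestion above concrete, here is a minimal sketch in Python
(chosen only for illustration; it is not the uri-common or uri-generic API).
It encodes the string as UTF-8 and percent-escapes every octet outside the
RFC 3986 unreserved set. The helper name pct_encode is made up for this
example.

```python
# Illustration only: percent-encode the UTF-8 octets of a string.
# UNRESERVED holds the byte values of ALPHA / DIGIT / "-" / "." / "_" / "~".
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)

def pct_encode(text: str) -> str:
    """Hypothetical helper: escape every octet that is not unreserved."""
    out = []
    for octet in text.encode("utf-8"):
        if octet in UNRESERVED:
            out.append(chr(octet))            # safe ASCII, keep as is
        else:
            out.append("%{:02X}".format(octet))  # escape the raw octet
    return "".join(out)

print(pct_encode("삼계탕"))  # %EC%82%BC%EA%B3%84%ED%83%95
```

Each Hangul syllable becomes three escaped octets, and plain ASCII letters
and digits pass through unchanged.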

-- 
Alex


Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-13 Thread Sungjin Chun
As far as I know, the revised RFC permits UTF-8 characters in the URL without
encoding. Am I wrong here?
Even Solr (the search engine) permits them.




Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-13 Thread Alex Shinn
On Mon, Jan 14, 2013 at 1:36 PM, Sungjin Chun  wrote:

> As far as I know, revised RFC permits UTF-8 characters in the URL without
> encoding. Am I wrong here?
>

The latest URI RFC is 3986.  The relevant description in prose is:

  Local names, such as file system names, are stored with a local
  character encoding.  URI producing applications (e.g., origin
  servers) will typically use the local encoding as the basis for
  producing meaningful names.  The URI producer will transform the
  local encoding to one that is suitable for a public interface and
  then transform the public interface encoding into the restricted set
  of URI characters (reserved, unreserved, and percent-encodings).
  Those characters are, in turn, encoded as octets to be used as a
  reference within a data format (e.g., a document charset), and such
  data formats are often subsequently encoded for transmission over
  Internet protocols.

The relevant parts of the BNF are:

   pct-encoded   = "%" HEXDIG HEXDIG

   reserved      = gen-delims / sub-delims

   gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"

   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"

   path          = path-abempty    ; begins with "/" or is empty
                 / path-absolute   ; begins with "/" but not "//"
                 / path-noscheme   ; begins with a non-colon segment
                 / path-rootless   ; begins with a segment
                 / path-empty      ; zero characters

   path-abempty  = *( "/" segment )
   path-absolute = "/" [ segment-nz *( "/" segment ) ]
   path-noscheme = segment-nz-nc *( "/" segment )
   path-rootless = segment-nz *( "/" segment )
   path-empty    = 0<pchar>

   segment       = *pchar
   segment-nz    = 1*pchar
   segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                 ; non-zero-length segment without any colon ":"

   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

Thus you can't use raw non-ASCII bytes in a URI - they must
be encoded, and interpretation is up to the origin (and is overwhelmingly
utf8 these days).
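
As a rough check of the grammar quoted above, the pchar rule can be written
as a regular expression. This is only a sketch in Python, not how
uri-generic parses; the name is_valid_segment is hypothetical.

```python
import re

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"   (RFC 3986)
PCHAR = r"(?:[A-Za-z0-9\-._~]|%[0-9A-Fa-f]{2}|[!$&'()*+,;=]|[:@])"

# segment = *pchar
SEGMENT = re.compile(PCHAR + r"*")

def is_valid_segment(segment: str) -> bool:
    """Hypothetical helper: True if the whole string matches `segment`."""
    return SEGMENT.fullmatch(segment) is not None

print(is_valid_segment("select"))     # True
print(is_valid_segment("삼계탕"))      # False: raw non-ASCII characters
print(is_valid_segment("%EC%82%BC"))  # True: the same octets, percent-encoded
```

Nothing in the rule admits a character outside ASCII, which is why the raw
Korean string is rejected while its percent-encoded form is accepted.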

> Even Solr (the search engine) permits them.

It would of course be possible for any tool or webserver to
accept URIs with non-ASCII bytes, but I don't know of any
browsers which would _send_ such a request, because in
general it would be rejected.

I tried searching non-ASCII on whitehouse.gov (which uses
Solr) and indeed it generated a percent-encoded query.  My
browser (Chrome) rendered the percent escapes as utf-8 for
me though.

There's also punycode which can be used to represent Unicode
domain names (which otherwise don't even allow percent escapes).
In some cases certain browsers will render this for you (generally
if the encoded script matches the top-level country name, e.g.
for a .kr domain Hangul would be shown), but it's in general
a dangerous extension because it makes phishing attempts easier.
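
For the curious, the host-name conversion mentioned above can be tried with
Python's built-in "idna" codec, which applies Punycode to each label. This
is an illustration only; the host name 한국.example and the helper name
to_ascii_host are made up for the sketch.

```python
# Illustration only: convert a Unicode host name to its ASCII ("xn--") form
# using Python's standard "idna" codec.
def to_ascii_host(host: str) -> str:
    """Hypothetical helper: return the ASCII-compatible form of a host."""
    return host.encode("idna").decode("ascii")

ascii_host = to_ascii_host("한국.example")
print(ascii_host)                                # an "xn--" label plus ".example"
print(ascii_host.encode("ascii").decode("idna")) # back to the Unicode form
```

Only the non-ASCII label is rewritten; a plain ASCII host name comes back
unchanged.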

-- 
Alex


Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-14 Thread Peter Bex
On Mon, Jan 14, 2013 at 02:42:40PM +0900, Alex Shinn wrote:
> On Mon, Jan 14, 2013 at 1:36 PM, Sungjin Chun  wrote:
> > As far as I know, revised RFC permits UTF-8 characters in the URL without
> > encoding. Am I wrong here?
> 
> Thus you can't use raw non-ASCII bytes in a URI - they must
> be encoded, and interpretation is up to the origin (and is overwhelmingly
> utf8 these days).

Wow, thanks for doing the research!  I was a bit lazy in not doing
that in the first place.  It's not the first time though that people
think something's wrong in uri-generic whereas on closer reading of
the RFC it turns out to be correct :)

There is a very common misconception among programmers that you
only need to encode a URI when the link doesn't work in a browser.
However, this is a source of vulnerabilities and subtle bugs.
Many browsers simply try to cope with broken HTML and even broken
URI strings, apparently.

> It would of course be possible for any tool or webserver to
> accept URIs with non-ASCII bytes, but I don't know of any
> browsers which would _send_ such a request, because in
> general it would be rejected.

We've decided to make uri-generic follow the RFC as closely as
possible.  To our knowledge, this library is the most RFC-compliant
URI library available for *any* language.

Cheers,
Peter
-- 
http://sjamaan.ath.cx



Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-14 Thread .alyn.post.
On Mon, Jan 14, 2013 at 09:18:52AM +0100, Peter Bex wrote:
> We've decided to make uri-generic follow the RFC as closely as
> possible.  To our knowledge, this library is the most RFC-compliant
> URI library available for *any* language.
> 

I worked on an FTP program years ago that operated in an ecosystem
where lots of technically incorrect URLs were pasted around, and
we got a bug report that they weren't working in our client.

To 'fix' it, we had to remove support for correct URLs to handle
this more common use case.  I regret having to do that to this day, so
thank you very much for RFC-compliant parsing.

-Alan
-- 
my personal website: http://c0redump.org/



Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-14 Thread Sungjin Chun
Thank you very much. :-)

My proposed hack (yes, not a solution) just works for me, but I found that
it is simply wrong w.r.t. the RFC.
I'll try your modification and let you know whether it works or not.

Thank you again.


On Mon, Jan 14, 2013 at 5:08 PM, Ivan Raikov wrote:

> Hi Sungjin,
>
> Thanks for trying to use the uri-generic library. As Peter already
> pointed out, uri-generic and uri-common are intended to implement RFC 3986
> (URIs), and so far no effort has been made to support RFC 3987 (IRIs).
> However, the IRI RFC does define a mapping from IRI to URI, where Unicode
> characters in IRIs are converted to percent-encoded UTF-8 sequences. The
> caveat here is that if you try to decode these percent-encoded sequences
> they will likely result in invalid URI characters. I have prototyped a
> procedure iri->uri which attempts to percent-encode all UTF-8 sequences in
> the input string and create a URI. You can see it here:
>
> http://bugs.call-cc.org/browser/release/4/uri-generic/branches/utf8/uri-generic.scm
>
> You can try iri->uri as follows:
>
> (use uri-generic)
> (print (iri->uri "http://example.com/삼계탕"))
> (URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/
> "�%82%BC�%B3%84�%83%95") query=#f fragment=#f)
>
> Note that the URI constructor still tries to percent-decode all
> characters in the path, and in this example this results in unprintable
> characters being displayed. So I will probably need to add a field to the
> URI structure that indicates if UTF-8 sequences are included and avoid
> percent-decoding altogether. Would this be sufficient for your needs?
>
> Your proposed solution to extend the definition of the 'unreserved'
> character set is in line with RFC 3987, but I need to look some more at the
> code and see whether it would be possible to have an API where the user can
> choose whether to use IRIs or URIs. I prefer not to use UTF-8 sequences by
> default, since this might result in a uri-generic based client sending
> invalid URIs to a server. Let me know what the exact requirements of your
> application are, and perhaps we can come up with a simple solution.
>
>   Ivan
>
>
>


Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-14 Thread Alex Shinn
On Tue, Jan 15, 2013 at 7:35 AM, Sungjin Chun  wrote:

> Thank you very much. :-)
>
> My proposed hack(yes, no solution) just works for me but I found that it
> is just wrong w.r.t RFC.
> I'll try your modification and and let you know whether it works or not.
>
> Thank you again.
>
>
> On Mon, Jan 14, 2013 at 5:08 PM, Ivan Raikov wrote:
>
>> Hi Sungjin,
>>
>>Thanks for trying to use the uri-generic library. As Peter already
>> pointed out, uri-generic and uri-common are intended to implement RFC 3986
>> (URIs), and so far no effort has been done to support RFC 3987 (IRIs).
>>
>
Interesting, I wasn't even aware of RFC 3987.  Note that this extension
only applies to new schemes - in particular IRIs cannot be used for HTTP.


>> However, the IRI RFC does define a mapping from IRI to URI, where Unicode
>> characters in IRIs are converted to percent-encoded UTF-8 sequences. The
>> caveat here is that if you try to decode these percent-encoded sequences
>> they will likely result in invalid URI characters. I have prototyped a
>> procedure iri->uri which attempts to percent-encode all UTF-8 sequences in
>> the input string and create a URI. You can see it here:
>>
>
This shouldn't be needed.  Sungjin was using uri-common, which already
percent-encodes UTF-8 sequences, which is what is desired.

Sungjin - going back to your original question, what did you try and
what did it do differently from what you expected?  This should just be
working.

-- 
Alex


Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-14 Thread Sungjin Chun
My intention is to create a search client for Solr (a search server built
on Lucene), to which I should send a request URL like this:

  http://127.0.0.1:8983/solr/select?q=삼계탕&start=0&rows=10

I tried to create this client using the http-client egg and found that
it does not like UTF-8 characters in the URL, and so my journey into
hacking began.

If you have a better approach than this, I would appreciate it :-)




Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-14 Thread Alex Shinn
On Tue, Jan 15, 2013 at 11:50 AM, Sungjin Chun  wrote:

> My intention is to create search client for Solr (search server using
> lucene); where I should send
> request URL like this;
>
>   http://127.0.0.1:8983/solr/select?q=삼계탕&start=0&rows=10
>
> I've tried to create this client using http-client egg and had found that
> it does not like UTF-8 characters
> in the URL, so has my journey to hack started.
>

Ah, I see.  I had been building URIs directly with make-uri,
which accepts non-ASCII characters and encodes correctly
on output:

  (make-uri scheme: "http" host: "127.0.0.1"
 path: '("" "solr" "select") query: '((q . "삼계탕")))
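
The same "build from components, encode on output" idea can be sketched
outside CHICKEN. The following uses Python's standard urllib.parse purely
as an illustration of the approach, not as a statement about how make-uri
is implemented; the function name solr_select_url and its defaults are
assumptions taken from the example URL in this thread.

```python
from urllib.parse import urlencode, urlunsplit

def solr_select_url(host: str, query: str, start: int = 0, rows: int = 10) -> str:
    """Hypothetical helper: build the Solr request URL from its parts."""
    # urlencode percent-encodes the UTF-8 octets of each value.
    params = urlencode({"q": query, "start": start, "rows": rows})
    # (scheme, authority, path, query, fragment)
    return urlunsplit(("http", host, "/solr/select", params, ""))

print(solr_select_url("127.0.0.1:8983", "삼계탕"))
# http://127.0.0.1:8983/solr/select?q=%EC%82%BC%EA%B3%84%ED%83%95&start=0&rows=10
```

Because the query value is kept unescaped until the URL is serialized,
there is never an invalid URI string to parse in the first place.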

If you want to parse a string which is already an invalid URI,
you need a hack.  Treating it as an invalid IRI (invalid because
it doesn't allow http) and converting to a URI would work.

Alternately, the URI parsing procedures (and their usage from
http-client) could take an optional non-strict? parameter to allow
invalid characters.  It might make sense to make this the
default for http-client, since this is what browsers typically do -
allow invalid URIs but percent-encode them on request.
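
A lenient, browser-like pass of this kind might look as follows. This is a
sketch in Python under stated assumptions, not a proposal for the actual
http-client interface: the name escape_non_uri_chars is hypothetical, and
every existing "%" is assumed to start a valid escape.

```python
from urllib.parse import quote

# Characters left untouched: RFC 3986 reserved and unreserved punctuation,
# plus "%" so that escapes already present are not escaped a second time.
# (Letters and digits are always left alone by quote.)
KEEP = ":/?#[]@!$&'()*+,;=-._~%"

def escape_non_uri_chars(url: str) -> str:
    """Hypothetical helper: percent-encode only what a URI cannot contain."""
    return quote(url, safe=KEEP)

raw = "http://127.0.0.1:8983/solr/select?q=삼계탕&start=0&rows=10"
print(escape_non_uri_chars(raw))
# http://127.0.0.1:8983/solr/select?q=%EC%82%BC%EA%B3%84%ED%83%95&start=0&rows=10
```

The trade-off is that such a pass cannot tell a literal "%" from the start
of an escape, which is one reason strict parsing is the safer default.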

-- 
Alex


Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-14 Thread Ivan Raikov
Hi all,

   I realized that I replied only to Sungjin and neglected to include the
mailing list, so let me repeat.

Section 3.1 of RFC 3987 defines a mapping between IRIs and URIs such that
UTF-8 sequences are percent-encoded.
So I implemented a procedure iri->uri, which percent-encodes a UTF-8 string
and passes it to the usual URI constructor in uri-generic.
It is intended to work as follows:

(iri->uri "http://example.com/삼계탕") =>
#(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/
"%EC%82%BC%EA%B3%84%ED%83%95") query=#f fragment=#f)

However, the uri-generic constructor tries to normalize all URIs by percent
decoding them, so currently the URL above results in this:

#(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/
"�%82%BC�%B3%84�%83%95") query=#f fragment=#f)


  In other words, parts of the percent-encoded UTF-8 sequences are decoded
back to raw, unprintable bytes.
So a better solution might indeed be to change iri->uri to pass the
percent-encoded sequences directly to make-uri without attempts at
percent-decoding normalization.
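
One possible reading of the output above (an assumption, not something
confirmed in this thread) is that octets such as EC, EA and ED count as
letters in a Latin-1 character set, so a normalization step that decodes
"unreserved" escapes decodes them, while 82, BC and the other octets stay
escaped. The Python sketch below shows a normalization that only decodes
escapes for ASCII unreserved characters; normalize_escapes is a
hypothetical name, not a uri-generic procedure.

```python
import re

# Only ASCII ALPHA / DIGIT / "-" / "." / "_" / "~" may be decoded.
ASCII_UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")

def normalize_escapes(text: str) -> str:
    """Hypothetical helper: decode escapes of unreserved ASCII only."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        if char in ASCII_UNRESERVED:
            return char                       # e.g. %7E becomes ~
        return "%" + match.group(1).upper()   # keep escaped, upper-case hex
    return ESCAPE.sub(replace, text)

print(normalize_escapes("%7Euser/%41%ec%82%bc"))  # ~user/A%EC%82%BC
```

With the set restricted to ASCII, the octets of a multi-byte UTF-8 sequence
are never decoded individually, so the sequence survives normalization.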

  Sungjin's modification to the definition of 'unreserved' is in line
with the IRI RFC (except of course we will need to add all other character
sets besides Hangul).
However, it was already pointed out by Peter and Alex that URIs containing
native UTF-8 sequences might result in invalid URLs being sent to systems
that do not understand IRIs or UTF-8.

I will modify iri->uri to avoid normalization and see if this would produce
ok results.

  Ivan


Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-14 Thread Ivan Raikov
Hi again,

   I have now extended the utf8 code in uri-generic, so that UTF-8
sequences are percent-encoded as lists of the form '(% h1 h2 [% h3 h4
...]). The percent-decoding routine is not going to decode sequences of
more than one byte, so that now percent-encoding normalization will not
interfere with encoded UTF-8 sequences. I have also renamed the iri->uri
routine to utf8-string->uri. I think now its behavior is compliant with
both RFC 3986 and 3987:

(utf8-string->uri "http://example.com/삼계탕") =>
#(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/
"%EC%82%BC%EA%B3%84%ED%83%95") query=#f fragment=#f)

(uri->string (utf8-string->uri "http://example.com/삼계탕")) =>
"http://example.com/%EC%82%BC%EA%B3%84%ED%83%95"

The code is available here:

http://bugs.call-cc.org/browser/release/4/uri-generic/branches/utf8

Sungjin, can you take a look at this code and see if it works for you?

  Ivan


Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-14 Thread Alex Shinn
On Tue, Jan 15, 2013 at 2:23 PM, Ivan Raikov wrote:

> Hi again,
>
>I have now extended the utf8 code in uri-generic, so that UTF-8
> sequences are percent-encoded as lists of the form '(% h1 h2 [% h3 h4
> ...])). The percent-decoding routine is not going to decode sequences of
> more that one byte, so that now percent encoding normalization will not
> interfere with encoded UTF-8 sequences. I have also renamed the iri->uri
> routine to utf8-string->uri. I think now its behavior is compliant with
> both RFC 3986 and 3987:
>
> (utf8-string->uri "http://example.com/삼계탕") =>
>
> #(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/
> "%EC%82%BC%EA%B3%84%ED%83%95") query=#f fragment=#f)
>

This result looks broken.  As I noted in my previous mail, the URI
representation already handles non-ASCII characters and escapes on output:

$ csi -R uri-common
#;1> (make-uri scheme: "http" host: "127.0.0.1" path: '(/ "삼계탕"))
#
#;2> (uri->string (make-uri scheme: "http" host: "127.0.0.1" path: '(/
"삼계탕")))
"http://127.0.0.1/82%BCB3%8483%95"

If you put percent escapes _inside_ the internal path representation,
you'll get double escaping.
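
The double-escaping hazard is easy to reproduce in any language. The
following Python lines are an illustration with the standard urllib.parse
functions and say nothing about uri-common internals: escaping an already
escaped string turns every "%" into "%25".

```python
from urllib.parse import quote, unquote

once = quote("삼계탕")   # escape the raw text one time
twice = quote(once)      # escape the already escaped text again

print(once)   # %EC%82%BC%EA%B3%84%ED%83%95
print(twice)  # %25EC%2582%25BC%25EA%25B3%2584%25ED%2583%2595

# One round of decoding only gets back to the escaped form, not the text.
print(unquote(twice) == once)  # True
```

This is why it matters whether the internal representation holds raw or
escaped values: the serializer must escape exactly once.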

Parsing is a separate matter, and utf8-string->uri should return
the URI object without error, but with the unescaped values in
the path and query as resulting from the make-uri above.

Unrelated, the actual escaped output looks buggy - it looks like
some characters like the leading "%EC%" are getting dropped.

-- 
Alex


Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-14 Thread Ivan Raikov
Hi Alex,

I understand your point about make-uri, but I want to provide a uri
constructor that takes a UTF-8 input string and maps it in accordance with
RFC 3986 / 3987.
So we still have to perform path and percent-encoding normalization steps
for the ASCII portions of the string. make-uri makes no such attempts at
normalization and so does not strictly follow RFC 3986.
I interpreted Section 3.1 of RFC 3987 to mean that UTF-8 sequences are
encoded by taking each octet and applying percent-encoding to it.

So for the string "пиле" the octets are D0 BF D0 B8 D0 BB D0 B5 and
(utf8-string->uri "http://example.com/пиле") produces

#(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/
"%D0%BF%D0%B8%D0%BB%D0%B5") query=#f fragment=#f)

For the string "삼계탕" the octets are EC 82 BC EA B3 84 ED 83 95 and
(utf8-string->uri "http://example.com/삼계탕") produces

#(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/
"%D0%BF%D0%B8%D0%BB%D0%B5") query=#f fragment=#f)


Can you elaborate what is broken about this? Perhaps I do not understand
UTF-8 and need to apply a bitmask or something to the octets?
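
The octets listed above can be checked independently. In this small Python
sketch (illustration only; utf8_octets is a made-up helper name) the UTF-8
encoder already produces the final octets, so no further bit masking is
applied before percent-encoding them.

```python
def utf8_octets(text: str) -> str:
    """Hypothetical helper: the UTF-8 octets of text as upper-case hex."""
    return " ".join("{:02X}".format(octet) for octet in text.encode("utf-8"))

print(utf8_octets("пиле"))    # D0 BF D0 B8 D0 BB D0 B5
print(utf8_octets("삼계탕"))  # EC 82 BC EA B3 84 ED 83 95
```

Prefixing each octet with "%" gives exactly the percent-encoded forms
discussed in this thread.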

Percent-encoded sequences of more than one octet will not get touched by
pct-decode in the current implementation, so you will not get double
escaping. Percent-encoded sequences of one octet will get decoded if they
fall in the "unreserved" char-set, as per RFC 3986.

  Ivan



> This result looks broken.  As I noted in my previous mail, the URI
> representation
> already handles non-ASCII characters and escapes on output:
>
> $ csi -R uri-common
> #;1> (make-uri scheme: "http" host: "127.0.0.1" path: '(/ "삼계탕"))
> # query=#f fragment=#f>
> #;2> (uri->string (make-uri scheme: "http" host: "127.0.0.1" path: '(/
> "삼계탕")))
> "http://127.0.0.1/82%BCB3%8483%95";
>
> If you put percent escapes _inside_ the internal path representation,
> you'll get double escaping.
>
> Parsing is a separate matter, and utf8-string->uri should return
> the URI object without error, but with the unescaped values in
> the path and query as resulting from the make-uri above.
>
> Unrelated, the actual escaped output looks buggy - it looks like
> some characters like the leading "%EC%" are getting dropped.
>
> --
> Alex
>
> #(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/
"%EC%82%BC%EA%B3%84%ED%83%95") query=#f fragment=#f)


Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-14 Thread Ivan Raikov
Oops, the second example should have been

For the string "삼계탕" the octets are EC 82 BC EA B3 84 ED 83 95 and
(utf8-string->uri "http://example.com/삼계탕") produces

#(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/
"%EC%82%BC%EA%B3%84%ED%83%95") query=#f fragment=#f)

Sorry about the confusion.
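
The octet-by-octet mapping described above can be checked independently of any egg. The following is a minimal sketch in Python, not the utf8-string->uri implementation; the helper name pct_encode_utf8 is made up for illustration.

```python
def pct_encode_utf8(text: str) -> str:
    """Percent-encode every UTF-8 octet of `text` (hypothetical helper).

    Unlike a real URI encoder, this escapes ASCII characters as well,
    so it is only meant for inspecting single path segments.
    """
    return "".join(f"%{octet:02X}" for octet in text.encode("utf-8"))


print(pct_encode_utf8("пиле"))    # %D0%BF%D0%B8%D0%BB%D0%B5
print(pct_encode_utf8("삼계탕"))  # %EC%82%BC%EA%B3%84%ED%83%95
```

Both results agree with the octet lists quoted in this message, so no bitmask is needed: the octets of the UTF-8 encoding are escaped as they are.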

  Ivan






Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-15 Thread Peter Bex
On Tue, Jan 15, 2013 at 03:03:59PM +0900, Ivan Raikov wrote:
> Hi Alex,
> 
> I understand your point about make-uri, but I want to provide a uri
> constructor that takes a UTF-8 input string and maps it in accordance with
> RFC 3986 / 3987.

Personally I think it would make more sense to have iri versions of all
the uri-* procedures (ie, uri-reference <-> iri-reference,
absolute-uri <-> absolute-iri, etc).

This makes for a more regular and simple API and in general feels less
hackish.  Preprocessing sounds a little dangerous and error-prone, but
I can't immediately come up with a case where this will break.

At least we'll need to extend the test-suite considerably if this should
end up in uri-generic, IMHO.  Especially what happens when things are
already percent-encoded, and what happens when percent signs are
followed by UTF-8 characters.  Also things like IRI scheme and path
components need to be checked.

Cheers,
Peter
-- 
http://sjamaan.ath.cx



Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-15 Thread Alex Shinn
On Tue, Jan 15, 2013 at 3:03 PM, Ivan Raikov wrote:

>
> Percent-encoded sequences of more than one octet will not get touched by
> pct-decode in the current implementation, so you will not get double
> escaping. Percent-encoded sequences of one octet will get decoded if they
> fall in the "unstructured" char-set, as per RFC 3986.
>

OK, now I'm thoroughly confused.  The percent-encoding is context sensitive?
How can this not be broken?

We need to make the design clear:

  * What can be constructed directly with make-uri.
  * What can be parsed, and how this is passed to make-uri.
  * How URIs are represented internally.
  * How URIs are encoded on output.

It sounds like uri-common and uri-generic are doing different things here.

-- 
Alex


Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-15 Thread Peter Bex
On Tue, Jan 15, 2013 at 06:07:06PM +0900, Alex Shinn wrote:
> On Tue, Jan 15, 2013 at 3:03 PM, Ivan Raikov wrote:
> 
> >
> > Percent-encoded sequences of more than one octet will not get touched by
> > pct-decode in the current implementation, so you will not get double
> > escaping. Percent-encoded sequences of one octet will get decoded if they
> > fall in the "unstructured" char-set, as per RFC 3986.
> >
> 
> OK, now I'm thoroughly confused.  The percent-encoding is context sensitive?
> How can this not be broken?
> 
> We need to make the design clear:
> 
>   * What can be constructed directly with make-uri.
>   * What can be parsed, and how this is passed to make-uri.
>   * How URIs are represented internally.
>   * How URIs are encoded on output.
> 
> It sounds like uri-common and uri-generic are doing different things here.

uri-generic is agnostic about specific encodings and types.
uri-common is designed to make life simpler in the case of "common" URIs
like HTTP where we know what types of characters are to be decoded.

RFC3986 "special characters" cannot be decoded unless we know they have
no special meaning.  uri-common just decodes everything fully because
there is generally no deeper nested encoding involved.  It's also smart
enough to know that port 80 belongs to http, so it can be omitted,
whereas uri-generic can't make such assumptions.

uri-common also makes the assumption that query args are
x-www-form-urlencoded.  This is the main reason to prefer it for web
programming; uri-generic doesn't know about form-encoding because that
is really only used in the context of HTML (strictly speaking it's not even
an HTTP thing), so this messy stuff should stay out of the generic URI
library.

Yes, the web is evil and must die.

Cheers,
Peter
-- 
http://sjamaan.ath.cx



Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-15 Thread Alex Shinn
On Tue, Jan 15, 2013 at 6:23 PM, Peter Bex wrote:

> On Tue, Jan 15, 2013 at 06:07:06PM +0900, Alex Shinn wrote:
> > On Tue, Jan 15, 2013 at 3:03 PM, Ivan Raikov wrote:
> >
> > >
> > > Percent-encoded sequences of more than one octet will not get touched by
> > > pct-decode in the current implementation, so you will not get double
> > > escaping. Percent-encoded sequences of one octet will get decoded if they
> > > fall in the "unstructured" char-set, as per RFC 3986.
> > >
> >
> > OK, now I'm thoroughly confused.  The percent-encoding is context sensitive?
> > How can this not be broken?
> >
> > We need to make the design clear:
> >
> >   * What can be constructed directly with make-uri.
> >   * What can be parsed, and how this is passed to make-uri.
> >   * How URIs are represented internally.
> >   * How URIs are encoded on output.
> >
> > It sounds like uri-common and uri-generic are doing different things here.
>
> uri-generic is agnostic about specific encodings and types.
> uri-common is designed to make life simpler in the case of "common" URIs
> like HTTP where we know what types of characters are to be decoded.
>
> RFC3986 "special characters" cannot be decoded unless we know they have
> no special meaning.  uri-common just decodes everything fully because
> there is generally no deeper nested encoding involved.  It's also smart
> enough to know that port 80 belongs to http, so it can be omitted,
> whereas uri-generic can't make such assumptions.
>
> uri-common also makes the assumption that query args are
> x-www-form-urlencoded.  This is the main reason to prefer it for web
> programming; uri-generic doesn't know about form-encoding because that
> is really only used in the context of HTML (it's strictly not even a
> HTTP thing), so this messy stuff should stay out of the generic URI
> library.
>
> Yes, the web is evil and must die.
>

Right, I'm familiar with the evil standards :)  I'm also hoping that we can
have some basic compatibility between Chicken's uri module and Chibi's
(and whatever R7RS WG2 comes up with).

It seems to me the sane thing to do is represent URIs unencoded
internally, which can be generated directly with make-uri or decoded
on parsing.  The decoding might be schema-specific, although
really the only difference is the space-to-+ and query args encoding.

Then, on output we would encode as needed.

I was confused because the uri-generic change Ivan suggests
seems to be putting encoded characters directly in the representation,
whereas uri-common is encoding only on output.

[It also looks like the uri-common encoding is broken - why were bytes
getting lost?]

Finally, regarding parsing I still don't understand why %AB is decoded
into the corresponding octet but %AB%CD is not?

-- 
Alex


Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-15 Thread Peter Bex
On Tue, Jan 15, 2013 at 07:30:07PM +0900, Alex Shinn wrote:
> Right, I'm familiar with the evil standards :)  I'm also hoping that we can
> have some basic compatibility between Chicken's uri module and Chibi's
> (and whatever R7RS WG2 comes up with).

That would be nice indeed.

> It seems to me the sane thing to do is represent URIs unencoded
> internally, which can be generated directly with make-uri or decoded
> on parsing.

That cannot be done in general.  If you decode something like %2F, that
will wreak havoc with path-structured URIs.  The same will happen with
other types of "special" characters; you need to be able to distinguish
between the "special" character as-is and encoded.
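
A minimal sketch of the %2F problem, in Python rather than Scheme and using only the standard library (the two function names are made up for illustration): the order of "split on /" and "percent-decode" decides whether an encoded slash survives as data.

```python
from urllib.parse import unquote


def split_then_decode(path: str) -> list[str]:
    """Split on literal '/' first, then decode each segment."""
    return [unquote(segment) for segment in path.split("/")]


def decode_then_split(path: str) -> list[str]:
    """Decode the whole path first; an encoded '/' becomes a separator."""
    return unquote(path).split("/")


print(split_then_decode("a%2Fb/c"))  # ['a/b', 'c']
print(decode_then_split("a%2Fb/c"))  # ['a', 'b', 'c']
```

Only the first order keeps the two-segment structure, which is why a fully decoded flat string cannot serve as the internal representation of a path.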

These special characters are called "reserved" in the BNF.  As you can
see, the question mark, equals sign and ampersand are in there.
For urlencoded query strings, these *cannot* be decoded, because
then you can't distinguish between

http://calc.example.com?bool-expr=x%26y%3D1
and 
http://calc.example.com?bool-expr=x&y=1

The former should be decoded in uri-common to the alist
((bool-expr . "x&y=1")) and the latter to ((bool-expr . "x") (y . "1")).
By fully decoding all reserved characters in uri-generic, you drop
important information.

All unreserved characters are already fully decoded by uri-generic,
but this leaves the extra decoding of things like the ampersand above
inside the query string components after form-decoding to be done by
uri-common.
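
The same distinction can be reproduced with Python's standard library, as a rough stand-in for what uri-common does with form-encoded queries (this is not the egg's code). The encoded query is written as x%26y%3D1 here so that it decodes to the "x&y=1" value given in the alist.

```python
from urllib.parse import parse_qsl, unquote

encoded = "bool-expr=x%26y%3D1"  # '&' and '=' are data inside one value
literal = "bool-expr=x&y=1"      # '&' and '=' are real delimiters

# Splitting on the delimiters first, then decoding, keeps them apart.
print(parse_qsl(encoded))  # [('bool-expr', 'x&y=1')]
print(parse_qsl(literal))  # [('bool-expr', 'x'), ('y', '1')]

# Decoding the reserved characters before splitting loses the difference.
print(parse_qsl(unquote(encoded)) == parse_qsl(literal))  # True
```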

> The decoding might be schema-specific, although
> really the only difference is the space-to-+ and query args encoding.

No, the conversion to a friendly alist is specific to uri-common.

> I was confused because the uri-generic change Ivan suggests
> seems to be putting encoded characters directly in the representation,
> whereas uri-common is encoding only on output.

I don't understand this either.  I'm at work, so maybe it's just due to
a lack of complete attention.

> [It also looks like the uri-common encoding is broken - why were bytes
> getting lost?]

Probably because it doesn't correctly deal with UTF-8 in the decoding of
URLencoded form data.  I'll need a proper test case and some time to
look into it.

> Finally, regarding parsing I still don't understand why %AB is decoded
> into the corresponding octet but %AB%CD is not?

Unsure.

Cheers,
Peter
-- 
http://sjamaan.ath.cx



Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-15 Thread Alex Shinn
On Tue, Jan 15, 2013 at 7:48 PM, Peter Bex wrote:

>
> These special characters are called "reserved" in the BNF.  As you can
> see, the question mark, equals sign and ampersand are in there.
> For urlencoded query strings, these *cannot* be decoded, because
> then you can't distinguish between
>
> http://calc.example.com?bool-expr=x%26y%3D1
> and
> http://calc.example.com?bool-expr=x&y=1
>
> The former should be decoded in uri-common to the alist
> ((bool-expr . "x&y=1")) and the latter to ((bool-expr . "x") (y . "1")).
> By fully decoding all reserved characters in uri-generic, you drop
> important information.
>

The internal representation is either decoded, or it is encoded.
Either can be made to work.

In this case, the decoded uri-common representation of the former is:

  ((bool-expr . "x&y=1"))

and the decoded representation of the latter is:

  ((bool-expr . "x") (y . "1"))

just as you say, so this is how they are stored in the URI object.

In uri-generic, both get parsed to:

  ((bool-expr . "x&y=1"))

As the RFC states:

   Because the percent ("%") character serves as the indicator for
   percent-encoded octets, it must be percent-encoded as "%25" for that
   octet to be used as data within a URI.

Therefore, if you intended the raw URI data to include a "%",
then the correct representation (for either common or generic)
would have been:

  
http://calc.example.com?bool-expr=x%2526y%253D

So assuming & is _not_ special to the query (as is the case
with uri-generic), escaping & as %26 or not produces the
same result.
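
To make the "%25" rule concrete, here is a small Python sketch using only the standard library. The value x&y=1 is taken from the alist example; the variable names are illustrative.

```python
from urllib.parse import quote, unquote

once = quote("x&y=1", safe="")  # escape the reserved characters once
twice = quote(once, safe="")    # escape the '%' signs themselves

print(once)   # x%26y%3D1
print(twice)  # x%2526y%253D1

# One round of decoding yields the text "x%26y%3D1" as data,
# a second round yields the original value.
print(unquote(twice) == once)              # True
print(unquote(unquote(twice)) == "x&y=1")  # True
```

So a URI whose data really contains the characters "%26" has to carry "%2526", and a single decoding pass must not be applied twice.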

-- 
Alex


Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-15 Thread Peter Bex
On Wed, Jan 16, 2013 at 12:39:16AM +0900, Alex Shinn wrote:
> The internal representation is either decoded, or it is encoded.
> Either can be made to work.
> 
> In this case, the decoded uri-common representation of the former is:
> 
>   ((bool-expr . "x&y=1"))
> 
> and the decoded representation of the latter is:
> 
>   ((bool-expr . "x") (y . "1"))
> 
> just as you say, so this is how they are stored in the URI object.
> 
> In uri-generic, both get parsed to:
> 
>   ((bool-expr . "x&y=1"))

This cannot work because uri-common is re-using uri-generic's parser.
Also, uri-generic doesn't do alist-decoding at all, because form-encoding
is an HTML affair and has nothing to do with HTTP or URI standards.

> Therefore, if you intended the raw URI data to include a "%",
> then the correct representation (for either common or generic)
> would have been:
> 
>   
> http://calc.example.com?bool-expr=x%2526y%253D
> 
> So assuming & is _not_ special to the query (as is the case
> with uri-generic), escaping & with %25 or not produces the
> same result.

If you can make it work for both libraries, feel free to do so, but
my energy to work on web stuff is very very low at the moment.

Cheers,
Peter
-- 
http://sjamaan.ath.cx



Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-15 Thread Alex Shinn
On Wed, Jan 16, 2013 at 12:59 AM, Peter Bex wrote:

> On Wed, Jan 16, 2013 at 12:39:16AM +0900, Alex Shinn wrote:
> > The internal representation is either decoded, or it is encoded.
> > Either can be made to work.
> >
> > In this case, the decoded uri-common representation of the former is:
> >
> >   ((bool-expr . "x&y=1"))
> >
> > and the decoded representation of the latter is:
> >
> >   ((bool-expr . "x") (y . "1"))
> >
> > just as you say, so this is how they are stored in the URI object.
> >
> > In uri-generic, both get parsed to:
> >
> >   ((bool-expr . "x&y=1"))
>
> This cannot work because uri-common is re-using uri-generic's parser.
> Also, uri-generic doesn't do alist-decoding at all, because form-encoding
> is a HTML affair and has nothing to do with HTTP or URI standards.
>

Ah, OK, there may be implementation details on why you
store encoded or decoded.

Anyway, this isn't really important.  I'm mostly concerned
with making utf8 do the right thing, and was wondering what
the API was because it's not clear from the docs.

Put another way, do uri-path and uri-query return the
encoded or decoded values (maybe differently for uri-common
and uri-generic)?

-- 
Alex


Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-16 Thread Peter Bex
On Wed, Jan 16, 2013 at 11:22:57AM +0900, Alex Shinn wrote:
> Anyway, this isn't really important.  I'm mostly concerned
> with making utf8 do the right thing, and was wondering what
> the API was because it's not clear from the docs.

OK, I think it's worth figuring this out.

> Put another way, do uri-path and uri-query return the
> encoded or decoded values (maybe differently for uri-common
> and uri-generic)?

The decoded values.  In the case of uri-generic they're only
partially decoded (reserved chars stay encoded).

Example:

(use uri-generic)
(uri-path (uri-reference "%66%6F%6F")) => "foo"
(uri-path (uri-reference "%20")) => "%20"

(use uri-common)
(uri-path (uri-reference "%66%6F%6F")) => "foo"
(uri-path (uri-reference "%20")) => " "
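
The difference between the two behaviours can be sketched in Python. This is an illustration of partial versus full decoding, not the eggs' code; decode_unreserved is a made-up helper and the unreserved set is the one from RFC 3986 (letters, digits, "-", ".", "_", "~").

```python
import re
import string
from urllib.parse import unquote

# RFC 3986 "unreserved" characters.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def decode_unreserved(text: str) -> str:
    """Decode %XX only when it stands for an unreserved character."""
    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0)

    return re.sub(r"%([0-9A-Fa-f]{2})", replace, text)


print(decode_unreserved("%66%6F%6F"))  # foo   (partial decoding)
print(decode_unreserved("%20"))        # %20   (space is not unreserved)
print(repr(unquote("%20")))            # ' '   (full decoding)
```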

Cheers,
Peter
-- 
http://sjamaan.ath.cx



Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-16 Thread Peter Bex
On Tue, Jan 15, 2013 at 02:44:08PM +0900, Alex Shinn wrote:
> This result looks broken.  As I noted in my previous mail, the URI
> representation already handles non-ASCII characters and escapes on output:
> 
> $ csi -R uri-common
> #;1> (make-uri scheme: "http" host: "127.0.0.1" path: '(/ "삼계탕"))
> # query=#f fragment=#f>
> #;2> (uri->string (make-uri scheme: "http" host: "127.0.0.1" path: '(/
> "삼계탕")))
> "http://127.0.0.1/82%BCB3%8483%95"
> 
> Unrelated, the actual escaped output looks buggy - it looks like
> some characters like the leading "%EC%" are getting dropped.

OK, I took some time to investigate and I pinpointed this problem.
It appears to be caused by the use of core srfi-14 and srfi-13 in
uri-generic; their char-set operations simply don't deal with anything
beyond ASCII.  It only works after switching to the UTF-8 versions
(utf8-srfi-14, utf8-srfi-13 and unicode-char-sets):

Without patch:
$ csi -R uri-generic -P '(uri-encode-string "삼계탕")'
"�%82%BC�%B3%84�%83%95"

With patch:
$ csi -R uri-generic -P '(uri-encode-string "삼계탕")'
"%EC%82%BC%EA%B3%84%ED%83%95"

Ivan, what do you think about adding the UTF8 dependency, as per the
attached patch (against trunk)?

Cheers,
Peter
-- 
http://sjamaan.ath.cx
Index: uri-generic.scm
===
--- uri-generic.scm (revision 28113)
+++ uri-generic.scm (working copy)
@@ -57,13 +57,9 @@
 
 (import chicken scheme extras data-structures ports)
  
-(require-extension matchable defstruct srfi-1 srfi-4 srfi-13 srfi-14)
+(require-extension matchable defstruct srfi-1 srfi-4
+   utf8-srfi-13 utf8-srfi-14 unicode-char-sets)
 
-;; What to do with these?
-#;(cond-expand
-   (utf8-strings (use utf8-srfi-13 utf8-srfi-14))
-   (else (use srfi-13 srfi-14)))
-
 (defstruct URI  scheme authority path query fragment)
 (defstruct URIAuth  username password host port)
 
Index: uri-generic.meta
===
--- uri-generic.meta (revision 28113)
+++ uri-generic.meta (working copy)
@@ -17,7 +17,7 @@
 
  ; A list of eggs uri-generic depends on.
 
- (needs matchable defstruct)
+ (needs matchable defstruct utf8)
  (test-depends test)
 
  (author "Ivan Raikov and Peter Bex")
Index: tests/run.scm
===
--- tests/run.scm   (revision 28113)
+++ tests/run.scm   (working copy)
@@ -201,7 +201,10 @@
   '(("foo?bar" "foo%3Fbar")
 ("foo&bar" "foo%26bar")
 ("foo%20bar" "foo%2520bar")
-("foo\x00bar\n" "foo%00bar%0A")))
+("foo\x00bar\n" "foo%00bar%0A")
+;; Non-ASCII (Unicode) characters should also be pct-encoded
+;; (reported by Sungjin Chun)
+("삼계탕" "%EC%82%BC%EA%B3%84%ED%83%95")))
 
 (test-group "uri-encode-string test"
   (for-each (lambda (p)
@@ -588,4 +591,4 @@
 
 (test-end "uri-generic")
 
-(unless (zero? (test-failure-count)) (exit 1))
\ No newline at end of file
+(unless (zero? (test-failure-count)) (exit 1))


Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-16 Thread Ivan Raikov
Hi Peter,

I think that allowing raw UTF-8 sequences in uri-generic breaks
compatibility with RFC 3986. In other words, if you construct a URI with a
UTF-8 sequence that happens to include reserved ASCII characters, those
ASCII characters will not get escaped, and you could potentially be sending
an invalid URI to a legacy system that does not understand UTF-8. For
example, the UTF-8 string "пиле" consists of the octets D0 BF D0 B8 D0 BB
D0 B5. These octets all lie outside the ASCII range, and therefore outside
the allowed character set defined in RFC 3986, so the string is correctly
rejected by the uri-reference constructor. However, if we allow UTF-8 string
operations in uri-generic, and extend the unreserved character set to
include Unicode, these octets will form a valid character sequence and will
get accepted by uri-reference without being escaped. If you then send the
result of uri->string  to a system that does not understand UTF-8, the URI
will get rejected.

  My proposed solution is to add a UTF-8-aware constructor to
uri-generic and prevent percent decoding of UTF-8 sequences. I believe that
this solution is compatible with the IRI to URI mapping scheme described in
Section 3.1 of RFC 3987, but indeed I need to extend the uri-generic test
suite with more UTF-8 examples to ensure that nothing is broken. I think
that any solution will have to give the user the choice of whether to use
ASCII or UTF-8, and not just default to UTF-8.
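
A simplified sketch of such a mapping, in Python: it escapes every character above U+007F as its UTF-8 octets and leaves all ASCII, including existing percent escapes and delimiters, untouched. The name iri_to_uri is hypothetical, and this treats every non-ASCII character alike, whereas RFC 3987 restricts which characters are allowed and handles internationalized host names separately.

```python
import re


def iri_to_uri(iri: str) -> str:
    """Percent-encode only the non-ASCII characters of `iri` (sketch)."""
    def escape(match: re.Match) -> str:
        return "".join(f"%{octet:02X}" for octet in match.group(0).encode("utf-8"))

    return re.sub(r"[^\x00-\x7F]+", escape, iri)


print(iri_to_uri("http://example.com/пиле"))
# http://example.com/%D0%BF%D0%B8%D0%BB%D0%B5
print(iri_to_uri("http://example.com/a%20b?x=1"))
# http://example.com/a%20b?x=1
```

The result contains only ASCII, so it can be handed to an RFC 3986 parser or to a legacy system unchanged.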

   Ivan



Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-17 Thread Peter Bex
On Thu, Jan 17, 2013 at 09:35:36AM +0900, Ivan Raikov wrote:
> Hi Peter,
> 
> I think that allowing raw UTF-8 sequences in uri-generic breaks
> compatibility with RFC 3986. In other words, if you construct a URI with a
> UTF-8 sequence that happens to include reserved ASCII characters, those
> ASCII characters will not get escaped, and you could potentially be sending
> an invalid URI to a legacy system that does not understand UTF-8.

Hi Ivan,

I agree with your assessment, but the way it currently silently mangles
input isn't ideal.  I think it would be good if all constructors raised
an exception when receiving octets with the high bit set (this is
non-ASCII, which means it falls outside the scope of RFC 3986 so it's
acceptable to raise an exception).  What are your thoughts on this?
If we do this, of course the error message should include a pointer to
the new UTF conversion API so people know what to do.
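
As a rough illustration of that behaviour (in Python, with made-up names; the real constructors and the future conversion procedure may look quite different), a constructor could validate its input like this:

```python
class NonAsciiURIError(ValueError):
    """Raised when a URI string contains characters outside ASCII."""


def require_ascii(uri_string: str) -> str:
    """Return `uri_string` unchanged, or raise if it is not pure ASCII."""
    for index, char in enumerate(uri_string):
        if ord(char) > 0x7F:
            raise NonAsciiURIError(
                f"non-ASCII character {char!r} at index {index}; "
                "convert the IRI to a URI before constructing it"
            )
    return uri_string
```

Raising early, with a message that names the conversion step, avoids both the silent mangling and an unexplained #f.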

>   My proposed solution is to include a UTF-8 aware constructor to
> uri-generic and prevent percent decoding of UTF-8 sequences. I believe that
> this solution is compatible with the IRI to URI mapping scheme described in
> Section 3.1 of RFC 3987, but indeed I need to extend the uri-generic test
> suite with more UTF-8 examples to ensure that nothing is broken. I think
> that any solution will have to give the user choice whether to use ASCII or
> UTF-8, and not just default to UTF-8.

This seems like a good compromise.  Unfortunately it means the API will
grow quite a bit and make it less easy to use.  I'll need to consider
what to do with http-client's "implicit" URI conversion though
(it accepts either strings or URI objects).  I guess for now I'll keep
it the way it is.  If people need UTF8 they can use the new conversion
procedures.  Maybe later I can change it, this should not cause any
breakage (unless talking to legacy systems, but those don't accept UTF
anyway so if you have UTF-8 input, there's a problem anyway)

Cheers,
Peter
-- 
http://sjamaan.ath.cx



Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-22 Thread Ivan Raikov
Hi Peter,

   I think uri-generic does not silently mangle input upon receiving UTF-8;
it just returns #f. I think it is not a bad idea to raise an exception
instead.
I have not yet had the chance to thoroughly test the UTF-8 mapping
constructor, but will try to do this during the weekend.

Ivan




Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-22 Thread Alex Shinn
On Thu, Jan 17, 2013 at 4:51 AM, Peter Bex wrote:

> On Tue, Jan 15, 2013 at 02:44:08PM +0900, Alex Shinn wrote:
> > This result looks broken.  As I noted in my previous mail, the URI
> > representation already handles non-ASCII characters and escapes on output:
> >
> > $ csi -R uri-common
> > #;1> (make-uri scheme: "http" host: "127.0.0.1" path: '(/ "삼계탕"))
> > # query=#f fragment=#f>
> > #;2> (uri->string (make-uri scheme: "http" host: "127.0.0.1" path: '(/
> > "삼계탕")))
> > "http://127.0.0.1/82%BCB3%8483%95"
> >
> > Unrelated, the actual escaped output looks buggy - it looks like
> > some characters like the leading "%EC%" are getting dropped.
>
> OK, I took some time to investigate and I pinpointed this problem.
> This appears to happen due to the use of core srfi-14 and srfi-13 in
> uri-generic; its char-set operations simply don't deal with anything
> beyond ASCII.


As an aside from the uri discussion, we really need to fix srfi-14.

The reference implementation is terrible.  Not only does it not
handle Unicode, but it doesn't not-handle it gracefully:

#;1> (char-set-contains? char-set:full #\x100)
Error: (string-ref) out of range [...]

At a minimum we should avoid these errors, but really we
should be using a Unicode-aware implementation - there's no
barrier to doing so like there is for Unicode strings.  We could
just move utf8-srfi-14 into the core, or I could patch up the
srfi-14 implementation to handle wide chars properly (but maybe
slowly) without bringing in the iset dependency.

-- 
Alex


Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-23 Thread Alex Shinn
On Wed, Jan 23, 2013 at 3:45 PM, Ivan Raikov wrote:

> Yes, I ran into this when I was adding UTF-8 support to mbox... If you
> were to add wide char support in srfi-14, is there a way to quantify the
> performance penalty?
>

To add the bounds check so it doesn't error?  Practically
nothing.

To branch to a separate path for a wide-char table if
the bounds check fails?  Same cost if the input is ASCII.

For efficient handling in the case of Unicode input...
how small/fast do you want it?
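
The first two options can be sketched as follows, in Python and under the assumption (suggested by the #\x100 error above) that the core implementation indexes a 256-entry table. The class is illustrative only and is not how srfi-14 or utf8-srfi-14 are written.

```python
class CharSet:
    """Char-set with a 256-entry table plus a fallback set for wide chars."""

    TABLE_SIZE = 256  # assumed size of the fast-path table

    def __init__(self, chars: str = "") -> None:
        self._table = [False] * self.TABLE_SIZE
        self._wide: set[int] = set()
        for char in chars:
            code = ord(char)
            if code < self.TABLE_SIZE:
                self._table[code] = True
            else:
                self._wide.add(code)

    def contains(self, char: str) -> bool:
        code = ord(char)
        if code < self.TABLE_SIZE:  # bounds check: one comparison
            return self._table[code]
        return code in self._wide   # separate path for wide characters
```

ASCII lookups still cost one comparison and one table access; only characters beyond the table take the slower path, and an out-of-range character returns false instead of raising an error.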

-- 
Alex

On Wed, Jan 23, 2013 at 3:42 PM, Alex Shinn  wrote:

> On Thu, Jan 17, 2013 at 4:51 AM, Peter Bex  wrote:
>>
>>> On Tue, Jan 15, 2013 at 02:44:08PM +0900, Alex Shinn wrote:
>>> > This result looks broken.  As I noted in my previous mail, the URI
>>> > representation already handles non-ASCII characters and escapes on
>>> output:
>>> >
>>> > $ csi -R uri-common
>>> > #;1> (make-uri scheme: "http" host: "127.0.0.1" path: '(/ "삼계탕"))
>>> > #>> > query=#f fragment=#f>
>>> > #;2> (uri->string (make-uri scheme: "http" host: "127.0.0.1" path: '(/
>>> > "삼계탕")))
>>> > "http://127.0.0.1/82%BCB3%8483%95";
>>> >
>>> > Unrelated, the actual escaped output looks buggy - it looks like
>>> > some characters like the leading "%EC%" are getting dropped.
>>>
>>> OK, I took some time to investigate and I pinpointed this problem.
>>> This appears to happen due to the use of core srfi-14 and srfi-13 in
>>> uri-generic; its char-set operations simply don't deal with anything
>>> beyond ASCII.
>>
>>
>> As an aside from the uri discussion, we really need to fix srfi-14.
>>
>> The reference implementation is terrible.  Not only does it not
>> handle Unicode, but it doesn't not-handle it gracefully:
>>
>> #;1> (char-set-contains? char-set:full #\x100)
>> Error: (string-ref) out of range [...]
>>
>> At a minimum we should avoid these errors, but really we
>> should be using a Unicode-aware implementation - there's no
>> barrier to doing so like there is for Unicode strings.  We could
>> just move utf8-srfi-14 into the core, or I could patch up the
>> srfi-14 implementation to handle wide chars properly (but maybe
>> slowly) without bringing in the iset dependency.
>>
>> --
>> Alex


Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-23 Thread Peter Bex
On Wed, Jan 23, 2013 at 03:29:01PM +0900, Ivan Raikov wrote:
> Hi Peter,
> 
>I think uri-generic does not silently mangle input upon receiving UTF-8,
> it just returns #f.

When parsing, yes.  I think this should stay the way it is (see below).
What I was referring to here was the example in my earlier mail when
passing strings with UTF-8 characters to uri-encode-string, and also when
passing them to the make-uri constructor directly.

Look back to the mail, it contains an example of a string that gets
mangled, which is what led Alex to question the correctness of
uri-common/uri-generic's decoding of special characters.
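
For reference, a small sketch of what byte-wise percent-encoding of
that path segment should produce. It works on a plain list of byte
values so it does not depend on how any particular string
implementation represents UTF-8. The helper names are made up for
illustration and are not part of uri-generic or uri-common.

```scheme
(define hex-digits "0123456789ABCDEF")

;; Encode one byte (0..255) as "%XX".
(define (percent-encode-byte n)
  (string #\%
          (string-ref hex-digits (quotient n 16))
          (string-ref hex-digits (remainder n 16))))

;; Encode a list of bytes, escaping every one of them.
(define (percent-encode-bytes bytes)
  (apply string-append (map percent-encode-byte bytes)))

;; The nine UTF-8 bytes of "삼계탕" (U+C0BC U+ACC4 U+D0D5).
(percent-encode-bytes '(#xEC #x82 #xBC #xEA #xB3 #x84 #xED #x83 #x95))
;; => "%EC%82%BC%EA%B3%84%ED%83%95"
```

Compared with the mangled "82%BCB3%8483%95", the lead byte of each
character (EC, EA, ED) is missing entirely and the following byte has
lost its "%".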

> I think it is not a bad idea to raise an exception instead.

This would be extremely painful for the end-user.  Either a string
matches the BNF grammar in the RFC and it parses, or it doesn't match:
it'll return a URI object or #f.  Handling a third case with the
associated exception-catching machinery will make URI-handling code
needlessly complex.
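
A rough sketch of what the two-outcome convention looks like for the
caller, assuming CHICKEN 4 syntax and an installed uri-common egg.
describe-uri is a made-up helper for illustration only.

```scheme
(use uri-common)   ; CHICKEN 4 syntax; assumes the uri-common egg

;; uri-reference yields either a URI object or #f, so a single cond
;; clause covers the failure case without any exception handling.
(define (describe-uri str)
  (cond ((uri-reference str)
         => (lambda (u) (uri->string u)))
        (else "not a valid URI")))
```

With an exception as a third outcome, every such call site would also
need a handler wrapped around it.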

> I have not yet had the chance to thoroughly test the UTF-8 mapping
> constructor, but will try to do this during the weekend.

Cool!

Cheers,
Peter
-- 
http://sjamaan.ath.cx



Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

2013-01-25 Thread Alex Shinn
On Wed, Jan 23, 2013 at 5:09 PM, Alex Shinn  wrote:

> On Wed, Jan 23, 2013 at 3:45 PM, Ivan Raikov wrote:
>
>> Yes, I ran into this when I was adding UTF-8 support to mbox... If you
>> were to add wide char support in srfi-14, is there a way to quantify the
>> performance penalty?
>>
>
> To add the bounds check so it doesn't error?  Practically
> nothing.
>
> To branch to a separate path for a wide-char table if
> the bounds check fails?  Same cost if the input is ASCII.
>
> For efficient handling in the case of Unicode input...
> how small/fast do you want it?
>

I've never met such stony silence in response to an offer to do work...

I ran the following simple char-set-contains? benchmark with
a few variations:

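  ;; needs (use srfi-14) or (use utf8-srfi-14), depending on the variant
  (time
   (do ((i 0 (+ i 1)))
       ((= i 1))         ; outer repeat count as it appears in the archive
     (do ((j 0 (+ j 1)))
         ((= j 256))     ; probe every Latin-1 code point
       (char-set-contains? char-set:letter (integer->char j)))))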

This is what most people are concerned about for speed, as
the boolean and construction operations are less common.

The results:

;; reference implementation
;; 0.312s CPU time, 1/2059 GCs (major/minor)

;; "fixed" reference implementation (no error but no support for
;; non-latin-1)
;; 0.257s CPU time, 1/1706 GCs (major/minor)

;; utf8-srfi-14 with full Unicode char-set:letter
;; 0.243s CPU time, 0/1526 GCs (major/minor)

;; utf8-srfi-14 with ASCII-only char-set:letter
;; 0.242s CPU time, 0/1526 GCs (major/minor)

I was able to add the check and make the reference
implementation faster because I fixed the common case -
it was optimized for checking for 0 instead of 1.

Even with the enormous and complex definition of a
Unicode "letter", utf8-srfi-14 is faster than srfi-14.

As for what we want in Chicken, the answer depends
on what you're optimizing for.  utf8-srfi-14 will always
win for space, and generally for speed as well.

If the biggest concern is code-size, then you might want
to borrow the char-set definition from irregex and use
that as a "fallback" for non-latin-1 chars in the srfi-14
reference impl.  This would have the same perf as
srfi-14 for latin-1, yet still support full Unicode and not
increase the size of the Chicken distribution.

-- 
Alex