Re: [Chicken-users] UTF-8 support in Chicken core [Was: [Q] uri-common has problem with UTF-8 uri.]

2013-01-28 Thread .alyn.post.
I'll throw in my two bits here.

I'm not personally decided whether utf-8 in core would be an
improvement.  I don't have enough background or knowledge of
the internals to contribute to that decision.

I can offer this, however:

I have found that I have to use utf-8 support in every project
I've written in Chicken.  I do so, and have only had a problem
when the utf-8 egg did not map a procedure from core properly.

I'm getting by just fine with the current state of affairs, and
I do have a certain nostalgic love of ASCII.  If I *could* get
away with only having ASCII, I would.  This has not been true
in practice.

My experience with numbers is slightly different, where I do
find I need to do word-level calculation where I depend on the
underlying machine implementation of character- and pointer-sized
integers.  I use the fx versions of these functions when I do
rely on this, but I mainly have found I must intentionally subvert
the numeric tower to get a specific behavior.  This has never been
true when I've dealt with characters.

FWIW,

-Alan

On Sun, Jan 27, 2013 at 10:43:41AM +0900, Ivan Raikov wrote:
Hi Alex,
 
Yes, I would have thought that more people would be interested in
having UTF-8 support in core Chicken (or at least wide-char compatible
srfi-14). I have changed the title of this thread to reflect the subject
more accurately :-)
 
Personally, I think that adding UTF-8 in core is much better than the
hacks I had to do in mbox, and is a no brainer considering the benchmark
results you have below. But I am sure that opinions vary on this
subject...
 
Can you post your bounds-check patches to srfi-14 on the mailing list,
and/or create a ticket for it? Hopefully there will be more responses this
time.
 
Ivan
On Sat, Jan 26, 2013 at 1:42 PM, Alex Shinn [1]alexsh...@gmail.com
wrote:
 
  On Wed, Jan 23, 2013 at 5:09 PM, Alex Shinn [2]alexsh...@gmail.com
  wrote:
 
On Wed, Jan 23, 2013 at 3:45 PM, Ivan Raikov
[3]ivan.g.rai...@gmail.com wrote:
 
  Yes, I ran into this when I was adding UTF-8 support to mbox... If
  you were to add wide char support in srfi-14, is there a way to
  quantify the performance penalty?
 
To add the bounds check so it doesn't error? Practically
nothing.
To branch to a separate path for a wide-char table if
the bounds check fails? Same cost if the input is ASCII.
For efficient handling in the case of Unicode input...
how small/fast do you want it?
 
  I've never met such stony silence in response to an offer to do work...
  I ran the following simple char-set-contains? benchmark with
  a few variations:
    (time
     (do ((i 0 (+ i 1)))
         ((= i 1))
       (do ((j 0 (+ j 1)))
           ((= j 256))
         (char-set-contains? char-set:letter (integer->char j)))))
  This is what most people are concerned about for speed, as
  the boolean and construction operations are less common.
  The results:
  ;; reference implementation
  ;; 0.312s CPU time, 1/2059 GCs (major/minor)
  ;; fixed reference implementation (no error but no support for
  ;; non-latin-1)
  ;; 0.257s CPU time, 1/1706 GCs (major/minor)
  ;; utf8-srfi-14 with full Unicode char-set:letter
  ;; 0.243s CPU time, 0/1526 GCs (major/minor)
  ;; utf8-srfi-14 with ASCII-only char-set:letter
  ;; 0.242s CPU time, 0/1526 GCs (major/minor)
  I was able to add the check and make the reference
  implementation faster because I fixed the common case -
  it was optimized for checking for 0 instead of 1.
  Even with the enormous and complex definition of a
  Unicode letter, utf8-srfi-14 is faster than srfi-14.
  As for what we want in Chicken, the answer depends
  on what you're optimizing for. utf8-srfi-14 will always
  win for space, and generally for speed as well.
  If the biggest concern is code-size, then you might want
  to borrow the char-set definition from irregex and use
  that as a fallback for non-latin-1 chars in the srfi-14
  reference impl. This would have the same perf as
  srfi-14 for latin-1, yet still support full Unicode and not
  increase the size of the Chicken distribution.
  --
  Alex
 
 References
 
Visible links
1. mailto:alexsh...@gmail.com
2. mailto:alexsh...@gmail.com
3. mailto:ivan.g.rai...@gmail.com

 ___
 Chicken-users mailing list
 Chicken-users@nongnu.org
 https://lists.nongnu.org/mailman/listinfo/chicken-users


-- 
my personal website: http://c0redump.org/



Re: [Chicken-users] UTF-8 support in Chicken core [Was: [Q] uri-common has problem with UTF-8 uri.]

2013-01-27 Thread Alex Shinn
On Sun, Jan 27, 2013 at 10:43 AM, Ivan Raikov ivan.g.rai...@gmail.com wrote:


 Hi Alex,

 Yes, I would have thought that more people would be interested in
 having UTF-8 support in core Chicken (or at least wide-char compatible
 srfi-14). I have changed the title of this thread to reflect the subject
 more accurately :-)

 Personally, I think that adding UTF-8 in core is much better than the
 hacks I had to do in mbox, and is a no brainer considering the benchmark
 results you have below.  But I am sure that opinions vary on this subject...

Can you post your bounds-check patches to srfi-14 on the mailing list,
 and/or create a ticket for it? Hopefully there will be more responses this
 time.


Well, I'm not necessarily proposing UTF-8 support in the core.
I understand that has pros and cons and opinions may differ.

I was just pointing out that we've already got 3 char-set
implementations, 2 of them in the core distribution, and
there are no real cons to simplifying this and replacing
srfi-14 with one of the Unicode-capable implementations.

The simplest change I made was replacing:

(define-inline (si=0? s i) (zero? (%char->latin1 (string-ref s i))))
(define-inline (si=1? s i) (not (si=0? s i)))

with:

(define-inline (si=0? s i)
  (if (>= i 256) #t (zero? (%char->latin1 (string-ref s i)))))
(define-inline (si=1? s i)
  (and (< i 256) (eq? 1 (%char->latin1 (string-ref s i)))))

which is actually faster and while it doesn't support
wide char-sets, at least gives the correct answers when
passed wide chars.
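
For anyone who wants to try the idea outside the srfi-14 source, here is a
minimal standalone sketch of the same bounds check. It assumes the set is
stored as a 256-character string whose i-th character has code 1 when code
point i is a member; the names make-latin1-set, set-contains? and letters are
made up for this example and are not part of any egg.

```scheme
;; Hypothetical stand-in for a latin-1 char-set table: a 256-character
;; string where position i holds (integer->char 1) if code point i is
;; in the set and (integer->char 0) otherwise.
(define (make-latin1-set pred)
  (let ((s (make-string 256 (integer->char 0))))
    (do ((i 0 (+ i 1)))
        ((= i 256) s)
      (if (pred (integer->char i))
          (string-set! s i (integer->char 1))))))

;; Bounds-checked membership test: a code point of 256 or more is
;; reported as "not a member" instead of indexing past the table.
(define (set-contains? s c)
  (let ((i (char->integer c)))
    (and (< i 256)
         (= 1 (char->integer (string-ref s i))))))

(define letters (make-latin1-set char-alphabetic?))

(set-contains? letters #\a)                    ; in range, a letter
(set-contains? letters #\1)                    ; in range, not a letter
(set-contains? letters (integer->char #x3bb))  ; out of range: #f, no error
```

The last call is the case that used to raise an error: the table is never
touched for a wide char, so the answer is simply #f.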

-- 
Alex


 Ivan

 On Sat, Jan 26, 2013 at 1:42 PM, Alex Shinn alexsh...@gmail.com wrote:

 On Wed, Jan 23, 2013 at 5:09 PM, Alex Shinn alexsh...@gmail.com wrote:

 On Wed, Jan 23, 2013 at 3:45 PM, Ivan Raikov ivan.g.rai...@gmail.com wrote:

 Yes, I ran into this when I was adding UTF-8 support to mbox... If you
 were to add wide char support in srfi-14, is there a way to quantify the
 performance penalty?


 To add the bounds check so it doesn't error?  Practically
 nothing.

 To branch to a separate path for a wide-char table if
 the bounds check fails?  Same cost if the input is ASCII.

 For efficient handling in the case of Unicode input...
 how small/fast do you want it?


 I've never met such stony silence in response to an offer to do work...

 I ran the following simple char-set-contains? benchmark with
 a few variations:

   (time
    (do ((i 0 (+ i 1)))
        ((= i 1))
      (do ((j 0 (+ j 1)))
          ((= j 256))
        (char-set-contains? char-set:letter (integer->char j)))))

 This is what most people are concerned about for speed, as
 the boolean and construction operations are less common.

 The results:

 ;; reference implementation
 ;; 0.312s CPU time, 1/2059 GCs (major/minor)

 ;; fixed reference implementation (no error but no support for
 ;; non-latin-1)
 ;; 0.257s CPU time, 1/1706 GCs (major/minor)

 ;; utf8-srfi-14 with full Unicode char-set:letter
 ;; 0.243s CPU time, 0/1526 GCs (major/minor)

 ;; utf8-srfi-14 with ASCII-only char-set:letter
 ;; 0.242s CPU time, 0/1526 GCs (major/minor)

 I was able to add the check and make the reference
 implementation faster because I fixed the common case -
 it was optimized for checking for 0 instead of 1.

 Even with the enormous and complex definition of a
 Unicode letter, utf8-srfi-14 is faster than srfi-14.

 As for what we want in Chicken, the answer depends
 on what you're optimizing for.  utf8-srfi-14 will always
 win for space, and generally for speed as well.

 If the biggest concern is code-size, then you might want
 to borrow the char-set definition from irregex and use
 that as a fallback for non-latin-1 chars in the srfi-14
 reference impl.  This would have the same perf as
 srfi-14 for latin-1, yet still support full Unicode and not
 increase the size of the Chicken distribution.
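
A rough sketch of what such a fallback could look like, purely as an
illustration: latin-1 chars take the fast path, and anything wider is looked
up in a sorted vector of inclusive code-point ranges by binary search. The
range-vector layout and the names ranges-contain?, wide-letter-ranges and
letter? are assumptions made for this example, not the actual irregex or
srfi-14 data structures, and the two Greek ranges are only a tiny sample
rather than a full Unicode letter table.

```scheme
;; Hypothetical wide-char fallback: RANGES is a vector of (low . high)
;; pairs, sorted by code point and non-overlapping. Binary search keeps
;; the lookup logarithmic in the number of ranges.
(define (ranges-contain? ranges i)
  (let loop ((lo 0) (hi (- (vector-length ranges) 1)))
    (and (<= lo hi)
         (let* ((mid (quotient (+ lo hi) 2))
                (r (vector-ref ranges mid)))
           (cond ((< i (car r)) (loop lo (- mid 1)))
                 ((> i (cdr r)) (loop (+ mid 1) hi))
                 (else #t))))))

;; Illustrative sample only: two blocks of Greek letters.
(define wide-letter-ranges
  (vector (cons #x391 #x3a1)    ; capital Alpha .. Rho
          (cons #x3b1 #x3c9)))  ; small alpha .. omega

;; Two-tier test: code points below 256 use the fast path (here
;; char-alphabetic? stands in for the latin-1 table), everything else
;; falls back to the range search.
(define (letter? c)
  (let ((i (char->integer c)))
    (if (< i 256)
        (char-alphabetic? c)
        (ranges-contain? wide-letter-ranges i))))
```

With this shape, ASCII and latin-1 input never reaches the range search, so
the extra cost for that input is the single comparison against 256.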

 --
 Alex


