Dealing with invalid UTF-8 [was: Re: Faster UTF-8 decoding in GLib]

2010-03-29 Thread Daniel Elstner
Hi Behdad, > Well, there's a bit more to it. Just because some bytes in a file are invalid > according to the spec doesn't mean your text editor should refuse to open the > file. While g_utf8_get_char() and friends do assume valid UTF-8 data, it's an > unwritten assumption that for invalid bytes

Re: Faster UTF-8 decoding in GLib

2010-03-27 Thread Daniel Elstner
Hi, On Saturday, 27.03.2010, at 18:04 -0400, Behdad Esfahbod wrote: > Sure, I wasn't referring to valid data. In valid UTF-8, there are no 5-byte or > 6-byte sequences either. True, but that was a post-hoc restriction, imposed when Unicode was redefined as a 21-bit character set, presu

Re: Faster UTF-8 decoding in GLib

2010-03-27 Thread Daniel Elstner
Hi, On Saturday, 27.03.2010, at 17:40 -0400, Behdad Esfahbod wrote: > On 03/27/2010 05:21 PM, Daniel Elstner wrote: > > Well, I assume that ints are at least 32 bits wide on any platform > > supported by GLib. But if you meant to say that it would break with > > larger ints,

Re: Faster UTF-8 decoding in GLib

2010-03-27 Thread Daniel Elstner
Hi, On Saturday, 27.03.2010, at 16:51 -0400, Behdad Esfahbod wrote: > On 03/27/2010 04:27 PM, Daniel Elstner wrote: > > > > It is not meant to check for errors. > > Good point. > > > I think it is totally arbitrary to handle some potential errors but not >

Re: Faster UTF-8 decoding in GLib

2010-03-27 Thread Daniel Elstner
Hi, On Saturday, 27.03.2010, at 16:12 -0400, Behdad Esfahbod wrote: > Err, you're right. My bad. It's still broken though since it doesn't check > that the fragment bytes all start with the bits 10. Missing error checking. It is not meant to check for errors. I think it is totally arbitrary
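
The check described above as missing, that every trailing byte of a multi-byte sequence begins with the bits 10, could be sketched in C roughly as follows. This is only an illustration: the helper name `is_continuation_byte` is hypothetical and not part of GLib, and the reply in this message is precisely that the decoder under discussion is not meant to do this kind of validation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not a GLib function.
 * A UTF-8 continuation byte has the bit pattern 10xxxxxx, so masking
 * with 0xC0 (11000000) must leave exactly 0x80 (10000000). */
bool is_continuation_byte(uint8_t c)
{
    return (c & 0xC0u) == 0x80u;
}
```

A validating decoder would apply such a test to each byte after the lead byte; a decoder that assumes valid input skips it.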

Re: Faster UTF-8 decoding in GLib

2010-03-26 Thread Daniel Elstner
Hi again, On Friday, 26.03.2010, at 22:43 +0100, Daniel Elstner wrote: > On Friday, 26.03.2010, at 13:25 -0400, Behdad Esfahbod wrote: > > > * The construct borrowed from glibmm, as beautiful as it is, is WRONG > > for > > 6-byte-long UTF-8. It just doesn't

Re: Faster UTF-8 decoding in GLib

2010-03-26 Thread Daniel Elstner
Hi Behdad, On Friday, 26.03.2010, at 13:25 -0400, Behdad Esfahbod wrote: > * The construct borrowed from glibmm, as beautiful as it is, is WRONG for > 6-byte-long UTF-8. It just doesn't work. We historically support those > sequences. What? In what way exactly is it wrong? --Daniel

Re: Faster UTF-8 decoding in GLib

2010-03-17 Thread Daniel Elstner
Hi, On Tuesday, 16.03.2010, at 23:51 +0100, Mathieu Lacage wrote: > loading offsets are usually randomized once in a while and the whole > system is prelinked with these randomized offsets so that all further > loads do use the same 'random' (per-machine) offset until the next > offset randomi

Re: Faster UTF-8 decoding in GLib

2010-03-17 Thread Daniel Elstner
Hi, On Wednesday, 17.03.2010, at 00:17 +0200, Mikhail Zabaluev wrote: > Yes, though we are already in the buffer overflow territory with all > implementations of g_utf8_get_char considered so far. Only read past the end, thus no security implications beyond a potential for DoS in the unlikely e

Re: Faster UTF-8 decoding in GLib

2010-03-16 Thread Daniel Elstner
Hi, On Tuesday, 16.03.2010, at 23:18 +0200, Mikhail Zabaluev wrote: > Umm. I had the conception of a DSO being one position-independent blob > with all references made relative, even if basic ELF allows different > segments loaded independently. Impossible. There are no relative function poi

Re: Faster UTF-8 decoding in GLib

2010-03-16 Thread Daniel Elstner
Hi, On Tuesday, 16.03.2010, at 22:52 +0200, Mikhail Zabaluev wrote: > I could try that, after I take your one to good internal use where it > already shows more effect. But my current tests do not account for any > hidden costs of inlining longish and branched code. Addendum: It's actually no

Re: Faster UTF-8 decoding in GLib

2010-03-16 Thread Daniel Elstner
Hi, On Tuesday, 16.03.2010, at 22:52 +0200, Mikhail Zabaluev wrote: > I already made some minor changes to restrict what it produces (like, > c & 0x3f is safer than c - 0x80), No -- this was on purpose! Using addition and subtraction here instead of bitwise-and and bitwise-or allows the two
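
The two expressions being compared, `c & 0x3f` and `c - 0x80`, can be put side by side in a small C sketch. The snippet above is cut off before it states the reason for preferring subtraction, so the sketch does not try to reproduce that argument; it only shows where the two forms agree and where they differ. The function names `payload_mask` and `payload_sub` are hypothetical and do not come from GLib or glibmm.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */

/* Masking keeps the low six bits, so the result is always 0..0x3F,
 * whatever byte is passed in. */
uint32_t payload_mask(uint8_t c)
{
    return c & 0x3Fu;
}

/* Subtraction yields the same six payload bits only when c is a valid
 * continuation byte (0x80..0xBF). For any other byte the result falls
 * outside 0..0x3F (and wraps around for c < 0x80). */
uint32_t payload_sub(uint8_t c)
{
    return (uint32_t)c - 0x80u;
}
```

For valid input the two are interchangeable; they diverge only on bytes that a valid UTF-8 string never has in a trailing position, which is the sense in which the masking variant was called "safer".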

Re: Faster UTF-8 decoding in GLib

2010-03-16 Thread Daniel Elstner
Hi, On Tuesday, 16.03.2010, at 21:05 +0200, Mikhail Zabaluev wrote: > I have tested your solution as applied to mainline g_utf8_get_char(), > and not inlined except any intra-file optimizations. The results are > for ARM this time. [...] > From the looks of it, some lesser oomph from the non-i

Re: Faster UTF-8 decoding in GLib

2010-03-16 Thread Daniel Elstner
Hi, On Tuesday, 16.03.2010, at 14:09 -0400, Behdad Esfahbod wrote: > That's one of the worst ideas as far as software goes. If an operation takes > 1% of your application time and you make it 1000 times faster, you know how > much total faster your application would run? 1.01x faster... Yes,
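
The figure in the quoted remark follows from Amdahl's law: if a fraction p of the run time is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). A minimal C sketch of that arithmetic, with `overall_speedup` as a hypothetical helper name:

```c
/* Hypothetical helper: overall speedup per Amdahl's law.
 * fraction: share of total run time spent in the optimized operation (0..1)
 * factor:   how many times faster that operation becomes */
double overall_speedup(double fraction, double factor)
{
    return 1.0 / ((1.0 - fraction) + fraction / factor);
}
```

With `fraction = 0.01` and `factor = 1000.0` the denominator is 0.99 + 0.00001 = 0.99001, giving roughly 1.0101, which matches the "1.01x" in the message.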

Re: Faster UTF-8 decoding in GLib

2010-03-16 Thread Daniel Elstner
Hi, On Tuesday, 16.03.2010, at 19:49 +0200, Mikhail Zabaluev wrote: > I'm wary of inlining non-trivial code which has some branching in it, > for the same reasons of cache pressure, killing branch prediction, and > so on. Well yes. That's why I would have liked numbers. :-) In any case, I e

Re: Faster UTF-8 decoding in GLib

2010-03-16 Thread Daniel Elstner
Hi, On Tuesday, 16.03.2010, at 19:38 +0200, Mikhail Zabaluev wrote: > 2010/3/16 Daniel Elstner: > > > Also, do you realize that you have just single-handedly introduced 256 (!) > > address references that need to be resolved by the dynamic linker at > > library load

Re: Faster UTF-8 decoding in GLib

2010-03-16 Thread Daniel Elstner
Hi, On Tuesday, 16.03.2010, at 13:01 -0400, Behdad Esfahbod wrote: > > > > I've made a glib branch where I tried to optimize the UTF-8 decoding > > routines: > > http://git.collabora.co.uk/?p=user/zabaluev/glib.git;a=shortlog;h=refs/heads/fast-utf8 > > Before any changes are made, can you pr

Re: Faster UTF-8 decoding in GLib

2010-03-16 Thread Daniel Elstner
Hi again, On Tuesday, 16.03.2010, at 18:47 +0200, Daniel Elstner wrote: > On Tuesday, 16.03.2010, at 17:20 +0200, Mikhail Zabaluev wrote: > > > The new code uses a table of unrolled functions to decode byte > > sequences, dispatched by the first character. g_utf8

Re: Faster UTF-8 decoding in GLib

2010-03-16 Thread Daniel Elstner
Hi, On Tuesday, 16.03.2010, at 17:20 +0200, Mikhail Zabaluev wrote: > I've made a glib branch where I tried to optimize the UTF-8 decoding routines: > http://git.collabora.co.uk/?p=user/zabaluev/glib.git;a=shortlog;h=refs/heads/fast-utf8 > > The new code uses a table of unrolled functions to

Re: GLib substr function

2009-06-26 Thread Daniel Elstner
On Friday, 10.04.2009, at 14:08 +0200, Christian Dywan wrote: > For the sake of demonstration, it took me 2 minutes to write a simple > substring function in C that does what you want, have a look how it > works. :) It doesn't. Your function allocates memory using a byte count but then uses t

Re: glib uses wrong prefix for base-2 units

2009-06-04 Thread Daniel Elstner
On Wednesday, 03.06.2009, at 23:10 -0700, Brian J. Tarricone wrote: > On 06/03/2009 05:36 PM, Paul LeoNerd Evans wrote: > > > Yes; we messed up 30 years ago and said "k" when we > > meant "Ki". Oops. Sorry about that. > > Well, no, 30 years ago there was no "Ki". So people did the logical > th

Re: FYI: better UTF8 decoder.

2009-04-29 Thread Daniel Elstner
On Monday, 13.04.2009, at 21:26 -0400, Behdad Esfahbod wrote: > On 04/13/2009 05:00 AM, Butrus Damaskus wrote: > > Hi! > > > > This page: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ claims to > > have a better (quicker and smaller?) UTF-8 decoder. Maybe it would be > > worth looking at? > > Fu

Re: GEvent - Proposal for a new threading structure for GLib

2006-11-17 Thread Daniel Elstner
On Friday, 17.11.2006, at 11:57 -0500, Matthew Barnes wrote: > On Fri, 2006-11-17 at 17:32 +0100, Murray Cumming wrote: > > This seems similar to a class we have in glibmm, Glib::Dispatcher: > > http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1Dispatcher.html#_details > > th