[Chicken-users] utf8 and string-ref performance
I'm reaching a point where my PEG parser, genturfa'i[1], is going to be ready to tag version 1.0. I have an issue that I've been putting off that I would like some input on.

If possible, I would like to parse utf8 input. I currently have utf8 enabled in my egg.

genturfa'i works by storing the entire input port in a string, and creating position objects to refer to the rest of the string as I parse. This means I need to perform the following:

1) reference a character by index
2) compare a character, string, or regular expression starting at an index.

Without utf8, step 1 is O(1). With utf8 enabled, that step becomes O(n). Step 2 is also more expensive with utf8, as I have to pay that same O(n) to get to the correct index, and I suspect I have some O(n*1) operations that become O(n*m), namely character class comparisons. I think that means step 2 becomes O(x*y*z) rather than O(1*y*1) (x=index, y=string comparison, z=character class comparison), but don't hold me to that!

I suspect there is a way to avoid the O(n) penalty for step 1, but I'm uncertain how to do it. There are some patterns to the way I index, all of which are contained in the position object in my code[2]:

1) I increment the index position by one character
2) I increment the index position by the length of the string I just succeeded at matching.

Essentially, I take characters off the front of the string as I parse, with the caveat that PEG parsers support full backtracking, so I sometimes retrieve previous position objects and work from there--I can't just throw away the prefix of the string I've matched.

Can anyone point me in the right direction?

Also, I'm not 100% sure what utf8 gets me, compared to treating the string like binary data. I suspect it would work if I had a utf8 input file and a utf8 string to match, but compared them as binary data. I couldn't compare a utf8 *character*, like #\¿, but I think I could compare "¿". (Those are both inverted question marks, if you don't have utf8 e-mail support.) Am I wrong about that?

Thank you for your help.

1: http://wiki.call-cc.org/eggref/4/genturfahi
2: http://bugs.call-cc.org/browser/release/4/genturfahi/trunk/lerfu-porsi.scm

-Alan
-- 
.i ko djuno fi le do sevzi
___
Chicken-users mailing list
Chicken-users@nongnu.org
http://lists.nongnu.org/mailman/listinfo/chicken-users
Re: [Chicken-users] utf8 and string-ref performance
On Wed, Nov 24, 2010 at 08:37:37AM -0700, Alan Post wrote:
> genturfa'i works by storing the entire input port in a string, and creating position objects to refer to the rest of the string as I parse. This means I need to perform the following:
>
> 1) reference a character by index
> 2) compare a character, string, or regular expression starting at an index.

Are you sure you need this? If I understood the sentence above the list correctly, it might be enough to use string->list and then work with a list of characters. This can be done pretty fast, and you can store pointers into arbitrary places of the input simply by storing the relevant cons cell.

> Without utf8, step 1 is O(1). With utf8 enabled, that step becomes O(n). Step 2 is also more expensive with utf8, as I have to pay that same O(n) to get to the correct index, and I suspect I have some O(n*1) operations that become O(n*m), namely character class comparisons.

utf8 uses the fantastic iset egg for dealing with character sets. Membership tests are as fast as O(1) for short ranges, and become O(log(n)) for longer ones, where n is the number of ranges it needs to store. My knowledge of complexity theory is a little rusty so I'm probably describing this in a very roundabout fashion, but the point is that you shouldn't worry about charset performance. It'll be slower than built-in srfi-14 (which is insanely fast - because it's a damn hack :P) but not by too much.

> Essentially, I take characters off the front of the string as I parse, with the caveat that PEG parsers support full backtracking, so I sometimes retrieve previous position objects and work from there--I can't just throw away the prefix of the string I've matched.

With cons cells you should be able to implement this efficiently enough. We don't have anything like string-pointers which can store arbitrary indices in a string AFAIK. That would be useful to have, I guess.

> Also, I'm not 100% sure what utf8 gets me, compared to treating the string like binary data. I suspect it would work if I had a utf8 input file and a utf8 string to match, but compared them as binary data. I couldn't compare a utf8 *character*, like #\¿, but I think I could compare "¿". (Those are both inverted question marks, if you don't have utf8 e-mail support.) Am I wrong about that?

No, that's correct. It's only when you start operating on the character level (or when splitting strings for example) that utf8 makes a difference.

HTH,
Peter
-- 
http://sjamaan.ath.cx
--
"The process of preparing programs for a digital computer is especially attractive, not only because it can be economically and scientifically rewarding, but also because it can be an aesthetic experience much like composing poetry or music."
	-- Donald Knuth
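To illustrate the point that literal matching needs no character-level decoding, here is a minimal C sketch. It is not code from genturfahi or the utf8 egg; the helper name `match_literal_at` is hypothetical. It assumes both the input buffer and the literal are UTF-8 byte sequences and that the offset is a byte offset, not a character index.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the lit_len bytes of lit occur in buf
 * starting at byte offset off, 0 otherwise. No decoding takes place;
 * two identical UTF-8 strings are identical byte sequences. */
static int match_literal_at(const char *buf, size_t buf_len, size_t off,
                            const char *lit, size_t lit_len)
{
    assert(off <= buf_len);
    if (lit_len > buf_len - off)
        return 0;                      /* literal would run past the end */
    return memcmp(buf + off, lit, lit_len) == 0;
}
```

Matching the two-byte literal "\xC2\xBF" (the inverted question mark) at byte offset 0 of a buffer that starts with those bytes succeeds, while a one-byte "?" at the same offset does not.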
Re: [Chicken-users] utf8 and string-ref performance
On Wednesday, 24.11.2010, 08:37 -0700, Alan Post wrote:
> Can anyone point me in the right direction?

I'll paste an example from my code, because an example is sometimes better than a long explanation.

If you were to keep the byte offset (called start in the code below) with your position objects, you get O(1) access to the next utf8 character.

(define utf8-seek0
  (foreign-lambda*
   integer
   ((scheme-object str)   ; the utf8 encoded object
    (integer sl)          ; its length in bytes
    (integer start)       ; byte offset into str
    (integer index)       ; utf8 char offset into str
    (integer pos)         ; seek-to position (>= index)
    )                     ; returns next start position
#<<EOF
  unsigned char *s=(unsigned char *)C_c_string(str);
  unsigned char *scan=s+start;
  unsigned char *limit=s+sl;
  if( index < pos) {
    while (index < pos) {
      if( scan >= limit ) {
        return(-1);
        /* raise_error( make3( TLREF(0), NIL_OBJ,
             make_string("index out of bounds"), int2fx(pos) ) ); */
      }
      ++index;
      if (*scan < 0x80) scan++;
      else if (*scan < 0xE0) scan+=2;
      else if (*scan < 0xF0) scan+=3;
      else if (*scan < 0xF8) scan+=4;
      else if (*scan < 0xFC) scan+=5;
      else if (*scan < 0xFE) scan+=6;
      else return(-2);
    }
    return(scan-s);
  } else if(index > pos) {
    int size=0;
    while( index > pos ) {
      if( s > limit ) { return(-1); }
      do {
        size++;
        limit--;
        if( s > limit || size > 6 ) { return(-2); }
      } while((*limit >= 0x80) && (*limit < 0xC0));
      index--;
    }
    return(limit-s);
  } else {
    return(start);
  }
EOF
))

(define (utf8-seek s start index pos)
  (let ((v (utf8-seek0 s (string-length s) start index pos)))
    (if (fx>= v 0) v
        (raise (case v
                 ((-1) "index out of bounds")
                 ((-2) "bad string")
                 (else 'utf8-seek))))))

Here we still pay the O(n) penalty, because we really require random access, and to the best of my knowledge there is no way around that, unless you were to parse the utf8 sequence into a vector of strings of one utf8 char each (which looks prohibitively expensive).

(define (utf8-substring str from to)
  (let ((start-offset (utf8-seek str 0 0 from)))
    (substring str start-offset (utf8-seek str start-offset from to))))

(define (utf8-string-ref str index)
  (utf8-string-getc str (utf8-seek str 0 0 index)))

;; Return the character at byte offset 'start' from the source string
;; (e.g., of a string port) and its length.
(define utf8-string-getc*
  (foreign-lambda*
   integer
   ((scheme-object str) (integer sl) (integer start) ((c-pointer integer) rsize))
#<<EOF
  unsigned char *s=(unsigned char *)C_c_string(str);
  unsigned char *scan=s+start;
  unsigned char *limit=s+sl;
  unsigned int i, size=1, ch;
  if (*scan < 0x80) ch=*scan;
  else if (*scan < 0xE0) {size=2; ch=*scan & 0x1F;}
  else if (*scan < 0xF0) {size=3; ch=*scan & 0x0F;}
  else if (*scan < 0xF8) {size=4; ch=*scan & 0x07;}
  else if (*scan < 0xFC) {size=5; ch=*scan & 0x3;}
  else if (*scan < 0xFE) {size=6; ch=*scan & 0x1;}
  else return(-1);
  /* ch=0, raise_error( make3( TLREF(0), NIL_OBJ,
       make_string("bad character size"), int2fx(scan-s) ) ); */
  if( scan++ + size > limit ) return(-2);
  /* raise_error( make3( TLREF(0), NIL_OBJ,
       make_string("short character"), int2fx(scan-s) ) ); */
  for(i=size-1; i ;--i) {
    if ((*scan<0x80) || (*scan >= 0xC0)) return(-3);
    /* raise_error( make3( TLREF(0), NIL_OBJ,
         make_string("bad byte"), int2fx(scan-s) ) ); */
    else { ch=(ch<<6) | (*scan++ & 0x3F); }
  }
  *rsize=size;
  return(ch);
EOF
))

(define (utf8-string-getc str start)
  (let-location ((size integer))
    (let ((c (utf8-string-getc* str (string-length str) start (location size))))
      (integer->char c))))

(define open-utf8-input-string
  (let ([make-input-port make-input-port]
        [string-length string-length])
    (lambda (str)
      (let ((index 0))
        (let-location ((size integer))
          (make-input-port
           (lambda ()                   ; read-char
             (if (fx< index (string-length str))
                 (let ((c (utf8-string-getc* str (string-length str) index (location size))))
                   (set! index (fx+ index size))
                   (integer->char c))
                 #!eof))
           (lambda ()                   ; char-ready?
             (fx< index (string-length str)))
           (lambda () #t)               ; close
           (lambda ()                   ; peek-char
             (if (fx< index (string-length str))
                 (let ((c (utf8-string-getc* str (string-length str) index (location size))))
                   (integer->char c))
                 #!eof))))))))

(define (call-with-utf8-input-string str proc)
  (proc (open-utf8-input-string str)))

(define open-utf8-output-string open-output-string)

(define (call-with-utf8-output-string proc)
  (let ((port (open-utf8-output-string)))
    (proc port)
    (close-output-port port)))

;; Tell the string index of a byte offset into a utf8 encoded string.
;; Reverse to utf8-seek.
(define utf8-tell0
  (foreign-lambda*
   integer
   ((scheme-object str) (integer sl) (integer start) (integer index) (integer pos))
#<<EOF
  unsigned char *s=(unsigned char *)C_c_string(str);
  unsigned char *scan=s+start;
  unsigned char *limit=scan+sl;
  if( s+pos > limit ) {
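The idea of carrying the byte offset inside the position object can be sketched on its own in plain C. This is a simplified illustration, not the code above: the names `u8pos`, `u8_seq_len` and `u8_advance` are hypothetical, the length table follows the four-byte limit of current UTF-8 rather than the six-byte forms accepted above, and continuation bytes are not validated.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical position object: byte offset plus character index. */
typedef struct {
    size_t byte_off;   /* offset in bytes into the UTF-8 buffer */
    size_t char_idx;   /* number of characters before byte_off  */
} u8pos;

/* Byte length of the sequence introduced by lead byte b,
 * or 0 if b cannot start a sequence. */
static size_t u8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b < 0xE0) return 2;
    if (b >= 0xE0 && b < 0xF0) return 3;
    if (b >= 0xF0 && b < 0xF5) return 4;
    return 0;
}

/* Advance *p by one character in O(1). Returns 1 on success, 0 at end of
 * input or on a bad or truncated lead byte; *p is unchanged on failure. */
static int u8_advance(const unsigned char *s, size_t len, u8pos *p)
{
    size_t n;
    assert(p->byte_off <= len);
    if (p->byte_off == len) return 0;
    n = u8_seq_len(s[p->byte_off]);
    if (n == 0 || n > len - p->byte_off) return 0;
    p->byte_off += n;
    p->char_idx += 1;
    return 1;
}
```

Because a position is a small value, backtracking in a PEG parser amounts to keeping a copy of an earlier `u8pos` and continuing from it; nothing has to be rescanned from the start of the buffer.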
Re: [Chicken-users] utf8 and string-ref performance
On Wed, Nov 24, 2010 at 05:05:18PM +0100, Peter Bex wrote:
> On Wed, Nov 24, 2010 at 08:37:37AM -0700, Alan Post wrote:
>> genturfa'i works by storing the entire input port in a string, and creating position objects to refer to the rest of the string as I parse. This means I need to perform the following:
>>
>> 1) reference a character by index
>> 2) compare a character, string, or regular expression starting at an index.
>
> Are you sure you need this? If I understood the sentence above the list correctly, it might be enough to use string->list and then work with a list of characters. This can be done pretty fast, and you can store pointers into arbitrary places of the input simply by storing the relevant cons cell.

I will play with this, as you're correct that I could work with lists rather than strings.

I'm using irregex for character class matching. It looks like I should be using srfi-14/utf8+iset instead. Do those work only on the character level, or am I missing a string version of those?

I see char-set-contains?, with which I can determine whether a character is in the class, but I usually want to compare several characters in a row, as in I want to match the input until something isn't in the character class.

-Alan
-- 
.i ko djuno fi le do sevzi
Re: [Chicken-users] utf8 and string-ref performance
On Wed, Nov 24, 2010 at 05:05:38PM +0100, Jörg F. Wittenberger wrote:
> On Wednesday, 24.11.2010, 08:37 -0700, Alan Post wrote:
>> Can anyone point me in the right direction?
>
> I'll paste an example from my code, because an example is sometimes better than a long explanation.
>
> If you were to keep the byte offset (called start in the code below) with your position objects, you get O(1) access to the next utf8 character.

The code is obviously a lot to digest. I'll thank you for providing it before I'm able to look at it, as I'd otherwise be sitting here a couple of days without you hearing from me!

Thank you,

-Alan
-- 
.i ko djuno fi le do sevzi
Re: [Chicken-users] utf8 and string-ref performance
On Wed, Nov 24, 2010 at 09:33:24AM -0700, Alan Post wrote:
> I'm using irregex for character class matching.

The Irregex in experimental is reasonably fast for charsets, giving O(log(n)) performance for charset membership checking. If the charset is continuous (i.e., with no gaps) it's actually O(1). It's much less efficient than iset on fragmented character sets, but on huge unbroken character sets it can be faster. It stores vectors of cons cells which hold the start/end ranges of subranges within the character set, whereas iset stores small bit-vectors for subranges, stored in a btree.

> It looks like I should be using srfi-14/utf8+iset instead. Do those work only on the character level, or am I missing a string version of those?

SRFI-14 is for dealing with characters.

> I see char-set-contains?, with which I can determine whether a character is in the class, but I usually want to compare several characters in a row, as in I want to match the input until something isn't in the character class.

Then irregex might actually be the best way to go about it, since that can compile matchers for charset overlaps in alternatives in a smart way.

Cheers,
Peter
-- 
http://sjamaan.ath.cx
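The range-vector representation described in this message can be illustrated with a short C sketch. This is not irregex's or iset's actual implementation; the type `cp_range` and the functions `charset_contains` and `charset_span` are hypothetical, and the input is assumed to be already decoded into code points. Membership is a binary search over sorted, non-overlapping ranges, which gives the O(log(n)) bound in the number of ranges, and the span function covers the "match until something isn't in the class" case.

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive range of code points. */
typedef struct { unsigned lo, hi; } cp_range;

/* Binary search over ranges sorted by lo with no overlaps.
 * Returns 1 if cp falls inside one of the n ranges, 0 otherwise. */
static int charset_contains(const cp_range *r, size_t n, unsigned cp)
{
    size_t lo = 0, hi = n;
    assert(r != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < r[mid].lo)      hi = mid;
        else if (cp > r[mid].hi) lo = mid + 1;
        else                     return 1;
    }
    return 0;
}

/* Number of leading code points of cps[0..len) that are in the set,
 * i.e. how far a character-class match extends from the start. */
static size_t charset_span(const cp_range *r, size_t n,
                           const unsigned *cps, size_t len)
{
    size_t i = 0;
    while (i < len && charset_contains(r, n, cps[i]))
        i++;
    return i;
}
```

A set consisting of a single unbroken range needs only one comparison pair, which matches the O(1) remark for continuous charsets.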
Re: [Chicken-users] utf8 and string-ref performance
From: Peter Bex <peter@xs4all.nl>
Subject: Re: [Chicken-users] utf8 and string-ref performance
Date: Wed, 24 Nov 2010 17:05:18 +0100

> With cons cells you should be able to implement this efficiently enough. We don't have anything like string-pointers which can store arbitrary indices in a string AFAIK. That would be useful to have, I guess.

How should that look? Would locatives be useful here?

cheers,
felix
Re: [Chicken-users] utf8 and string-ref performance
On Wed, Nov 24, 2010 at 07:13:10PM +0100, Felix wrote:
>> With cons cells you should be able to implement this efficiently enough. We don't have anything like string-pointers which can store arbitrary indices in a string AFAIK. That would be useful to have, I guess.
>
> How should that look? Would locatives be useful here?

I'm afraid this is just the shared substring/blob structure proposal in another guise. I don't know if locatives are useful; those can't really be kept around for a long time, can they?

Cheers,
Peter
-- 
http://sjamaan.ath.cx
Re: [Chicken-users] Re: [Chicken Gazette - Issue 13] - ##sys#:keyword:'s
From: Jörg F. Wittenberger <joerg.wittenber...@softeyes.net>
Subject: [Chicken-users] Re: [Chicken Gazette - Issue 13] - ##sys#:keyword:'s
Date: Mon, 22 Nov 2010 22:47:51 +0100

> On Monday, 22.11.2010, 21:22 +0100, Peter Bex wrote:
>> On Saturday another new thread (!) was started by Alan Post in which he reported a bug in Chicken's keyword argument handling. He created a ticket in Trac to help track this bug, but with the help of Alex and Felix he found out it was not a bug in Chicken but in his own code; `string->symbol` does not produce keyword objects even when the string ends with a colon. After he changed his code to use `string->keyword` everything worked as it should. Keywords can be confusing things: they're not quite the same as symbols because they're self-evaluating, yet `symbol?` returns `#t`.
>
> May I ask a simple question: what is the actual rationale behind keywords (wrt. symbols)?

They provide a syntactically and semantically distinct marker for things like argument lists.

> Are there any good references?

Unfortunately not. Others have pointed out DSSSL; you can also check out the bigloo and gambit documentation, but they don't provide much more than the Chicken manual.

> Could we do away with them?

Why?

> Should we? Boil them down to mere read syntax? ( 'x same as x: ?)

That would make them indistinguishable from normal symbols, and thus would make their usage more error-prone, I think.

cheers,
felix
Re: [Chicken-users] Chicken Gazette - Issue 13
On Tuesday, 23.11.2010, 09:19 +0100, Peter Bex wrote:
> On Tue, Nov 23, 2010 at 12:32:24AM +0100, Jörg F. Wittenberger wrote:
>>> == 2. Core development
>>>
>>> The scrutinizer was updated to give a warning when a one-armed `if` is used in tail-position, as suggested on chicken-users by Jörg Wittenberger.
>>
>> sure? or is -- no, wait, no git struggle now. spare me, please!
>
> Just wait for the next dev snapshot.
>
>> As I said with a smiley in the other posting: I would never enter into a religious war. Therefore I have to raise a flag in favour of the great SQLite data base! Right here! Why?: How does your PostgreSQL handle master-master replication?
>
> I've never had use for that so I don't really know. But this wiki page sounds hopeful:
> http://wiki.postgresql.org/wiki/Replication%2C_Clustering%2C_and_Connection_Pooling#Comparison_matrix

I did not find anything about details of the master-master replication there. Though the usual way is that all changes at any end are replicated to the other one.

So MiM will always corrupt your data base.

> I didn't know SQLite had any replication whatsoever at all. Or did you roll your own?

Well, I told you with a grin that this is a letter to the editor. So kind of a "me too".

Yes, I did. Using chicken. Did you read my linked example code?
http://www.askemos.org/Adc5dd0c30f6e63932811ed60e019bb2d/Kalender?date=2010-11-01

It's intended to show how easy it can be to use a replicated database (which is safe against the MiM attack).

Adding yet another replica is as complicated as filling the id into this form (screenshot):
http://www.askemos.org/Ab6c588dfa4ed826d7b387f19fbc60f10

> If so, you could do that with any database!

Maybe you could. SQLite was the only one for which I found a way to do it: hooking into its virtual file system interface. A one-man show of about 1500 LoC. (Without the actual replication code, which I already had before.)

>> What if I mount a man in the middle attack on one of your master replicas and inject fake update packets? Will I be able to tamper with your data base? Too bad for you!!! ;-)
>
> That's what SSL connections (with client certificates) are for.
>
> Cheers,
> Peter

Wait, security can be even stronger. What if a replica is rooted? Or you got an admin bribed? Or - as in my example code above - each replica owner has a different interest in the database content. Hence a reasonable fear of fraud.

/Jörg
Re: [Chicken-users] Chicken Gazette - Issue 13
On Wed, Nov 24, 2010 at 06:15:24PM +0100, Jörg F. Wittenberger wrote:
>> http://wiki.postgresql.org/wiki/Replication%2C_Clustering%2C_and_Connection_Pooling#Comparison_matrix
>
> I did not find anything about details of the master-master replication there. Though the usual way is that all changes at any end are replicated to the other one.

Well, that's what master-master replication is, isn't it?

> So MiM will always corrupt your data base.

You'd need some kind of MiM protection, or some in-database system which assigns trustworthiness to your data, but that seems like it would be an extension to simple replication. From what I've read some of those systems are based on triggers. I guess you could extend those triggers with your own custom weighing functions or whatever. I'm no expert in this subject matter, so I'm just guessing.

>> I didn't know SQLite had any replication whatsoever at all. Or did you roll your own?
>
> Well, I told you with a grin that this is a letter to the editor. So kind of a "me too".
>
> Yes, I did. Using chicken. Did you read my linked example code?
> http://www.askemos.org/Adc5dd0c30f6e63932811ed60e019bb2d/Kalender?date=2010-11-01

I keep getting connection refused from that server, so I can't check.

> Adding yet another replica is as complicated as filling the id into this form (screenshot):
> http://www.askemos.org/Ab6c588dfa4ed826d7b387f19fbc60f10

Again, connection refused.

>> If so, you could do that with any database!
>
> Maybe you could.

I have no need for such a system right now :) I just wanted to let you know it's unfair to cite it as an advantage of SQLite if you just hacked it on top, since the same could be done with postgres. Except that the work has already been done for sqlite, of course ;)

>> That's what SSL connections (with client certificates) are for.
>
> Wait, security can be even stronger. What if a replica is rooted? Or you got an admin bribed?

That's not exactly a classical MitM situation, is it? How do you deal with that now?

Cheers,
Peter
-- 
http://sjamaan.ath.cx
Re: [Chicken-users] handling the undefined value
From: Jörg F. Wittenberger <joerg.wittenber...@softeyes.net>
Subject: Re: [Chicken-users] handling the undefined value
Date: Mon, 22 Nov 2010 15:08:46 +0100

> Have a compiler switch (since it may break some code), which changes the code to return zero values instead of the distinguished undefined value.

I don't think this is a great idea: this will change the semantics of code using call-with-values, will be less efficient, and may throw errors in some cases - R5RS (in contrast to CL and R6RS) does not automatically adjust the number of result values to the number of values expected by the location where the result(s) is/are used.

cheers,
felix
Re: [Chicken-users] utf8 and string-ref performance
On Wed, Nov 24, 2010 at 7:37 AM, Alan Post <alanp...@sunflowerriver.org> wrote:
> If possible, I would like to parse utf8 input. I currently have utf8 enabled in my egg.
> [...]
> Can anyone point me in the right direction?

Parsing is generally one of the things you get for free with utf8. Probably the only thing you need to do is *remove* the reference to the utf8 egg and everything will work. The effect of this is that parsing will work on bytes instead of characters, but the results will be the same.

There may still be corner cases. If the API allows searching for individual characters, you need to check if they are non-ASCII and if so convert them into the relevant utf8 string.

Indexes on input and output would be in terms of byte position. If you want to make this char position you have to convert once each on input and output. That's O(n), so no effect on asymptotic performance.

-- 
Alex
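The one-time conversion from byte position to character position mentioned here can be sketched in C as follows. The function name `u8_char_index` is hypothetical, and the sketch assumes the buffer holds well-formed UTF-8: every byte that is not a continuation byte (bit pattern 10xxxxxx) starts a new character, so counting those bytes in the prefix gives the character index.

```c
#include <assert.h>
#include <stddef.h>

/* Number of characters contained in the first byte_off bytes of s,
 * assuming s is well-formed UTF-8 and byte_off lies on a character
 * boundary. Runs in O(byte_off). */
static size_t u8_char_index(const unsigned char *s, size_t byte_off)
{
    size_t i, n = 0;
    assert(s != NULL);
    for (i = 0; i < byte_off; i++)
        if ((s[i] & 0xC0) != 0x80)   /* not a continuation byte */
            n++;
    return n;
}
```

A parser that works on byte offsets internally would call something like this only when reporting a position to the user, for example in an error message, so the inner matching loop never pays for it.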
Re: [Chicken-users] utf8 and string-ref performance
On Wed, Nov 24, 2010 at 07:15:49PM +0100, Peter Bex wrote:
> On Wed, Nov 24, 2010 at 07:13:10PM +0100, Felix wrote:
>>> With cons cells you should be able to implement this efficiently enough. We don't have anything like string-pointers which can store arbitrary indices in a string AFAIK. That would be useful to have, I guess.
>>
>> How should that look? Would locatives be useful here?
>
> I'm afraid this is just the shared substring/blob structure proposal in another guise. I don't know if locatives are useful; those can't really be kept around for a long time, can they?

Weeks ago, when I posted the thread about using mmap, it was in relation to this problem: I'm looking for the best performance I can get in the inner loop of my parser, where I'm matching against the input buffer.

-Alan
-- 
.i ko djuno fi le do sevzi
Re: [Chicken-users] utf8 and string-ref performance
From: Peter Bex <peter@xs4all.nl>
Subject: Re: [Chicken-users] utf8 and string-ref performance
Date: Wed, 24 Nov 2010 19:15:49 +0100

> On Wed, Nov 24, 2010 at 07:13:10PM +0100, Felix wrote:
>>> With cons cells you should be able to implement this efficiently enough. We don't have anything like string-pointers which can store arbitrary indices in a string AFAIK. That would be useful to have, I guess.
>>
>> How should that look? Would locatives be useful here?
>
> I'm afraid this is just the shared substring/blob structure proposal in another guise. I don't know if locatives are useful; those can't really be kept around for a long time, can they?

Sorry, I don't understand? They are not invalidated by GC (in case you mean that).

cheers,
felix
[Chicken-users] Can an egg have a library and executable with the same name?
My egg, genturfa'i, has an executable and a library. I've named the library genturfahi and the executable genturfahi-peg.

I'd rather name the executable genturfahi too, though I suspect I'm not able to do that. Is this true? If it isn't, can someone point me to an egg that has a library and an executable named after the egg?

-Alan
-- 
.i ko djuno fi le do sevzi
Re: [Chicken-users] New eggs: npdiff, format-textdiff
Hi Ivan,

It works now. Thanks!
One very small request: diff --unified format.

Best,
Daishi

At Fri, 19 Nov 2010 13:42:20 +0900, Ivan Raikov wrote:
> Hello,
>
> Thanks for trying to use format-textdiff. The problem below was actually caused by a bug in npdiff, which has been fixed in npdiff release 1.13. Please update your copy of npdiff and try again. Let me know if you encounter any other issues with those eggs.
>
> -Ivan
>
> Daishi Kato <dai...@axlight.com> writes:
>
>> Hi all,
>>
>> Anybody using format-textdiff? I encountered the following problem:
>>
>> CHICKEN
>> (c)2008-2010 The Chicken Team
>> (c)2000-2007 Felix L. Winkelmann
>> Version 4.6.1
>> linux-unix-gnu-x86 [ manyargs dload ptables ]
>> compiled 2010-09-25 on lobule (Linux)
>>
>> #;1> (use format-textdiff)
>> #;2> (textdiff (with-input-from-string "abc\n" read-lines) (with-input-from-string "def\n" read-lines))
>>
>> Error: bad argument count - received 4 but expected 6: #<procedure>
>>
>> Call history:
>>
>> ##sys#call-with-values
>> vector-lib#check-index
>> values
>> make-vector
>> vector->list
>> ##sys#call-with-values
>> vector-lib#check-index
>> values
>> make-vector
>> vector->list <--