Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)
On Friday 22 November 2002 16:38, you wrote:

Zoran,

Here's a reproducible example of what I'm talking about:

wats:nscp 75 encoding system
iso8859-1
wats:nscp 76 set u ¾ÆÆ®¹Ìµð¾î
¾ÆÆ®¹Ìµð¾î
wats:nscp 77 set u
¾ÆÆ®¹Ìµð¾î
wats:nscp 78 regexp {^(.*)$} $u junk m
1
wats:nscp 79 set m
¾ÃƮ¹̵ð¾î

See how $m isn't the same as $u?

Using my encoding-aware nscp:

lexxsrv:nscp 1 set u ¾ÆÆ®¹Ìµð¾î
¾ÆÆ®¹Ìµð¾î
lexxsrv:nscp 2 set u
¾ÆÆ®¹Ìµð¾î
lexxsrv:nscp 3 regexp {^(.*)$} $u junk m
1
lexxsrv:nscp 4 set m
¾ÆÆ®¹Ìµð¾î

$m is the same as $u.

Any idea what I'm doing wrong?

As already posted to the list: nscp trashes the encoding.

But the initial question was about the memory corruption, wasn't it? Somehow you got complaints from the memory allocator which I'm not able to reproduce. Now, what was the corrective measure again? It had something to do with nsv_* arrays, didn't it?

Cheers
Zoran
Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)
On 2002.11.23, Zoran Vasiljevic [EMAIL PROTECTED] wrote:

Using my encoding-aware nscp: [...] $m is the same as $u

Very cool! Is there a reason why we wouldn't want your changes to nscp checked into CVS?

Any idea what I'm doing wrong?
As already posted to the list: the nscp trashes encoding.

Right. It appears the ADP processor does, too. As a matter of fact, it sounds like a lot of places might be doing it ... (I suspect the DB interface also does it, but I haven't checked.)

But, the initial question was about the memory corruption, wasn't it? Somehow you got complaints from the memory allocator which I'm not able to reproduce. Now, what was the corrective measure again? It had something to do with nsv_* arrays, didn't it?

The initial error I reported was:

alloc: invalid block: ff2bb898: ff 70 0
alloc: invalid block: ff2bb898: ff 70 0

There were probably 200,000 nsv arrays created, using about 300 MB of memory. I'm only guessing at these numbers -- I didn't have the opportunity to take actual counts before the thing croaked.

After I modified my application to periodically nsv_unset arrays when they are no longer needed, the error went away and the server hasn't crashed since. I might have just been hitting some limit somewhere with the number of nsv's ...

-- Dossy

--
Dossy Shiobara mail: [EMAIL PROTECTED]
Panoptic Computer Network web: http://www.panoptic.com/
He realized the fastest way to change is to laugh at your own folly -- then you can let go and quickly move on. (p. 70)
[AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)
(The following is a message I sent to Zoran off-list, but I figured folks from the list might already know the answer, so I'm sending it to the list as well.)

Zoran,

Here's a reproducible example of what I'm talking about:

wats:nscp 75 encoding system
iso8859-1
wats:nscp 76 set u ¾ÆÆ®¹Ìµð¾î
¾ÆÆ®¹Ìµð¾î
wats:nscp 77 set u
¾ÆÆ®¹Ìµð¾î
wats:nscp 78 regexp {^(.*)$} $u junk m
1
wats:nscp 79 set m
¾ÃƮ¹̵ð¾î

See how $m isn't the same as $u?

Also, the default encoding is iso8859-1 ... is this the problem? Doesn't appear to be:

wats:nscp 9 encoding system
iso8859-1
wats:nscp 10 encoding system utf-8
wats:nscp 11 encoding system
utf-8
wats:nscp 12 set u {¾ÆÆ®¹Ìµð¾î}
¾ÆÆ®¹Ìµð¾î
wats:nscp 13 regexp {^(.*)$} $u junk m
1
wats:nscp 14 ¾ÃƮ¹̵ð¾î

Same behavior.

Any idea what I'm doing wrong?

-- Dossy
Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)
Another interesting behavior:

wats:nscp 20 encoding system
utf-8
wats:nscp 21 set u
¾ÆÆ®¹Ìµð¾î
wats:nscp 22 set m
¾ÃƮ¹̵ð¾î
wats:nscp 23 string compare $u $m
0

Not what I would have expected.

-- Dossy

On 2002.11.22, Dossy [EMAIL PROTECTED] wrote:

(The following is a message I sent to Zoran off-list, but I figured folks from the list might already know the answer, so I'm sending it to the list as well.)

Zoran,

Here's a reproducible example of what I'm talking about:

wats:nscp 75 encoding system
iso8859-1
wats:nscp 76 set u ¾ÆÆ®¹Ìµð¾î
¾ÆÆ®¹Ìµð¾î
wats:nscp 77 set u
¾ÆÆ®¹Ìµð¾î
wats:nscp 78 regexp {^(.*)$} $u junk m
1
wats:nscp 79 set m
¾ÃƮ¹̵ð¾î

See how $m isn't the same as $u?

Also, the default encoding is iso8859-1 ... is this the problem? Doesn't appear to be:

wats:nscp 9 encoding system
iso8859-1
wats:nscp 10 encoding system utf-8
wats:nscp 11 encoding system
utf-8
wats:nscp 12 set u {¾ÆÆ®¹Ìµð¾î}
¾ÆÆ®¹Ìµð¾î
wats:nscp 13 regexp {^(.*)$} $u junk m
1
wats:nscp 14 ¾ÃƮ¹̵ð¾î

Same behavior.

Any idea what I'm doing wrong?

-- Dossy
Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)
+-- On Nov 22, Dossy said:
Any idea what I'm doing wrong?

You're typing iso8859-1 into nscp. nscp doesn't use a Tcl channel for input, so it does no charset translation on that input. Hence the system encoding is irrelevant. You must only send UTF-8 to nscp, and you'll only get UTF-8 back.
Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)
On 2002.11.22, Rob Mayoff [EMAIL PROTECTED] wrote:

+-- On Nov 22, Dossy said:
Any idea what I'm doing wrong?

You're typing iso8859-1 into nscp. nscp doesn't use a Tcl channel for input, so it does no charset translation on that input. Hence the system encoding is irrelevant. You must only send UTF-8 to nscp, and you'll only get UTF-8 back.

This doesn't make sense. How do you explain this:

wats:nscp 75 encoding system
iso8859-1
wats:nscp 76 set u ¾ÆÆ®¹Ìµð¾î
¾ÆÆ®¹Ìµð¾î
wats:nscp 77 set u
¾ÆÆ®¹Ìµð¾î
wats:nscp 78 regexp {^(.*)$} $u junk m
1
wats:nscp 79 set m
¾ÃƮ¹̵ð¾î

$u is getting set to what I'd expect it to, but $m isn't.

Also, this is reproducible in an ADP page as well. (Actually, that's where the problem I was seeing originally started -- I've just distilled it down via nscp so I could demonstrate what I was seeing in my actual code.)

Funny enough, tclsh8.4 does the right thing:

% set tcl_patchLevel
8.4.0
% encoding system
iso8859-1
% set u {¾ÆÆ®¹Ìµð¾î}
¾ÆÆ®¹Ìµð¾î
% regexp {^(.*)$} $u junk m
1
% set m
¾ÆÆ®¹Ìµð¾î

-- Dossy
Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)
+-- On Nov 22, Dossy said:
This doesn't make sense. How do you explain this: [deletia] $u is getting set to what I'd expect it to, but $m isn't.

Tcl stores strings internally in UTF-8. Sometimes it converts strings to UCS-2 (16-bit characters), for example to do regexp matching, and then converts them back to UTF-8. Tcl is careful to make sure it uses only UTF-8 internally by translating all input, via the channel mechanism, to UTF-8. AOLserver blows that care away by handing non-UTF-8 strings to Tcl via C interfaces that were only intended to receive UTF-8. (This is exactly what nscp is doing.)

Tcl doesn't look at or modify the contents of the string unless it has reason to. So if you don't do anything to the string via Tcl, nscp gets the string back unchanged, and sends it to you over a raw socket (not a Tcl channel), so you see it unchanged. Hence the $u behavior.

As soon as you start to manipulate the string, especially when you do so using something like regexp (which converts the string to UCS-2), you're likely to generate garbage, because the functions that manipulate UTF-8 strings are operating on non-UTF-8 strings. Hence the $m behavior.

BTW, this is exactly the same problem that I described in http://dqd.com/~mayoff/encoding-doc.html two years ago.

+-- On Nov 22, Dossy said:
Also, this is reproducible in an ADP page as well. (Actually, that's where the problem I was seeing originally started -- I've just distilled it down via nscp so I could demonstrate what I was seeing in my actual code.)

Same thing. The ADP processor doesn't honor the C API's UTF-8 requirements, so sometimes you get garbage.

+-- On Nov 22, Dossy said:
Funny enough, tclsh8.4 does the right thing:

Of course. tclsh reads the input via a Tcl channel, so it does charset translation. As I said, nscp doesn't use a Tcl channel and does no charset translation.
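The distinction drawn above can be tried in a plain tclsh, without AOLserver. This is only a sketch: the ten byte values are an assumption, read off the ISO-8859-1 rendering of the string in the transcripts, and `encoding convertfrom` / `encoding convertto` stand in for the translation a Tcl channel would perform. The last step models, rather than reproduces, what happens when the same bytes are taken to be UTF-8 already.

```tcl
# Assumed input: the ten bytes behind the string typed in the transcripts.
set raw [binary format H* bec6c6aeb9ccb5f0beee]

# Input side of a channel: decode external bytes into Tcl characters.
set u [encoding convertfrom iso8859-1 $raw]

# regexp now operates on ten well-formed characters and captures them all.
regexp {^(.*)$} $u junk m
puts [string length $u]      ;# 10
puts [string equal $u $m]    ;# 1

# Output side of a channel: encode back to the external charset.
set out [encoding convertto iso8859-1 $m]
puts [string equal $out $raw]    ;# 1

# Model of the nscp situation: the same bytes taken as if they were UTF-8.
# Some Tcl releases decode leniently, others raise an error, so handle both.
if {[catch {encoding convertfrom utf-8 $raw} bad]} {
    puts "utf-8 decoder rejected the bytes: $bad"
} else {
    puts [string equal $bad $u]  ;# 0: not the characters that were typed
}
```

When the bytes are decoded with the encoding they were really typed in, the captured $m matches $u and the round trip back to bytes is lossless.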
Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)
On Friday 22 November 2002 16:38, you wrote:

Any idea what I'm doing wrong?

I will double-check this here, but I have to agree with Rob. The nscp channel is NOT encoding-aware. You should not interpret results (test, draw conclusions, etc.) based on typing into nscp alone. I have an encoding-aware modified nscp which I use when debugging such issues.

If anybody has the answer ready, so much the better. If not, I'll try to see what's wrong.

Cheers
Zoran
Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)
In a message dated 11/22/2002 11:26:08 AM Eastern Standard Time, [EMAIL PROTECTED] writes:

Any idea what I'm doing wrong?

I will double-check this here, but I have to agree with Rob. The nscp channel is NOT encoding-aware. You should not interpret results (test, draw conclusions, etc.) based on typing into nscp alone. I have an encoding-aware modified nscp which I use when debugging such issues. If anybody has the answer ready, so much the better. If not, I'll try to see what's wrong.

Agree - it's nscp's very simple read code, which evals strings directly without converting them to utf8. I suppose we could assume latin1 input and convert to utf8, or perhaps provide a command to set the nscp encoding.

-Jim
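A script-level sketch of the idea floated above: convert each incoming line from an assumed external encoding before evaluating it, and convert the result back before sending it. nscp's read code is C, so this is only an illustration of the data flow. The proc name `eval_external_line`, its default of iso8859-1, and the byte values are all hypothetical.

```tcl
# Hypothetical helper, not part of nscp: evaluate one line that arrived as
# raw bytes in a known external encoding, and return the result as raw bytes.
proc eval_external_line {rawLine {enc iso8859-1}} {
    # Input side: external bytes -> Tcl's internal representation.
    set script [encoding convertfrom $enc $rawLine]
    set result [uplevel #0 $script]
    # Output side: internal representation -> external bytes.
    return [encoding convertto $enc $result]
}

# Assumed input: the ten Latin-1 bytes behind the string in the transcripts.
set typed [binary format H* bec6c6aeb9ccb5f0beee]

# "set u <bytes>" arrives as raw bytes, just as it would over the socket.
eval_external_line [binary format a*a* "set u " $typed]

# The regexp capture now comes back as the same ten bytes that were typed.
set reply [eval_external_line {regexp {^(.*)$} $u junk m; set m}]
puts [string equal $reply $typed]    ;# 1
```

Making the encoding a parameter corresponds to the second suggestion, a command to set the nscp encoding, while the default corresponds to assuming latin1.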
Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)
In a message dated 11/22/2002 11:22:20 AM Eastern Standard Time, [EMAIL PROTECTED] writes:

BTW, this is exactly the same problem that I described in http://dqd.com/~mayoff/encoding-doc.html two years ago.

...which, btw, is the guide I used to add encoding support to AOLserver 3.4 and 4.0. It's a very good review of these issues - I recommend folks read it through. Note that I didn't implement everything Rob had done - I just added the basics, which work for ADP if file extension mappings are set up properly. A good first step for the core team could be to rationalize this stuff.

-Jim
Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)
In a message dated 11/22/02 11:36:17 AM Eastern Standard Time, [EMAIL PROTECTED] writes:

Agree - it's nscp's very simple read code which eval's strings directly without converting to utf8. I suppose we could assume latin1 input and convert to utf8 or perhaps provide a command to set the nscp encoding. -Jim

Probably need to touch the nscp output as well, to avoid cryptic utf8 sequences on terminals that don't interpret them.

Mark
Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)
The nscp channel is NOT encoding-aware. You should not

Agree - it's nscp's very simple read code which eval's strings directly without converting to utf8. I suppose we could assume latin1 input and convert to utf8 or perhaps provide a command to set the nscp encoding.

Tcl has APIs for this (Tcl_ExternalToUtf* and Tcl_UtfToExternal*). I've not used nscp, so I've never seen this issue, but it's fairly easy to see that Dossy is just getting back the original string encoded in utf-8 (the A-hats give it away).

Jeff Hobbs
The Tcl Guy
Senior Developer
http://www.ActiveState.com/
Tcl Support and Productivity Solutions
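The "A-hat" remark can be checked with the script-level counterparts of those C APIs, `encoding convertfrom` and `encoding convertto`. This is a sketch under an assumption: the ten byte values are inferred from the ISO-8859-1 rendering of the string in the transcripts. Encoding each of those characters as UTF-8 yields a two-byte sequence whose first byte is 0xC2 or 0xC3, which a Latin-1 terminal displays as Â or Ã.

```tcl
# Assumed input: the ten Latin-1 bytes behind the string in the transcripts.
set raw   [binary format H* bec6c6aeb9ccb5f0beee]
set chars [encoding convertfrom iso8859-1 $raw]

# Encode the ten characters as UTF-8 and inspect the resulting bytes.
set utf8 [encoding convertto utf-8 $chars]
binary scan $utf8 H* hex
puts [string length $utf8]    ;# 20
puts $hex                     ;# c2bec386c386c2aec2b9c38cc2b5c3b0c2bec3ae

# Viewed as Latin-1 again, every character has gained a leading A-hat:
# count the 0xC2 and 0xC3 bytes.
set shown [encoding convertfrom iso8859-1 $utf8]
puts [regexp -all {[\u00c2\u00c3]} $shown]    ;# 10
```

Ten characters become twenty bytes, one lead byte per character, which is the doubling pattern visible when UTF-8 output is shown on a terminal that does not interpret it.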