Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)

2002-11-23 Thread Zoran Vasiljevic
On Friday 22 November 2002 16:38, you wrote:

 Zoran,

 Here's a reproducible example of what I'm talking about:

 wats:nscp 75 encoding system
 iso8859-1
 wats:nscp 76 set u ¾ÆÆ®¹Ìµð¾î
 ¾ÆÆ®¹Ìµð¾î
 wats:nscp 77 set u
 ¾ÆÆ®¹Ìµð¾î
 wats:nscp 78 regexp {^(.*)$} $u junk m
 1
 wats:nscp 79 set m
 ¾ÆƮ¹̵ð¾î

 See how $m isn't the same as $u?


Using my encoding-aware nscp:

lexxsrv:nscp 1 set u ¾ÆÆ®¹Ìµð¾î
¾ÆÆ®¹Ìµð¾î
lexxsrv:nscp 2 set u
¾ÆÆ®¹Ìµð¾î
lexxsrv:nscp 3 regexp {^(.*)$} $u junk m
1
lexxsrv:nscp 4 set m
¾ÆÆ®¹Ìµð¾î

$m is the same as $u


 Any idea what I'm doing wrong?

As already posted to the list: the nscp trashes encoding.

But the initial question was about the memory corruption,
wasn't it? Somehow you got complaints from the memory
allocator, which I'm not able to reproduce. Now, what was
the corrective measure again? It had something to do
with nsv_* arrays, didn't it?

Cheers
Zoran



Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)

2002-11-23 Thread Dossy
On 2002.11.23, Zoran Vasiljevic [EMAIL PROTECTED] wrote:

 Using my encoding-aware nscp:
[...]
 $m is the same as $u

Very cool!  Is there a reason why we wouldn't want your changes to
nscp checked into CVS?

  Any idea what I'm doing wrong?

 As already posted to the list: the nscp trashes encoding.

Right.  It appears the ADP processor does, too.  Matter of fact,
it sounds like a lot of places might be doing it ... (I suspect
the DB interface also does it, but I haven't checked.)

 But the initial question was about the memory corruption,
 wasn't it? Somehow you got complaints from the memory
 allocator, which I'm not able to reproduce. Now, what was
 the corrective measure again? It had something to do
 with nsv_* arrays, didn't it?

The initial error reported was:

alloc: invalid block: ff2bb898: ff 70 0
alloc: invalid block: ff2bb898: ff 70 0

There were probably 200,000 nsv arrays created, using about 300 MB
of memory.  I'm only guessing at these numbers -- I didn't have the
opportunity to take actual counts before the thing croaked.

After modifying my application to periodically nsv_unset the arrays when
they are no longer needed, the error has gone away and the server hasn't
crashed since.  I might have just been hitting some limit somewhere with
the number of nsv's ...

-- Dossy

--
Dossy Shiobara   mail: [EMAIL PROTECTED]
Panoptic Computer Network web: http://www.panoptic.com/
  He realized the fastest way to change is to laugh at your own
folly -- then you can let go and quickly move on. (p. 70)



[AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)

2002-11-22 Thread Dossy
(The following is a message I sent to Zoran off-list, but I figured
folks from the list might already know the answer, so I'm sending it to
the list as well.)

Zoran,

Here's a reproducible example of what I'm talking about:

wats:nscp 75 encoding system
iso8859-1
wats:nscp 76 set u ¾ÆÆ®¹Ìµð¾î
¾ÆÆ®¹Ìµð¾î
wats:nscp 77 set u
¾ÆÆ®¹Ìµð¾î
wats:nscp 78 regexp {^(.*)$} $u junk m
1
wats:nscp 79 set m
¾ÆƮ¹̵ð¾î

See how $m isn't the same as $u?

Also, the default encoding is iso8859-1 ... is this the problem?
Doesn't appear to be:

wats:nscp 9 encoding system
iso8859-1
wats:nscp 10 encoding system utf-8

wats:nscp 11 encoding system
utf-8
wats:nscp 12 set u {¾ÆÆ®¹Ìµð¾î}
¾ÆÆ®¹Ìµð¾î
wats:nscp 13 regexp {^(.*)$} $u junk m
1
wats:nscp 14 set m
¾ÆƮ¹̵ð¾î

Same behavior.

Any idea what I'm doing wrong?

-- Dossy




Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)

2002-11-22 Thread Dossy
Another interesting behavior:

wats:nscp 20 encoding system
utf-8
wats:nscp 21 set u
¾ÆÆ®¹Ìµð¾î
wats:nscp 22 set m
¾ÆƮ¹̵ð¾î
wats:nscp 23 string compare $u $m
0

Not what I would have expected.

-- Dossy





Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)

2002-11-22 Thread Rob Mayoff
+-- On Nov 22, Dossy said:
 Any idea what I'm doing wrong?

You're typing iso8859-1 into nscp. nscp doesn't use a Tcl channel for
input, so it does no charset translation on that input. Hence the system
encoding is irrelevant. You must only send UTF-8 to nscp, and you'll
only get UTF-8 back.
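
To make "send UTF-8" concrete, here is a minimal sketch in plain tclsh (not nscp itself) of the translation a Tcl channel would normally do on input. The byte values are an assumption: they are what an ISO 8859-1 terminal would send for the first two characters of the test string.

```tcl
# Assumed bytes an ISO 8859-1 terminal sends for the two characters "¾Æ".
set latin1Bytes [binary format H* bec6]

# Decode the external bytes into a proper Tcl string (two characters).
set text [encoding convertfrom iso8859-1 $latin1Bytes]

# Encode that string as UTF-8, the form nscp passes through untouched.
set utf8Bytes [encoding convertto utf-8 $text]

binary scan $utf8Bytes H* hex
puts "chars: [string length $text], utf-8 bytes: $hex"
# prints: chars: 2, utf-8 bytes: c2bec386
```

The two Latin-1 bytes become four UTF-8 bytes, so the raw bytes typed at the terminal are not valid UTF-8 as they stand.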



Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)

2002-11-22 Thread Dossy
On 2002.11.22, Rob Mayoff [EMAIL PROTECTED] wrote:
 +-- On Nov 22, Dossy said:
  Any idea what I'm doing wrong?

 You're typing iso8859-1 into nscp. nscp doesn't use a Tcl channel for
 input, so it does no charset translation on that input. Hence the system
 encoding is irrelevant. You must only send UTF-8 to nscp, and you'll
 only get UTF-8 back.

This doesn't make sense.  How do you explain this:

wats:nscp 75 encoding system
iso8859-1
wats:nscp 76 set u ¾ÆÆ®¹Ìµð¾î
¾ÆÆ®¹Ìµð¾î
wats:nscp 77 set u
¾ÆÆ®¹Ìµð¾î
wats:nscp 78 regexp {^(.*)$} $u junk m
1
wats:nscp 79 set m
¾ÆƮ¹̵ð¾î

$u is getting set to what I'd expect it to, but $m isn't.

Also, this is reproducible in an ADP page as well.  (Actually,
that's where the problem I was seeing originally started -- I've
just distilled it down via nscp so I could demonstrate what I
was seeing in my actual code.)

Funny enough, tclsh8.4 does the right thing:

% set tcl_patchLevel
8.4.0
% encoding system
iso8859-1
% set u {¾ÆÆ®¹Ìµð¾î}
¾ÆÆ®¹Ìµð¾î
% regexp {^(.*)$} $u junk m
1
% set m
¾ÆÆ®¹Ìµð¾î

-- Dossy




Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)

2002-11-22 Thread Rob Mayoff
+-- On Nov 22, Dossy said:
 This doesn't make sense.  How do you explain this:

[deletia]

 $u is getting set to what I'd expect it to, but $m isn't.

Tcl stores strings internally in UTF-8. Sometimes it converts strings to
UCS-2 (16-bit characters), for example to do regexp matching, and then
converts them back to UTF-8. Tcl is careful to make sure it uses only
UTF-8 internally by translating all input, via the channel mechanism, to
UTF-8.

AOLserver blows that care away by handing non-UTF-8 strings to Tcl via
C interfaces that were only intended to receive UTF-8. (This is exactly
what nscp is doing.) Tcl doesn't look at or modify the contents of the
string unless it has reason to. So if you don't do anything to the
string via Tcl, nscp gets the string back unchanged, and sends it to you
over a raw socket (not a Tcl channel), so you see it unchanged.  Hence
the $u behavior.

As soon as you start to manipulate the string, especially when you do
so using something like regexp (which converts the string to UCS-2),
you're likely to generate garbage, because the functions that manipulate
UTF-8 strings are operating on non-UTF-8 strings. Hence the $m behavior.
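
The effect described above can be sketched in plain tclsh. This is a rough emulation under stated assumptions, not Tcl's actual decoder: lenientDecode is a hypothetical helper that treats a well-formed two-byte UTF-8 sequence as one character and passes every other byte through as the character with the same code point. The input bytes are assumed to be what an ISO 8859-1 terminal sent for the test string.

```tcl
# Hypothetical helper, not Tcl's real decoder: a well-formed two-byte UTF-8
# sequence becomes one character; every other byte is passed through as the
# character with the same code point. Longer sequences are not handled.
proc lenientDecode {bytes} {
    binary scan $bytes c* signed
    set codes {}
    foreach b $signed {
        lappend codes [expr {$b & 0xff}]   ;# make each byte unsigned
    }
    set out ""
    set n [llength $codes]
    for {set i 0} {$i < $n} {incr i} {
        set b [lindex $codes $i]
        set next -1
        if {$i + 1 < $n} {
            set next [lindex $codes [expr {$i + 1}]]
        }
        if {$b >= 0xC2 && $b <= 0xDF && $next >= 0x80 && $next <= 0xBF} {
            # Lead byte plus continuation byte: combine their payload bits.
            append out [format %c [expr {(($b & 0x1F) << 6) | ($next & 0x3F)}]]
            incr i
        } else {
            append out [format %c $b]
        }
    }
    return $out
}

# Assumed raw bytes of the test string as typed on an ISO 8859-1 terminal.
set raw [binary format H* bec6c6aeb9ccb5f0beee]
set decoded [lenientDecode $raw]

set codePoints {}
foreach ch [split $decoded ""] {
    scan $ch %c cp
    lappend codePoints [format U+%04X $cp]
}
puts "[string length $decoded] characters: $codePoints"
# prints: 8 characters: U+00BE U+00C6 U+01AE U+00B9 U+0335 U+00F0 U+00BE U+00EE
```

Ten input bytes come out as eight characters, because the byte pairs C6 AE and CC B5 happen to form valid two-byte sequences (U+01AE and U+0335). That is consistent with the $m output quoted in this thread, where two pairs of characters have collapsed.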

BTW, this is exactly the same problem that I described in
http://dqd.com/~mayoff/encoding-doc.html two years ago.

 Also, this is reproducible in an ADP page as well.  (Actually,
 that's where the problem I was seeing originally started -- I've
 just distilled it down via nscp so I could demonstrate what I
 was seeing in my actual code.)

Same thing. The ADP processor doesn't honor the C API's UTF-8
requirements, so sometimes you get garbage.

 Funny enough, tclsh8.4 does the right thing:

Of course.  tclsh reads the input via a Tcl channel, so it does charset
translation.  As I said, nscp doesn't use a Tcl channel and does no
charset translation.



Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)

2002-11-22 Thread Zoran Vasiljevic
On Friday 22 November 2002 16:38, you wrote:


 Any idea what I'm doing wrong?


I will double-check this here, but I have to agree with Rob.
The nscp channel is NOT encoding-aware. You should not draw
conclusions (or run tests) based on typing into nscp alone.
I have a modified, encoding-aware nscp which I use when
debugging such issues.
If anybody has the answer ready, all the better. If not, I'll
try to see what's wrong.

Cheers
Zoran



Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)

2002-11-22 Thread Jim Davidson
In a message dated 11/22/2002 11:26:08 AM Eastern Standard Time, [EMAIL PROTECTED] writes:


  Any idea what I'm doing wrong?


 I will double-check this here, but I have to agree with Rob.
 The nscp channel is NOT encoding-aware. You should not draw
 conclusions (or run tests) based on typing into nscp alone.
 I have a modified, encoding-aware nscp which I use when
 debugging such issues.
 If anybody has the answer ready, all the better. If not, I'll
 try to see what's wrong.



Agreed - it's nscp's very simple read code, which evals strings directly without converting them to UTF-8. I suppose we could assume Latin-1 input and convert it to UTF-8, or perhaps provide a command to set the nscp encoding.

-Jim


Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)

2002-11-22 Thread Jim Davidson
In a message dated 11/22/2002 11:22:20 AM Eastern Standard Time, [EMAIL PROTECTED] writes:

 BTW, this is exactly the same problem that I described in
 http://dqd.com/~mayoff/encoding-doc.html two years ago.


...which, btw, is the guide I used to add encoding support to AOLserver 3.4 and 4.0. It's a very good review of these issues - I recommend folks read through it. Note that I didn't implement everything Rob had done - I just added the basics, which work for ADP if file extension mappings are set up properly. A good first step for the core team could be to rationalize this stuff.

-Jim


Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)

2002-11-22 Thread No Name
In a message dated 11/22/02 11:36:17 AM Eastern Standard Time, [EMAIL PROTECTED] writes:



 Agreed - it's nscp's very simple read code, which evals strings directly without converting them to UTF-8. I suppose we could assume Latin-1 input and convert it to UTF-8, or perhaps provide a command to set the nscp encoding.

 -Jim


We'd probably need to touch the nscp output as well, to avoid cryptic UTF-8 sequences on terminals that don't interpret them.

Mark


Re: [AOLSERVER] high ASCII in regexp (AOLserver 3.5.1 tcl8.4.1)

2002-11-22 Thread Jeff Hobbs
  The nscp channel is NOT encoding-aware. You should not [...]

 Agreed - it's nscp's very simple read code, which evals strings
 directly without converting them to UTF-8.  I suppose we could assume
 Latin-1 input and convert it to UTF-8, or perhaps provide a command to
 set the nscp encoding.

Tcl has APIs for this (Tcl_ExternalToUtf* and Tcl_UtfToExternal*).
I've not used nscp, so I've never seen this issue, but it's
fairly easy to see that Dossy is just getting back the result of
the original string encoded in UTF-8 (the A-hats give it away).
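
As an illustration only, the conversion on both sides of the control port can be sketched at script level with encoding convertfrom and encoding convertto, the Tcl commands that do the same job as those C APIs. evalExternal is a hypothetical name, the default of iso8859-1 is an assumption about the client's terminal, and this is not nscp's actual code.

```tcl
# Hypothetical read/eval/print step for an encoding-aware control port.
# rawLine is the byte string read from the socket; enc names the encoding
# the client's terminal is assumed to use.
proc evalExternal {rawLine {enc iso8859-1}} {
    # Input side: external bytes -> Tcl string.
    set script [encoding convertfrom $enc $rawLine]
    set result [uplevel #0 $script]
    # Output side: Tcl string -> external bytes for the terminal.
    return [encoding convertto $enc $result]
}

# Assumed ISO 8859-1 bytes of the test string from this thread.
set input [binary format H* bec6c6aeb9ccb5f0beee]

# Equivalent of typing "set u <string>" and then the regexp at the prompt.
evalExternal [binary format a*a* "set u " $input]
set reply [evalExternal {regexp {^(.*)$} $u junk m; set m}]

binary scan $reply H* replyHex
puts $replyHex
# prints: bec6c6aeb9ccb5f0beee
```

With the conversion applied in both directions, the bytes coming back for $m match the bytes that went in, which is the behavior Zoran shows with his modified nscp.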

  Jeff Hobbs The Tcl Guy
  Senior Developer   http://www.ActiveState.com/
  Tcl Support and Productivity Solutions