[AOLSERVER] BIG problem with Tcl8.4.(1|2)

2003-03-26 Thread Zoran Vasiljevic
Hi !

After some days... (yes, it is *always* very costly and difficult)
I've pinpointed a large hole in Tcl8.4.(1|2) which effectively
generates bogus OS paths, corrupts memory and otherwise impairs
the AOLserver (or any other MT-enabled application).

The problem is in Tcl's generic/tclIOUtil.c and its naive handling
of static Tcl_Obj *cwdPathPtr.  The pointer to this Tcl object
gets shuffled around between threads by simple reference, it is
read (referenced) without proper locks, etc.
The implementor protected the most obvious write
operations, but neglected the others. Also, Rule #1 in
Tcl, "Do not pass Tcl_Obj's between threads", is grossly violated.
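
To illustrate the locking discipline in question, here is a minimal
sketch in plain C with POSIX threads. It is not Tcl's code: cached_cwd,
cwd_lock, cwd_cache_set and cwd_cache_get are hypothetical names, and a
plain char* stands in for the Tcl_Obj. The point is only that reads take
the same lock as writes and hand out a private copy, so no thread keeps
a pointer into storage that another thread may free.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a process-wide cached cwd (not Tcl's real code). */
static char *cached_cwd = NULL;
static pthread_mutex_t cwd_lock = PTHREAD_MUTEX_INITIALIZER;

/* Writer: replace the cached path while holding the lock. Returns 0 on success. */
int cwd_cache_set(const char *path)
{
    assert(path != NULL);
    char *copy = strdup(path);
    if (copy == NULL) {
        return -1;
    }
    pthread_mutex_lock(&cwd_lock);
    char *old = cached_cwd;
    cached_cwd = copy;
    pthread_mutex_unlock(&cwd_lock);
    free(old);  /* safe: readers only ever receive private copies */
    return 0;
}

/* Reader: return a private copy (caller frees), or NULL if unset or out of memory. */
char *cwd_cache_get(void)
{
    char *copy = NULL;
    pthread_mutex_lock(&cwd_lock);
    if (cached_cwd != NULL) {
        copy = strdup(cached_cwd);
    }
    pthread_mutex_unlock(&cwd_lock);
    return copy;
}
```

Copying on every read costs an allocation, which is the price paid here
for never sharing the underlying buffer between threads.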

I must see how we can solve this. I'm afraid that we'd need
to rewrite some parts of the above file and bump to Tcl8.4.3
or such.

How come nobody has noticed this so far?
Well, the problem starts displaying itself if you ever change
the current directory of the process *after* the Tcl has been
initialized. You need not do [cd] explicitly; some internal
Tcl code does that on your behalf as well.

As for AOLserver 4.0: we need Tcl 8.4 or higher.
So, I'm afraid we will not be able to go to production
before fixing this ugly thing.


Cheers,
Zoran


--
AOLserver - http://www.aolserver.com/
To Remove yourself from this list: http://www.aolserver.com/listserv.html
List information and options: http://listserv.aol.com/


Re: [AOLSERVER] BIG problem with Tcl8.4.(1|2)

2003-03-26 Thread Vlad Seryakov
Yes, I am currently fighting the same problem: under heavy load,
when I make many calls to the Tcl [file] command, AOLserver crashes
consistently in
static void FreeFsPathInternalRep(pathObjPtr),
but I do not do [cd], just [file mtime,stat,file].
When I do not use the Tcl [file] command, everything works fine.

--
Vlad Seryakov
703 961-5433 office
[EMAIL PROTECTED]
http://www.crystalballinc.com/vlad/


Re: [AOLSERVER] BIG problem with Tcl8.4.(1|2)

2003-03-26 Thread Tom Jackson
Zoran Vasiljevic wrote:

> How come nobody has noticed this so far?
> Well, the problem starts displaying itself if you ever change
> the current directory of the process *after* the Tcl has been
> initialized. You need not do [cd] explicitly; some internal
> Tcl code does that on your behalf as well.

I got chastised badly for using cd in AOLserver. Something like: "Wake
up, this is an MT app."  What internal code uses cd? Is it generally a
bad thing to do, aside from the actual bug it uncovered?
--Tom Jackson



Re: [AOLSERVER] BIG problem with Tcl8.4.(1|2)

2003-03-26 Thread Zoran Vasiljevic
On Wednesday 26 March 2003 19:15, you wrote:
> Yes, I am currently fighting the same problem: under heavy load,
> when I make many calls to the Tcl [file] command, AOLserver crashes
> consistently in
> static void FreeFsPathInternalRep(pathObjPtr),
> but I do not do [cd], just [file mtime,stat,file].
>
> When I do not use the Tcl [file] command, everything works fine.
>

See? That's what I say. You need not call [cd] directly.
It gets called under your feet.

Cheers,
Zoran




Re: [AOLSERVER] BIG problem with Tcl8.4.(1|2)

2003-03-26 Thread Zoran Vasiljevic
On Wednesday 26 March 2003 19:12, you wrote:
> Zoran Vasiljevic wrote:
> >How come nobody has noticed this so far?
> >Well, the problem starts displaying itself if you ever change
> >the current directory of the process *after* the Tcl has been
> >initialized. You need not do [cd] explicitly; some internal
> >Tcl code does that on your behalf as well.
>
> I got chastised badly for using cd in AOLserver. Something like: "Wake
> up, this is an MT app."  What internal code uses cd? Is it generally a
> bad thing to do, aside from the actual bug it uncovered?
>

You can't just avoid it. Tcl uses [cd] internally in some places,
when doing recursive directory deletions and such. Some other
code (not yours) might do this too, for example to resolve
a relative file path. The list of possible hidden uses is long.

I won't say that it is a bad thing to [cd]. You just have to
know this and program accordingly (i.e. never use relative
paths, for example). But, often, you can't avoid it.
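
As a small illustration of the "never use relative paths" advice, here
is a sketch in C that turns a name into an absolute path against a base
directory the application recorded once (for instance at startup), so
later lookups do not depend on whatever the process cwd happens to be.
abs_path and its parameters are hypothetical names, and Unix-style "/"
separators are assumed.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write an absolute path for `name` into `out`.
 * A name that already starts with '/' is copied unchanged; anything else
 * is joined onto `base`. Returns 0 on success, -1 if `out` is too small.
 */
int abs_path(const char *base, const char *name, char *out, size_t outlen)
{
    assert(base != NULL && name != NULL && out != NULL);

    int is_abs = (name[0] == '/');
    /* Bytes needed, including the '/' separator and the trailing NUL. */
    size_t need = is_abs ? strlen(name) + 1
                         : strlen(base) + 1 + strlen(name) + 1;
    if (need > outlen) {
        return -1;
    }
    if (is_abs) {
        snprintf(out, outlen, "%s", name);
    } else {
        snprintf(out, outlen, "%s/%s", base, name);
    }
    return 0;
}
```

This only protects your own path handling; code that changes the cwd
behind your back still affects anything that passes relative names to
the OS.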

The problem with the described bug is that it also corrupts
memory, which is far more dangerous. Ah, wrong OS paths are
dangerous as well... Your process does not break, but you
might affect some unwanted files/data. Hm, a bad thing really.

Cheers,
Zoran




Re: [AOLSERVER] BIG problem with Tcl8.4.(1|2)

2003-03-26 Thread Andrew Piskorski
On Wed, Mar 26, 2003 at 07:19:08PM +0100, Zoran Vasiljevic wrote:
> > Zoran Vasiljevic wrote:

> > >Well, the problem starts displaying itself if you ever change
> > >the current directory of the process *after* the Tcl has been
> > >initialized. You need not do [cd] explicitly; some internal
> > >Tcl code does that on your behalf as well.

> You can't just avoid it. Tcl uses [cd] internally in some places,
> when doing recursive directory deletions and such. Some other
> code (not yours) might do this too, for example to resolve
> a relative file path. The list of possible hidden uses is long.

> The problem with the described bug is that it also corrupts
> memory which is far more dangerous. Ah, wrong OS paths are

What if, in an ideal world, all uses of "cd" were made
transparently MT-safe by making all tracking of pwd (current working
directory) state use thread local storage rather than a process-wide
environment variable?  I have not looked at the sources to see how
feasible or infeasible this would be, but, questions:

1. Are there ANY cases where a TLS (thread local storage) cd/pwd would
be the WRONG thing?  E.g., could we reasonably expect there to be any
cases where one thread A does "pwd", gets "/home/my", does "cd /foo",
and then thread B does "pwd" and EXPECTS to get "/foo" back as the
result?  Any such cases in Tcl?  In AOLserver?  In libgcc?  Anywhere
in any C libraries whatsoever?

2. At what level would this hypothetical TLS cd/pwd implementation
need to be inserted?  Are there OS-provided library functions that
themselves use cd/pwd, and would thus need to be overridden or
otherwise worked around?  Or do Tcl and AOLserver have their own entry
points to all C function calls using cd/pwd, and we could implement
the TLS stuff there?

--
Andrew Piskorski <[EMAIL PROTECTED]>
http://www.piskorski.com




Re: [AOLSERVER] BIG problem with Tcl8.4.(1|2)

2003-03-27 Thread Zoran Vasiljevic
On Wednesday 26 March 2003 20:03, you wrote:

> 1. Are there ANY cases where a TLS (thread local storage) cd/pwd would
> be the WRONG thing?  E.g., could we reasonably expect there to be any
> cases where one thread A does "pwd", gets "/home/my", does "cd /foo",
> and then thread B does "pwd" and EXPECTS to get "/foo" back as the
> result?  Any such cases in Tcl?  In AOLserver?  In libgcc?  Anywhere
> in any C libraries whatsoever?

I'm not sure I understood.
The scenario you've used (thread A/B) is what *is* usually happening.
Thread B gets "/foo" and this is what is expected.
But this is not the source of the problem. The source of the problem
is that Tcl's internal code remembers the process-wide current working
directory in a global Tcl object and neglects to properly lock access
to this object for ALL operations, in addition to blindly passing the
same object to other threads.

We'd need to talk to the Tcl FS implementor(s) and get them to abandon
the object storage in favour of a simple char*, or put the object
into the TSD.
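
A rough sketch of what "put it into the TSD" could look like, written
against POSIX threads rather than Tcl's own TSD API so that it stands
alone. Every name here (tracked_chdir, tracked_getcwd, cwd_epoch,
struct cwd_tsd) is hypothetical and this is not how Tcl is implemented:
each thread keeps a private cached copy of the cwd, and a shared epoch
counter, bumped under a lock on every chdir made through the wrapper,
tells a thread when its copy is stale.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define CWD_BUF 4096  /* assumed upper bound on path length for this sketch */

struct cwd_tsd {
    unsigned long epoch;   /* epoch at which `path` was last refreshed */
    char path[CWD_BUF];    /* this thread's private copy of the cwd */
};

static pthread_mutex_t epoch_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long cwd_epoch = 1;  /* bumped on every successful tracked_chdir */
static pthread_key_t tsd_key;
static pthread_once_t tsd_once = PTHREAD_ONCE_INIT;

static void tsd_init(void)
{
    int rc = pthread_key_create(&tsd_key, free);  /* slot is freed at thread exit */
    assert(rc == 0);
    (void)rc;
}

/* chdir() wrapper: changes the process cwd and invalidates all per-thread copies. */
int tracked_chdir(const char *dir)
{
    pthread_mutex_lock(&epoch_lock);
    int rc = chdir(dir);
    if (rc == 0) {
        cwd_epoch++;
    }
    pthread_mutex_unlock(&epoch_lock);
    return rc;
}

/* Return this thread's cached cwd, refreshing it if stale. NULL on failure. */
const char *tracked_getcwd(void)
{
    pthread_once(&tsd_once, tsd_init);
    struct cwd_tsd *t = pthread_getspecific(tsd_key);
    if (t == NULL) {
        t = calloc(1, sizeof *t);  /* epoch 0: the new slot starts out stale */
        if (t == NULL || pthread_setspecific(tsd_key, t) != 0) {
            free(t);
            return NULL;
        }
    }
    const char *result = t->path;
    pthread_mutex_lock(&epoch_lock);
    if (t->epoch != cwd_epoch) {
        if (getcwd(t->path, sizeof t->path) != NULL) {
            t->epoch = cwd_epoch;
        } else {
            t->epoch = 0;
            result = NULL;
        }
    }
    pthread_mutex_unlock(&epoch_lock);
    return result;
}
```

The returned pointer belongs to the calling thread and must not be
handed to another one. A chdir that bypasses tracked_chdir (for example
one made inside a library) is not seen by the epoch counter, so this
covers only the calls you control.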

> 2. At what level would this hypothetical TLS cd/pwd implementation
> need to be inserted?  Are there OS-provided library functions that
> themselves use cd/pwd, and would thus need to be overridden or
> otherwise worked around?  Or do Tcl and AOLserver have their own entry
> points to all C function calls using cd/pwd, and we could implement
> the TLS stuff there?

Tcl has all FS code abstracted in the form of pluggable filesystems.
So you never actually go down to the OS level without passing through
N layers of Tcl code. The whole thing is pretty frightening (just go
and look into generic/tclIOUtil.c and you will see why) but it does
give you a powerful abstraction.

Again, I'm sorry if I misunderstood your question(s). I think that
you can't protect yourself from the effects of [cd]. You will,
however, need to know (from the docs, I suspect) whether some code
changes the cwd, and if it does, you must take care to put
locks around it, if necessary for your app.
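
One way to "put locks around it" in your own code, sketched in C with
POSIX threads. with_cwd, chdir_lock and cwd_equals are hypothetical
names. The helper takes a single process-wide mutex, remembers the
current directory as an open descriptor, switches to the requested
directory, runs a callback, and switches back before releasing the lock.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical lock shared by all of the application's cwd-changing code. */
static pthread_mutex_t chdir_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Run fn(arg) with the process cwd temporarily set to `dir`.
 * Returns fn's result, or -1 if the switch or the restore failed.
 */
int with_cwd(const char *dir, int (*fn)(void *), void *arg)
{
    assert(dir != NULL && fn != NULL);
    int rc = -1;

    pthread_mutex_lock(&chdir_lock);
    int saved = open(".", O_RDONLY);  /* remember where we were */
    if (saved >= 0) {
        if (chdir(dir) == 0) {
            rc = fn(arg);
            if (fchdir(saved) != 0) {
                rc = -1;              /* could not restore the old cwd */
            }
        }
        close(saved);
    }
    pthread_mutex_unlock(&chdir_lock);
    return rc;
}

/* Example callback: returns 1 if the cwd equals the expected path, else 0. */
int cwd_equals(void *expected)
{
    char buf[4096];
    if (getcwd(buf, sizeof buf) == NULL) {
        return 0;
    }
    return strcmp(buf, (const char *)expected) == 0;
}
```

For example, with_cwd("/", cwd_equals, "/") runs the check while the
cwd is "/" and leaves the process where it started. The lock helps only
if every piece of code that changes or depends on the cwd takes it;
code that calls chdir() directly is not covered.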

I think it would be good to know what exact spots in Tcl might be
potential [cd] doers. I will try to grep the code and see what
is done where.


Cheers,
Zoran




Re: [AOLSERVER] BIG problem with Tcl8.4.(1|2)

2003-03-27 Thread Andrew Piskorski
On Thu, Mar 27, 2003 at 10:15:33AM +0100, Zoran Vasiljevic wrote:
> On Wednesday 26 March 2003 20:03, you wrote:
>
> > 1. Are there ANY cases where a TLS (thread local storage) cd/pwd would
> > be the WRONG thing?  E.g., could we reasonably expect there to be any
> > cases where one thread A does "pwd", gets "/home/my", does "cd /foo",
> > and then thread B does "pwd" and EXPECTS to get "/foo" back as the
> > result?  Any such cases in Tcl?  In AOLserver?  In libgcc?  Anywhere
> > in any C libraries whatsoever?
>
> I did not understand.
> The scenario you've used (thread A/B) is what *is* usually happening.
> The thread B gets "/foo" and this is what is expected.

Yes, of course, and it can often break things, right?  Most users of
cd/pwd seem to want to neglect the fact that the pwd state is
per-process not per-thread, so perhaps it would be simpler just to
give them that, thus side-stepping the locking issues.

Therefore, assume a TLS cd/pwd implementation, where each and every
thread has its OWN independent current working directory.  Does any
code break because of this?  In other words, is there any code
anywhere that both (a) properly treats pwd as per-process, not
per-thread, and (b) DEPENDS on the sharing of pwd across threads for
its correct behavior?

> But, this is not the source of the problem. The source of the problem
> is that Tcl internal code remembers the process-wide current working
> directory in a global Tcl object and neglects to properly lock access
> to this object for ALL operations, in addition to blindly passing the
> same object to other threads.

And if that object were TLS, per-thread, then the error could not
occur.  It might cause OTHER problems, which is what my questions
above were getting at, and it might plain not be feasible to
implement.  But a TLS pwd would solve the "blindly passing a
process-wide object to other threads" problem above, because it
wouldn't be process-wide anymore, it'd be per-thread.

Clearly from what you say it's at least a bit more complicated than
that though.  I probably can't contribute any more usefully to the
discussion until/unless I go look at the Tcl core code you're talking
about...

> We'd need to talk to the Tcl FS implementor and get him/them to abandon
> the object storage in favour of simple char* or put the object
> into the TSD.

What's the TSD?

> Again, I'm sorry if I misunderstood your question(s). I think that
> you can't protect yourself from the effects of the [cd]. You will,

I wasn't trying to protect myself, I was hypothesizing about possible
designs/implementations for fixing the problem in Tcl.  :)

The fact that some system calls apparently change pwd and then change
it back when they complete is really obnoxious, because that means
that for proper mt-safe operation, those system calls MUST expose a
mutex which any of YOUR pwd-changing code will explicitly lock and
unlock as well.  This is a general obnoxious feature of multi-threaded
programming with multiple libraries - a given mutex sometimes must be
seen and used across ALL libraries, not just within one library.

I've run into that before and solved it by overriding library
functions with my own mt-safe TLS versions of those functions, which
let me avoid having to try to share a single mutex across multiple
non-cooperating libraries.  Thus I immediately wondered about the
possibility of using TLS to solve the cd/pwd bugs you described...

> I think it would be good to know what exact spots in Tcl might be
> potential [cd] doers. I will try to grep the code and see what

Definitely, that would be good to know.

--
Andrew Piskorski <[EMAIL PROTECTED]>
http://www.piskorski.com




Re: [AOLSERVER] BIG problem with Tcl8.4.(1|2)

2003-03-27 Thread Jim Wilcoxson
It seems that if people wanted to use generally-available TCL packages
inside AOLserver, it'd be nice if cd worked and was per-thread.  Otherwise,
even if the TCL internals are fixed to lock around the temporary cd
fiddling they do in realpath or wherever else, ordinary TCL packages
still won't work in a threaded environment if they use cd.  We've
run into this problem ourselves, when we had to bring external TCL
scripts into the web server, for coordination & scheduling purposes.

Jim



Re: [AOLSERVER] BIG problem with Tcl8.4.(1|2)

2003-03-27 Thread Rob Mayoff
+-- On Mar 27, Andrew Piskorski said:
> And if that object were TLS, per-thread, then the error could not
> occur.  It might cause OTHER problems, which is what my questions
> above were getting at, and it might plain not be feasible to
> implement.  But a TLS pwd would solve the "blindly passing a
> process-wide object to other threads" problem above, because it
> wouldn't be process-wide anymore, it'd be per-thread.

Perhaps you do not realize that a process's current working directory
is tracked by the kernel, not by the process. Tcl keeps track of its
CWD for speed, but ultimately it's the kernel, not the process, that
resolves relative pathnames, so it's the kernel's idea of the CWD that
matters.

I believe that POSIX requires that all threads in a process share a
working directory. Making each thread appear to have its own working
directory requires either non-standard kernel support for per-thread
CWD (which Linux has, but I don't think you can get to it through the
pthreads interface), or intercepting every system call that involves
a pathname (open, link, symlink, unlink, rename, access, stat, lstat,
chdir, chroot, chmod, chown, lchown, mknod, mkdir, rmdir, bind, connect,
and probably some more that I've forgotten). You might be able to
ignore some of these for AOLserver, but intercepting any of them isn't
necessarily easy, and it's definitely not possible to do so portably.

It still might be the best way to fix this problem, though.
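
For what the interception approach might look like in practice, here is
a minimal sketch that uses openat(), an interface standardized in
POSIX.1-2008 (so later than this thread). Each thread keeps its own
"working directory" as an open directory descriptor, and files are
opened relative to it without ever touching the process-wide cwd.
thread_chdir, thread_open and thread_dirfd are hypothetical names, and
_Thread_local requires a C11 compiler.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Per-thread "working directory"; AT_FDCWD means "use the process cwd". */
static _Thread_local int thread_dirfd = AT_FDCWD;

/* Per-thread replacement for chdir(): relative names resolve against the
 * thread's current directory. Returns 0 on success, -1 on failure. */
int thread_chdir(const char *dir)
{
    assert(dir != NULL);
    int fd = openat(thread_dirfd, dir, O_RDONLY | O_DIRECTORY);
    if (fd < 0) {
        return -1;                 /* keep the previous directory on failure */
    }
    if (thread_dirfd != AT_FDCWD) {
        close(thread_dirfd);
    }
    thread_dirfd = fd;
    return 0;
}

/* Per-thread replacement for open(): relative paths resolve against the
 * calling thread's directory, absolute paths behave as usual. */
int thread_open(const char *path, int flags, mode_t mode)
{
    assert(path != NULL);
    return openat(thread_dirfd, path, flags, mode);
}
```

This covers only calls routed through the wrappers. Every other
pathname call in the list above would need the same treatment, and any
library that calls open() or chdir() directly still sees the single
process-wide cwd.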

> What's the TSD?

Thread-specific data.  Same thing as TLS.

> The fact that some system calls apparently change pwd and then change
> it back when they complete is really obnoxious,

No. Some library calls change the CWD. The only system calls that change
the CWD are chdir() and fchdir().




Re: [AOLSERVER] BIG problem with Tcl8.4.(1|2)

2003-03-28 Thread Zoran Vasiljevic
On Thursday 27 March 2003 22:28, you wrote:

> Yes, of course, and it can often break things, right?  Most users of
> cd/pwd seem to want to neglect the fact that the pwd state is
> per-process not per-thread, so perhaps it would be simpler just to
> give them that, thus side-stepping the locking issues.
>

Eh, the cwd is what most path-related
sys/lib calls use to resolve a relative path into an absolute one.
It is tracked in the kernel, not in the process, so in order
to make this happen, you would have to intercept *all* of the
sys/lib calls fiddling with paths.
Now, Tcl with its virtual filesystem *might* achieve this, since
it really isolates the upper layers from the OS-specifics.
But, if you ask me, I think this is voodoo.

> Therefore, assume a TLS cd/pwd implementation, where each and every
> thread has its OWN independent current working directory.  Does any
> code break because of this?  In other words, is there any code
> anywhere that both, a), properly treats pwd as per-process not
> per-thread, and b), DEPENDS on the sharing of pwd across threads for
> its correct behavior?

In an ideal world, this is correct. It holds true until somebody
gives you some extension which makes OS calls directly, not using
your own notion of cwd. This is when things start to
get very interesting (and cost a lot of time/money to track down).
I know this per-thread working directory thing is very appealing,
but it should have been designed that way by the standard and/or
the kernel makers.  If we try to emulate this at a very high level,
we'll end up in a mess, sooner or later.

>
> What's the TSD?
>

It is the term used throughout the Tcl sources to refer to
thread specific data. In AOLserver, people talk about
thread local storage which is essentially the same thing.


> I wasn't trying to protect myself, I was hypothesizing about possible
> designs/implementations for fixing the problem in Tcl.  :)
>

To be honest, I was also playing with this idea, but after giving
it serious thought, I've abandoned it.


> The fact that some system calls apparently change pwd and then change
> it back when they complete is really obnoxious, because that means
> that for proper mt-safe operation, those system calls MUST expose a
> mutex which any of YOUR pwd-changing code will explicitly lock and
> unlock as well.  This is a general obnoxious feature of multi-threaded
> programming with multiple libraries - a given mutex sometimes must be
> seen and used across ALL libraries, not just within one library.
>

Yes, some library calls, like realpath() on Mac OS X (see my other posting),
do this. They call the chdir/fchdir syscalls without you knowing it,
possibly interfering with your own code in a different thread.
This drove me nuts for a couple of days while porting our app to Darwin.
Yet there is no word about that in the manpage!
I think this is one area where the standard-makers must do some more work.
As it stands now, I'm grepping through all the code, looking for obvious
calls that might change the cwd (chdir/fchdir). This covers most of the cases.
But there are hidden ones, and you just can't fight them until they
rear their ugly heads.
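
Once such a call is known, one possible containment is to serialize it
against the rest of the application's cwd-sensitive code. A minimal
sketch in C with POSIX threads follows; locked_realpath and cwd_mutex
are hypothetical names, and it assumes all of your own chdir and
relative-path code takes the same mutex.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical lock that the application's own cwd-sensitive code also takes. */
static pthread_mutex_t cwd_mutex = PTHREAD_MUTEX_INITIALIZER;

/*
 * realpath() serialized against other holders of cwd_mutex.
 * With resolved == NULL the result is malloc'd and the caller frees it
 * (POSIX.1-2008 behaviour). Returns NULL on failure, like realpath().
 */
char *locked_realpath(const char *path, char *resolved)
{
    assert(path != NULL);
    pthread_mutex_lock(&cwd_mutex);
    char *result = realpath(path, resolved);
    pthread_mutex_unlock(&cwd_mutex);
    return result;
}
```

This does nothing for threads that never take the lock, so it is a
mitigation for known offenders rather than a general fix.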

I think I will discuss this issue with some Tcl people and try to produce
a list of Tcl functions which might directly or indirectly change your
working directory.
Funny thing about Mac OS X: even the innocent [pwd] changed the current
dir! This is really crazy!

Cheers,
Zoran

