Re: Multi Threaded programs deadlock doing simple I/O operations

2005-06-19 Thread Mark Pizzolato

On Sunday, June 12, 2005 at 5:37 PM, Mark Pizzolato wrote:

On Friday, June 10, 2005 at 3:44 PM, Mark Pizzolato wrote:

> On Thursday, June 09, 2005 at 6:12 PM, Mark Pizzolato wrote:
>> On Thursday, June 09, 2005 at 3:35 PM, Christopher Faylor wrote:
>> > On Wed, Jun 08, 2005 at 05:43:59PM -0700, Mark Pizzolato wrote:
>> > >There is a serious problem for multi threaded programs doing simple
>> > >I/O operations in cygwin (open, dup, fdopen, fclose, and close).
>> > >
>> > >The attached 81 line test program clearly demonstrates the issue
>> > >(by hanging and no longer consuming CPU or performing any I/O
>> > >operations).
>> >
>> > Thanks for the relatively small test case.  That was enough to track
>> > the problem down.  I'm generating a new snapshot with a fix for this
>> > problem.
>>
>> The snapshot looks good!
>>
>> This fixes the stability problems with clamav's clamd that I've been
>> chasing for a long time.
>
> Some more follow up here...I'm running with the 20050609 snapshot dll.
>
> clamav's clamd now runs better than it has ever for me on cygwin.
>
>   until "it doesn't",
>
> once it starts to run poorly it won't run cleanly again until I reboot
> the system
> (I haven't actually tried after merely exiting all processes ..)


Well, I spoke too soon here.  There may be some interaction with many
recently closed TCP sessions sitting in TIME_WAIT.  I'm not sure, but
after some time, I can restart and experience apparently good behavior and
then things get "poor" as described.


If I run with the 20050607 snapshot, the new "poor" behavior doesn't 
happen, while the test program I provided earlier in this thread hangs as 
described. So, the fix to the original problem and the new "poor" behavior 
are clearly related to changes between the 20050607 and the 20050609 
snapshots.



> To be more specific about the "poor" behavior:
>
>
> - pthread_mutex_unlock fails leaving errno with a value of 90.  This is
> in a place where there is only one path through about a dozen lines of
> code and the mutex is definitely locked.  There may have been a call to
> pthread_create, and a definite call to pthread_cond_signal.
> - once the above error happens, calls (by the same thread) to accept()
> fail using a file descriptor which we've been successfully using all
> along and only close when the program exits.
>
> So some change introduced recently (since 1.5.17-1), and possibly in
> 20050609, fixes the dup() issue but now mutex operations are failing in
> strange ways.
>
> Sorry not to have a simple isolated test case for this.  The good news
> is that once it breaks it won't run correctly again until a reboot.


I'm working on a test program to recreate this behavior.


Well...  The problem wasn't in cygwin.

As it happens, in clamav's clamd there were several pthread_mutex_t objects
which weren't initialized to reasonable values (i.e. left to be zero instead
of PTHREAD_MUTEX_INITIALIZER).  Calls to pthread_mutex_lock and
pthread_mutex_unlock on the uninitialized objects, depending on timing and
sequence, apparently confused some aspect of mutex processing, causing
other calls to pthread_mutex_lock and pthread_mutex_unlock to fail in
strange ways.
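
For readers who hit the same symptom, here is a minimal sketch of the two
standard ways to give a pthread_mutex_t a valid initial state.  It is not
taken from clamd: the struct job_queue and all function names below are
hypothetical, chosen only for illustration.

```c
#include <pthread.h>

/* Statically allocated mutex: the initializer macro is sufficient. */
static pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;
static int counter = 0;

/* Hypothetical structure that embeds a mutex and is set up at run time. */
struct job_queue {
    pthread_mutex_t lock;
    int pending;
};

/* A zero-filled pthread_mutex_t is not guaranteed to be a valid mutex;
 * pthread_mutex_init gives it a defined state.
 * Returns 0 on success or the error number from pthread_mutex_init. */
int job_queue_init(struct job_queue *q)
{
    q->pending = 0;
    return pthread_mutex_init(&q->lock, NULL);  /* NULL = default attributes */
}

/* Returns 0 on success or the error number from the failing pthread call. */
int job_queue_add(struct job_queue *q)
{
    int rc = pthread_mutex_lock(&q->lock);
    if (rc != 0)
        return rc;
    q->pending++;
    return pthread_mutex_unlock(&q->lock);
}

int job_queue_destroy(struct job_queue *q)
{
    return pthread_mutex_destroy(&q->lock);
}

/* Uses the statically initialized mutex; returns the new counter value. */
int bump_counter(void)
{
    int value;
    pthread_mutex_lock(&counter_mutex);
    value = ++counter;
    pthread_mutex_unlock(&counter_mutex);
    return value;
}
```

A caller would run job_queue_init once before any thread touches the queue
and job_queue_destroy after the last one is done.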

Appropriate patches have been submitted to the clamav team.

- Mark Pizzolato 



--
Unsubscribe info:  http://cygwin.com/ml/#unsubscribe-simple
Problem reports:   http://cygwin.com/problems.html
Documentation: http://cygwin.com/docs.html
FAQ:   http://cygwin.com/faq/



Re: Multi Threaded programs deadlock doing simple I/O operations

2005-06-12 Thread Mark Pizzolato

On Friday, June 10, 2005 at 3:44 PM, Mark Pizzolato wrote:

> On Thursday, June 09, 2005 at 6:12 PM, Mark Pizzolato wrote:
>> On Thursday, June 09, 2005 at 3:35 PM, Christopher Faylor wrote:
>> > On Wed, Jun 08, 2005 at 05:43:59PM -0700, Mark Pizzolato wrote:
>> > >There is a serious problem for multi threaded programs doing simple
>> > >I/O operations in cygwin (open, dup, fdopen, fclose, and close).
>> > >
>> > >The attached 81 line test program clearly demonstrates the issue (by
>> > >hanging and no longer consuming CPU or performing any I/O
>> > >operations).
>> >
>> > Thanks for the relatively small test case.  That was enough to track
>> > the problem down.  I'm generating a new snapshot with a fix for this
>> > problem.
>>
>> The snapshot looks good!
>>
>> This fixes the stability problems with clamav's clamd that I've been
>> chasing for a long time.
>
> Some more follow up here...I'm running with the 20050609 snapshot dll.
>
> clamav's clamd now runs better than it has ever for me on cygwin.
>
>   until "it doesn't",
>
> once it starts to run poorly it won't run cleanly again until I reboot
> the system
> (I haven't actually tried after merely exiting all processes ..)


Well, I spoke too soon here.  There may be some interaction with many
recently closed TCP sessions sitting in TIME_WAIT.  I'm not sure, but after
some time, I can restart and experience apparently good behavior and then
things get "poor" as described.


If I run with the 20050607 snapshot, the new "poor" behavior doesn't happen, 
while the test program I provided earlier in this thread hangs as described. 
So, the fix to the original problem and the new "poor" behavior are clearly 
related to changes between the 20050607 and the 20050609 snapshots.



> To be more specific about the "poor" behavior:
>
>
> - pthread_mutex_unlock fails leaving errno with a value of 90.  This is
> in a place where there is only one path through about a dozen lines of
> code and the mutex is definitely locked.  There may have been a call to
> pthread_create, and a definite call to pthread_cond_signal.
> - once the above error happens, calls (by the same thread) to accept()
> fail using a file descriptor which we've been successfully using all
> along and only close when the program exits.
>
> So some change introduced recently (since 1.5.17-1), and possibly in
> 20050609, fixes the dup() issue but now mutex operations are failing in
> strange ways.
>
> Sorry not to have a simple isolated test case for this.  The good news
> is that once it breaks it won't run correctly again until a reboot.


I'm working on a test program to recreate this behavior.

- Mark Pizzolato 






Re: Multi Threaded programs deadlock doing simple I/O operations

2005-06-10 Thread Mark Pizzolato

On Thursday, June 09, 2005 at 6:12 PM, Mark Pizzolato wrote:

On Thursday, June 09, 2005 at 3:35 PM, Christopher Faylor wrote:
> On Wed, Jun 08, 2005 at 05:43:59PM -0700, Mark Pizzolato wrote:
> >There is a serious problem for multi threaded programs doing simple I/O
> >operations in cygwin (open, dup, fdopen, fclose, and close).
> >
> >The attached 81 line test program clearly demonstrates the issue (by
> >hanging and no longer consuming CPU or performing any I/O operations).
>
> Thanks for the relatively small test case.  That was enough to track the
> problem down.  I'm generating a new snapshot with a fix for this
> problem.

The snapshot looks good!

This fixes the stability problems with clamav's clamd that I've been
chasing for a long time.


Some more follow up here...I'm running with the 20050609 snapshot dll.

clamav's clamd now runs better than it has ever for me on cygwin.

  until "it doesn't",

once it starts to run poorly it won't run cleanly again until I reboot the
system
(I haven't actually tried after merely exiting all processes ..)

To be more specific about the "poor" behavior:


- pthread_mutex_unlock fails leaving errno with a value of 90.  This is in
a place where there is only one path through about a dozen lines of code and
the mutex is definitely locked.  There may have been a call to
pthread_create, and a definite call to pthread_cond_signal.
- once the above error happens, calls (by the same thread) to accept() fail
using a file descriptor which we've been successfully using all along and
only close when the program exits.


So some change introduced recently (since 1.5.17-1), and possibly in
20050609 fixes the dup() issue but now mutex operations are failing in 
strange ways.


Sorry not to have a simple isolated test case for this.  The good news is
that once it breaks it won't run correctly again until a reboot.


Ideas?

Thanks.

- Mark Pizzolato 






Re: Multi Threaded programs deadlock doing simple I/O operations

2005-06-09 Thread Mark Pizzolato

On Thursday, June 09, 2005 at 3:35 PM, Christopher Faylor wrote:

On Wed, Jun 08, 2005 at 05:43:59PM -0700, Mark Pizzolato wrote:
>There is a serious problem for multi threaded programs doing simple I/O
>operations in cygwin (open, dup, fdopen, fclose, and close).
>
>The attached 81 line test program clearly demonstrates the issue (by
>hanging and no longer consuming CPU or performing any I/O operations).

Thanks for the relatively small test case.  That was enough to track the
problem down.  I'm generating a new snapshot with a fix for this
problem.


The snapshot looks good!

This fixes the stability problems with clamav's clamd that I've been chasing 
for a long time.


Thanks.

- Mark Pizzolato 






Re: Multi Threaded programs deadlock doing simple I/O operations

2005-06-09 Thread Christopher Faylor
On Wed, Jun 08, 2005 at 05:43:59PM -0700, Mark Pizzolato wrote:
>There is a serious problem for multi threaded programs doing simple I/O 
>operations in cygwin (open, dup, fdopen, fclose, and close).
>
>The attached 81 line test program clearly demonstrates the issue (by 
>hanging and no longer consuming CPU or performing any I/O operations).

Thanks for the relatively small test case.  That was enough to track the
problem down.  I'm generating a new snapshot with a fix for this
problem.

cgf




Multi Threaded programs deadlock doing simple I/O operations

2005-06-08 Thread Mark Pizzolato
There is a serious problem for multi threaded programs doing simple I/O 
operations in cygwin (open, dup, fdopen, fclose, and close).


The attached 81 line test program clearly demonstrates the issue (by hanging 
and no longer consuming CPU or performing any I/O operations).


I'm sure that anyone who ever encountered a strange hang in any program
running under cygwin would appreciate a fix for this issue.


- Mark Pizzolato

/* The header names were lost in the archived copy; these are the ones
   the code below needs. */
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdarg.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

pthread_mutex_t log_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Serialize log output so lines from different threads don't interleave. */
void
logit(const char *fmt, ...) {
    va_list args;
    char buf[1024];

    buf[sizeof(buf)-1] = '\0';
    va_start(args, fmt);
    vsnprintf(buf, sizeof(buf)-1, fmt, args);
    va_end(args);
    pthread_mutex_lock(&log_mutex);
    /* pthread_t is not an int; cast it before printing. */
    printf("%lx:", (unsigned long)pthread_self());
    printf("%s", buf);
    pthread_mutex_unlock(&log_mutex);
}

struct TestIoInfo {
    int Iterations;
    int Progress;
};

/* Repeatedly open/dup/fdopen/fclose/close a per-thread scratch file. */
void *
TestIoThread (void *arg) {
    struct TestIoInfo *t = (struct TestIoInfo *)arg;
    unsigned long self = (unsigned long)pthread_self();
    int j;
    int fd, fdd;
    char FileName[255];
    FILE *f;

    logit("IO Thread %lx starting...\n", self);
    snprintf(FileName, sizeof(FileName), "/tmp/TestIoThread-%d-%lx",
             (int)getpid(), self);
    sleep(1);
    for (j=0; j<t->Iterations; ++j) {
        if ((fd = open(FileName, O_RDWR|O_CREAT|O_TRUNC|O_BINARY,
                       S_IRWXU)) < 0) {
            logit("Error Opening File: %s - %d\n", FileName, errno);
            return NULL;
        }
        fdd = dup(fd);
        if ((f = fdopen(fdd, "rb")) == NULL) {
            logit("Can't open descriptor %d - %d\n", fd, errno);
            return NULL;
        }
        fclose(f);      /* also closes fdd */
        close(fd);
        if (0 == (j%t->Progress)) {
            logit("IO Thread %lx - %d\n", self, j);
        }
    }
    unlink(FileName);
    logit("IO Thread %lx done.\n", self);
    return NULL;
}

int
main (int argc, char ** argv) {
    int threadcount = 4;
    int progress = 1;
    pthread_t tid[10];
    int i;
    struct TestIoInfo IoInfo;

    logit("Testing with %d concurrent threads\n", threadcount);
    logit("Progress indicated every %d operations...\n", progress);
    IoInfo.Iterations = 200;
    IoInfo.Progress = progress;
    /* The archived copy is cut off after "for (i=0; i"; the rest of main
       is an assumed reconstruction: start the threads, then join them. */
    for (i=0; i<threadcount; ++i)
        pthread_create(&tid[i], NULL, TestIoThread, &IoInfo);
    for (i=0; i<threadcount; ++i)
        pthread_join(tid[i], NULL);
    return 0;
}