On 4/1/2020 2:34 PM, Ken Brown via Cygwin wrote:
On 4/1/2020 1:14 PM, sten.kristian.ivars...@gmail.com wrote:
On 4/1/2020 4:52 AM, sten.kristian.ivars...@gmail.com wrote:
On 3/31/2020 5:10 PM, sten.kristian.ivars...@gmail.com wrote:
On 3/28/2020 10:19 PM, Ken Brown via Cygwin wrote:
On 3/28/2020 11:43 AM, Ken Brown via Cygwin wrote:
On 3/28/2020 8:10 AM, sten.kristian.ivars...@gmail.com wrote:
On 3/27/2020 10:53 AM, sten.kristian.ivars...@gmail.com wrote:
On 3/26/2020 7:19 PM, Ken Brown via Cygwin wrote:
On 3/26/2020 6:39 PM, Ken Brown via Cygwin wrote:
On 3/26/2020 6:01 PM, sten.kristian.ivars...@gmail.com wrote:
The ENXIO occurs when parallel child processes
simultaneously open the descriptor using O_NONBLOCK.

This is consistent with my guess that the error is
generated by fhandler_fifo::wait.  I have a feeling that
read_ready should have been created as a manual-reset
event, and that more care is needed to make sure it's set
when it should be.

[snip]

Never mind.  I was able to reproduce the problem and find the cause.
What happens is that when the first subprocess exits,
fhandler_fifo::close resets read_ready.  That causes the second
and subsequent subprocesses to think that there's no reader open,
so their attempts to open a writer with O_NONBLOCK fail with ENXIO.
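
For reference, the ENXIO described above is the standard POSIX rule for FIFOs: opening one write-only with O_NONBLOCK fails with ENXIO when no process has it open for reading. Below is a minimal sketch in C of that rule, assuming a writable /tmp. The names try_nonblocking_writer and demo_enxio are made up for illustration and are not part of Cygwin.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Try to open a FIFO for writing without blocking.
   Returns 0 on success, otherwise the errno value set by open(). */
int try_nonblocking_writer(const char *path)
{
    int fd = open(path, O_WRONLY | O_NONBLOCK);
    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}

/* Hypothetical demo: record the result of a non-blocking writer open
   first with no reader, then with a reader open.
   Returns 0 if the FIFO could be set up, -1 otherwise. */
int demo_enxio(int *without_reader, int *with_reader)
{
    char dir[] = "/tmp/fifo-demo-XXXXXX";   /* assumes /tmp is writable */
    char path[64];

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(path, sizeof path, "%s/fifo", dir);
    if (mkfifo(path, 0600) != 0) {
        rmdir(dir);
        return -1;
    }

    /* No reader yet: POSIX says this open fails with ENXIO. */
    *without_reader = try_nonblocking_writer(path);

    /* A non-blocking read-only open succeeds even without a writer. */
    int rfd = open(path, O_RDONLY | O_NONBLOCK);
    if (rfd < 0) {
        unlink(path);
        rmdir(dir);
        return -1;
    }

    /* A reader is open now, so the writer open should succeed. */
    *with_reader = try_nonblocking_writer(path);

    close(rfd);
    unlink(path);
    rmdir(dir);
    return 0;
}
```

On a conforming system, demo_enxio should store ENXIO in *without_reader and 0 in *with_reader. The bug described above made the later writers see the first case even though the parent still had the read end open.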

[snip]

I wrote in a previous mail in this thread that it seemed to work fine
for me as well, but when I bumped up the number of writers and/or the
number of messages (e.g. 25/25), it started to fail again.

[snip]

Yes, it is a resource issue.  There is a limit on the number of writers
that can be open at one time, currently 64.  I chose that number
arbitrarily, with no idea what might actually be needed in practice,
and it can easily be changed.

Does there have to be a limit at all? We would rather see the application
decide how many resources it wants to use. In our particular case there
will be a process manager with an incoming pipe that possibly several
thousand processes will write to.

I agree.

Just for fiddling around (to figure out whether this is the limit that makes
other things work a bit oddly), where is this limit of 64 defined now?

It's MAX_CLIENTS, defined in fhandler.h.  But there seem to be other resource issues also; simply increasing MAX_CLIENTS doesn't solve the problem.  I think there are also problems with the number of threads, for example.  Each time your program forks, the subprocess inherits the rfd file descriptor and its "fifo_reader_thread" starts up.  This is unnecessary for your application, so I tried disabling it (in fhandler_fifo::fixup_after_fork), just as an experiment.

But then I ran into some deadlocks, suggesting that one of the locks I'm using isn't robust enough.  So I've got a lot of things to work on.

In addition, a writer isn't recognized as closed until a reader tries to
read and gets an error.  In your example with 25/25, the list of writers
quickly gets to 64 before the parent ever tries to read.

That explains the behaviour, but shouldn't some error be returned from
open/write (maybe it is and I'm missing it)?

The error is discovered in add_client_handler, called from thread_func.  I think you'll only see it if you run the program under strace.  I'll see if I can find a way to report it.  Currently, there's a retry loop in fhandler_fifo::open when a writer tries to open, and I think I need to limit the number of retries and then error out.
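
To illustrate the idea of limiting the retries and then erroring out, here is a rough user-space sketch in C. It is not the code in fhandler_fifo::open; the function open_writer_with_retry and its parameters max_retries and delay_us are hypothetical, and it only assumes the POSIX rule that a non-blocking writer open fails with ENXIO when no reader is available.

```c
#include <errno.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: open a FIFO for writing with O_NONBLOCK,
   retrying at most max_retries times while open() fails with ENXIO.
   Returns the file descriptor, or -1 with errno set. */
int open_writer_with_retry(const char *path, int max_retries, long delay_us)
{
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        int fd = open(path, O_WRONLY | O_NONBLOCK);
        if (fd >= 0)
            return fd;              /* a reader accepted us */
        if (errno != ENXIO)
            return -1;              /* unrelated error: do not retry */
        if (attempt < max_retries) {
            /* Back off briefly before the next attempt. */
            struct timespec ts;
            ts.tv_sec = delay_us / 1000000L;
            ts.tv_nsec = (delay_us % 1000000L) * 1000L;
            nanosleep(&ts, NULL);
        }
    }
    errno = ENXIO;                  /* retries exhausted: report it */
    return -1;
}
```

A caller could use something like open_writer_with_retry(path, 10, 1000) and treat a return of -1 with errno set to ENXIO as "no reader became available", rather than looping indefinitely.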

I pushed a few improvements and bug fixes, and your 25/25 example now runs without a problem. I increased MAX_CLIENTS to 1024 just for the sake of this example, but I'll work on letting the number of writers increase dynamically as needed.

Ken