David, I just re-read your comments towards the end of your previous email:

"I wonder if we are suffering a similar problem in any other cases; if
it was so, we might need to delay the opening of the ServerSocket
until the LIST (or GET or PUT...) commands are executed"

Do you think creating/binding a new ServerSocket could potentially
take a long time? Is that your concern?

Regards,
Sai Pullabhotla

On Fri, Mar 26, 2010 at 7:11 AM, David Latorre <dvl...@gmail.com> wrote:
> 2010/3/26 Niklas Gustavsson <nik...@protocol7.com>:
>> On Fri, Mar 26, 2010 at 9:50 AM, Fred Moore <fred.moor...@gmail.com> wrote:
>>> 1\ Priority of passive port sharing enhancement: Niklas' survey shows that we
>>> are indeed in good company here, but it's probably worth having a better
>>> look at this anyway, there might be good technical reasons that led all the
>>> other teams not to support this or it may turn out that it's "simply" because
>>> it's somewhat hard to develop and test.
>>
>> After this discussion I'm significantly less thrilled at implementing
>> shared passive ports :-)
>
> Shared passive ports would be a nice feature if they aren't too hard
> to implement. Among the opensource servers, I think coloradoFTP -a
> NIO-based java FTPServer under the LGPL license- offered this (since
> their data connections also use async sockets this shouldn't be too
> hard for them, but I don't know if they solved the use case depicted
> by Sai: when there are several sessions open from the same IP) but it
> seems that commercial solutions offer this and more...
>
>>> 2\ Quick fix for 1.0.x codebase: pushing a 40x to the client when no
>>> passive port is available (or probably better: no passive port is available
>>> within X seconds) is probably something we need to do anyway.
>>
>> Thinking some more about this, I'm personally now convinced that we
>> should simply return an error (not waiting). I'm not sure what the
>> best reply code should be, but "425 Can't open data connection" seems
>> fitting although not specified as a valid return from the PASV command.
>>
>>> 3\ Suspect race condition: the problem description for the originally
>>> reported http://issues.apache.org/jira/browse/FTPSERVER-359 (see also repro
>>> code attached to the jira) actually hints at something different as
>>> well, in fact we state that a few (say 20) parallel threads issuing LISTs in
>>> passive mode are able to "lock-up" the server forever. Questions:
>>>
>>> 3.1\ Is this entirely explained by this thread discussion? (I don't think
>>> so: the server should *always* be able to recover)
>>
>> Agreed, the server should always recover from a situation like this.
>> After looking into how to fix item 2, we need to rerun your tests and
>> make sure we always survive.
>
> Thinking about this issue my understanding of the problem is as follows:
>
> 1. We have a number of connections to FTPServer > the Executor
> threadpool max size (I think it is 16) sending the PASV command.
>
> 2. The first one of them requests the only available port and gets it.
> Now the port is in use by a server socket and any subsequent call to
> requestPassivePort will end up invoking wait().
>
> 3. The thread that processed this PASV command is now available and a
> new PASV request is assigned to it.
>
> 4. Now all threads are trying to request a passive port, but since
> there are no ports available all the threads in the OrderedThreadPool
> get blocked by the wait() method.
>
> I wonder if we are suffering a similar problem in any other cases; if
> it was so, we might need to delay the opening of the ServerSocket
> until the LIST (or GET or PUT...) commands are executed.
>
> I hope I made myself clear and that my understanding was right.
>
>>> 3.2\ Would this be fixed by a quick fix as per 2\? (likely, but it's sort of
>>> like using nukes for mowing the lawn)
>>
>> I really have no idea, but I think we should fix 2 first and then make
>> sure we handle your test case.
>>
>>> In short my current position can be stated as follows: I think that
>>> FTPSERVER-359 has a different root cause from what we discussed, the problem
>>> impact is not completely known at the moment but it appears to *severely*
>>> affect the server availability... having just one port is an easy way of
>>> reproducing it (but not the cause of it).
>>
>> Agreed.
>>
>> /niklas
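
For reference, here is a minimal sketch of the fail-fast behaviour discussed under item 2 in the quoted thread: the port request returns immediately when nothing is free, so a PASV handler never parks a pool thread in wait(). This is not the FtpServer code. PassivePortPool, tryReserve, release, pasvReply and NO_PORT_AVAILABLE are hypothetical names, and the 127,0,0,1 address in the 227 reply is a placeholder. The 425 text is the one Niklas suggested.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch, not the FtpServer implementation: a passive port
 * pool that fails fast instead of blocking the calling thread in wait().
 */
class PassivePortPool {
    /** Returned by tryReserve() when every configured port is in use. */
    static final int NO_PORT_AVAILABLE = -1;

    private final Deque<Integer> freePorts = new ArrayDeque<>();

    PassivePortPool(int... ports) {
        for (int port : ports) {
            freePorts.add(port);
        }
    }

    /** Hands out a free port, or NO_PORT_AVAILABLE right away; never waits. */
    synchronized int tryReserve() {
        Integer port = freePorts.poll();
        return port == null ? NO_PORT_AVAILABLE : port;
    }

    /** Returns a port to the pool once its data connection is closed. */
    synchronized void release(int port) {
        freePorts.add(port);
    }

    /** Builds the reply line a PASV handler could send for a reservation result. */
    static String pasvReply(int port) {
        if (port == NO_PORT_AVAILABLE) {
            // Reply suggested in the thread; an assumption, not a settled choice.
            return "425 Can't open data connection";
        }
        // 227 reply carries h1,h2,h3,h4,p1,p2; the address here is a placeholder.
        return "227 Entering Passive Mode (127,0,0,1,"
                + (port >> 8) + "," + (port & 0xFF) + ")";
    }
}
```

With a single configured port, the first tryReserve() gets the port and the second gets NO_PORT_AVAILABLE straight away, so the handling thread is free to serve the next command instead of joining the others in wait().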