Re: RFR 8021820: Number of opened files used in select() is limited to 1024 [macosx]

Aleksej Efimov Fri, 16 Aug 2013 07:53:31 -0700

Chris,
Thank you for review and for sponsorship.
 I have prepared hg changeset patches for root and jdk repositories:
root repo patch: http://cr.openjdk.java.net/~aefimov/8021820/jdk8_hg.export

jdk repo patch:http://cr.openjdk.java.net/~aefimov/8021820/jdk_jdk8_hg.export

I think if there is no objections - we can do the push.


Thank you
-Aleksej


On 16.08.2013 14:35, Chris Hegarty wrote:

I see no objections to this, and I think it has been through enoughreview. I'll sponsor this change for Aleksej.


-Chris.

On 13/08/2013 10:20, Aleksej Efimov wrote:

Stuart,
Thanks for your comments.
New webrev: http://cr.openjdk.java.net/~aefimov/8021820/webrev.03/
<http://cr.openjdk.java.net/%7Eaefimov/8021820/webrev.03/>
Comments below.

Thanks,
Aleksej

On 08/08/2013 06:10 AM, Stuart Marks wrote:

Hi Aleksej,

Thanks for the update. The situation is a bit twisted.

I picked up a couple clues from David Holmes and Jon Gibbons. The
NoClassDefFoundError occurs when the JVM has hit its resource limit
for the number of open files, *and* it is being run in a development
environment with individual class files in a hierarchy, e.g. within

ROOT/build/<platform>/jdk/classes

In this case, since each class is in its own file, it's quite likely
that the loading of an individual class will fail because of lack of
file descriptors. If the JVM itself encounters this, it will generate

a NoClassDefFoundError without a stack trace such as the ones we'veseen.


If the test is being run against a fully built JDK image, classes are
loaded from rt.jar. This is usually already open, so quite often
classes can be loaded without having to open additional files. In this
case we get the FileNotFoundException as expected.

It was very interesting and enlightening information. Thank you for
that, I'll keep it in mind.


The resource limits seem to vary from system to system, and even from
one Ubuntu version to the next (mine has a default hard limit of 1024
open files). Unfortunately, while it might seem reasonable to have
minimum specifications for systems on which we run tests, in practice
this has proven very difficult. Since the bug being fixed is Mac-only,
and the default open file limit for Mac seems to be unlimited, perhaps
it makes most sense for this to be a Mac-only test.

From the discussion here [1] which refers to an Apple technote [2] it
seems like the best way to test for being on a Mac is something like
this:

if (! System.getProperty("os.name").contains("OS X")) {
return;
}

That is, report success if we're on anything other than a Mac.

Agreed, there is no need to run this test on other platforms (the bug
states only MacOSX) and as a good addition we don't need to worry about
limits on other platforms. The suggested check was added to test. And it
was executed on all platforms (without fix): all except MacOSX passes,

on MacOSX expected result - "java.net.SocketException: Invalidargument".


Once we're sure we're on a Mac, having the test fail if it can't open
enough files seems reasonable.

I'd suggest putting a comment at the top of the test class saying that
this test *must* be run in othervm mode, to ensure that files are
closed properly. That way, you can take out the cleanupFiles() method
too, as well as avoiding lots of exception handling and related
cleanup code.

Comment was added and cleanupFiles() method was removed as suggested.


Finally, a style point: try/catch statements are conventionally
indented as a single multi-block, not as separate statements. I'd
suggest something like this:

// The accept() call will throw SocketException if the
// select() has failed due to limitation on fds size,
// indicating test failure. A SocketTimeoutException
// is expected, so it is caught and ignored, and the test
// passes.

try {
socket.accept();
} catch (SocketTimeoutException e) { }

Fixed.


s'marks


[1]

http://mail.openjdk.java.net/pipermail/macosx-port-dev/2012-November/005148.html

[2]https://developer.apple.com/library/mac/#technotes/tn2002/tn2110.html




On 8/7/13 6:01 AM, Aleksej Efimov wrote:

Stuart, thank you for you comments, responses are below.
New webrev:
http://cr.openjdk.java.net/~aefimov/8021820/webrev.02/
<http://cr.openjdk.java.net/%7Eaefimov/8021820/webrev.02/>


-Aleksej

On 08/06/2013 05:14 AM, Stuart Marks wrote:

Hi Aleksej,

Thanks for the update. I took a look at the revised test, and there
are still
some issues. (I didn't look at the build changes.)

1) System-specific resource limits.

I think the biggest issue is resource limits on the number of open
files per
process that might vary from system to system. On my Ubuntu system,
the hard
limit on the number of open files is 1,024. The test opens 1,023
files and
then one more for the socket. Unfortunately the JVM and jtreg have
several

files open already, and the test crashes before the openFiles()method

completes.

(Oddly it crashes with a NoClassDefFoundError from the main thread's
uncaught

exception handler, and then the test reports that it passed!Placing a

try/catch of Throwable in main() or openFiles() doesn't catch this
error. I
have no explanation for this. When run standalone -- i.e., not from
jtreg --
the test throws FileNotFoundException (too many open files) from
openFiles(),
which is expected.)

On my Mac (10.7.5) the soft limit is 256 files, but the hard limit is
unlimited. The test succeeds in opening all its files but fails
because of
the select() bug you're fixing. (This is expected; I didn't rebuild
my JDK
with your patch.) I guess the soft limit doesn't do anything on Mac.

Amazingly, the test passed fine on both Windows XP and Windows 8.

I'm not entirely sure what to do about resource limits. Since the
test is
able to open >1024 files on Mac, Windows, and possibly other
Linuxes, it
seems reasonable to continue with this approach. If it's possible to
catch

the error that occurs if the test cannot open its initial 1,024files,perhaps it should do this, log a message indicating what happened,and

consider the test to have passed. I'm mystified by the
uncaught/uncatchable
NoClassDefFoundError though.

I wonder if this is a question of test environment required for JTREG
tests: if
we'll execute JTREG tests with low value assigned to fd hard limit
(for example
10) we'll see a lot of unrelated test failures. So, I suggest that we
can
assume that there is no hard limits set (or at least default ones,
i.e. default
fd limit on Ubuntu is 4096) on test machine. But we should consider
test as
Failed if test failed to prepare it's environment because of some
external
limitations. The JTREG doesn't meet this criteria (log test as PASS
and prints
incorrect Exception). To illustrate it I have repeated your
experiments on
ubuntu linux: set fd hard limit to 1024 (ulimit -Hn 1024) and got
this error by
manual run of test:
Exception in thread "main" java.io.FileNotFoundException: testfile
(Too many
open files)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:128)
at SelectFdsLimit.openFiles(SelectFdsLimit.java:63)
at SelectFdsLimit.main(SelectFdsLimit.java:81)

Seems correct to me.
An by JTREG (also with hard limit set to 1024):
----------messages:(3/123)----------
command: main SelectFdsLimit
reason: User specified action: run main/othervm SelectFdsLimit
elapsed time (seconds): 0.168
----------System.out:(0/0)----------
----------System.err:(5/250)----------

Exception: java.lang.NoClassDefFoundError thrown from the
UncaughtExceptionHandler in thread "MainThread"
STATUS:Passed.
Exception in thread "main"
Exception: java.lang.NoClassDefFoundError thrown from the
UncaughtExceptionHandler in thread "main"
result: Passed. Execution successful


test result: Passed. Execution successful


The results are identical to results mentioned by you. It seems to me
that
jtreg doesn't correctly processes such test error (at least it
shouldn't be
considered as Pass). And I suggest two ways of resolving it:
1. If we don't have official limitations (or default) on what
resources test
can use then we shouldn't do any modifications to test.
2. If there is some limitations that we should honor then we'll need
to figure
out what to do with NoClassDefFoundError exception in JTREG.


2) Closing files.

If an exception is thrown while opening the initial set of files, or
sometime
during the closing process, the test can still leak files.

One approach would be to keep a data structure representing the
current set
of open files, and close them all in a finally-block around all the
test
logic, and making sure that exceptions from the close() call are
caught and
do not prevent the rest of the files from being closed.

This seems like a lot of work. Perhaps a more effective approach
would be to
run the test in "othervm" mode, as follows:

@run main/othervm SelectFdsLimit

This will cause the test to run in a dedicated JVM, so all files
will be

closed automatically when it exits. (It would be good to add acomment

explaining the need for othervm, if you do this.)

main/othervm and comments were added.

3) Port number for sockets.

It's fairly common for tests to fail occasionally because they usesome

constant port number that sometimes happens to be in use at the same
time by
another process on the system. I have to say, 8080 is a pretty
common port
number. :-)

For purposes of this test, you can let the system assign a port.
Just use:

new ServerSocket(0)

Completely agree that 8080 port - bad port for testing =). Changedto 0.

4) Style.

It's probably possible to use the same File object for the test
file, instead
of creating new File objects repeatedly.

Agree and corrected.


It might be nice to add a comment explaining the logic of the test,
that
SocketTimeoutException is expected, and that failure will be
indicated if the
accept() throws SocketException caused by the underlying mishandling
of large
fds by select().

Comments were added.


Thanks,

s'marks



On 8/5/13 4:47 AM, Aleksej Efimov wrote:

Alan, Tim,

I have addressed your comments and as a result - new webrev:
http://cr.openjdk.java.net/~aefimov/8021820/webrev.01

The list of changes:
1. The connection to Oracle site is removed (it's not internal, but
anyway it's
better not to rely on availability of external resource in test).
In current

version a server socket is created and accept() method is usedfor bug

disclosure.
2. The cleanup method is added for closing file streams. The JTREG
successfully
cleaned-up on windows after this modification.
3. common/autoconf/toolchain.m4 untouched, but 'bash
common/autoconf/autogen.sh' was executed to update
generated-configure.sh.

Aleksej


On 07/31/2013 06:35 PM, Tim Bell wrote:

Aleksej, Alan

The change in common/autoconf/toolchain.m4 looks correct to me,
and I think
that is a good place to have it. Remember to run 'bash
common/autoconf/autogen.sh' and check in the
generated-configure.sh files as
part of the changeset.

I didn't look at the test case, but I think Alan has some good
points.

Tim

On 07/31/13 06:45 AM, Alan Bateman wrote:

On 31/07/2013 05:18, Aleksej Efimov wrote:

Hi,
Can I have a review for the following problem:
The MACOSX JDK (more precisely - the java.net classes) uses the
select()
system call to wait for different events on sockets fds. And the
default
behaviour for select() on Darwin is to fail when fdset contains
the fd with
id greater than FDSET_SIZE(=1024). Test case in webrev
illustrates this
behavior.
There is at least one solution for it: use
-D_DARWIN_UNLIMITED_SELECT
compilation flag for all macosx sources: this won't affect other
parts of
JDK because they are not using select().
Currently, I have added this compilation flag to
common/autoconf/generated-configure.sh and
common/autoconf/generated-configure.sh. I wonder, if there is a
better
place where I can put this flag?

The webrev:http://cr.openjdk.java.net/~aefimov/8021820/webrev.00/

BUG: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8021820


Thanks for looking into this one. The build changes look okay to
me but it's
probably best that someone on build-dev agree to those.
Michael McMahon can probably explain why the net code is using
select for
timed read/accept (I have a vague recollection of there being an
issue with
poll due to the way that it is implemented on kqueue with the
result that it
had to be changed to use select).

I think the test needs re-work. It looks to me that the it
attempts to
connect to an Oracle internal site so that's not going to work
everywhere.
In general we don't want the tests to be dependent on hosts that
may or may
not exist (we had tests that used to this in the past but they
caused a lot
of grief). It also looks like the test doesn't close the 1023
files that it
opens at the start and so I assume this test will always fail on
Windows
when jtreg tries to clean-up.

-Alan.

Re: RFR 8021820: Number of opened files used in select() is limited to 1024 [macosx]

Reply via email to