[jira] [Comment Edited] (OAK-6500) NRTIndex leaks file handles due to unclosed IndexReader
[ https://issues.apache.org/jira/browse/OAK-6500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108650#comment-16108650 ] Chetan Mehrotra edited comment on OAK-6500 at 8/1/17 11:16 AM: --- [patch|^OAK-6500-ref-count-v1.patch] using ref counting approach i.e. #B above (also see git branch details in next comment) Here we refactor existing {{IndexNode}} class to 2 * IndexNode - Interface which captures the "api" of existing IndexNode. This ensures that code which made use of IndexNode does not change. Note that all client code use {{IndexTracker}} to obtain {{IndexNode}} instances * IndexNodeManager - This class now has the logic present in earlier IndexNode class and is used by {{IndexTracker}} to obtain {{IndexNode}} instances and pass them to client code h3. Ref Counting This refactoring was required as we now need to maintain some state on per acquire call. IndexNodeManager and NRTIndex now rely on ref counting support in built in [IndexReader|https://lucene.apache.org/core/4_7_1/core/org/apache/lucene/index/IndexReader.html#incRef()] to ensure that reader instance gets closes when all references have been released. The counting is managed in following way # At time of creation of IndexReader the count = (0 -> 1) #* Normal reader instances created once when IndexNodeManager is created have the count 1 #* IndexReader for NRTIndex are also 1 when created in NRTIndex # IndexSearcher creation increments the count to 2 - IndexNodeManager creates new IndexSearcher and passes them either MultiReader or normal IndexReader. At that time (IndexNodeManager#createReader) the count is cremented once # Upon each acquire of IndexNode the count is further incremented (2->3) # Then upon each release of IndexNode the count is decremented by 1 (3 -> 2). This compliments #3 # Then for each refresh when a new IndexSearcher is created the count is again decremented (2 -> 1). This complements #2 # Then upon close #* Normal reader instance are closed in IndexNodeManager#close. The close call decrements the count to 0 (1 -> 0) #* IndexReader for NRTIndex - The decrement happens in NRTIndex#createReader call where if a new reader is created then decRef is called for previous reader This approach ensures that reader ref count should eventually reach 0 and thus get closed. h3. Asserting RefCount In addtion NRTIndex exposes a flag to assert that reader ref counts are zero by the time NRTIndex instance is closed. This assert can be enabled via system property {{oak.lucene.assertAllResourcesClosed}}. The test setups enable this flag to check that all resources are getting cleaned up [~catholicon] - This patch changes a core component IndexNode. Due to rename the diff is tricky. So would be helpful if you can review it once [~teofili] - Would also be helpful if you can review the patch wrt Lucene api usage was (Author: chetanm): [patch|^OAK-6500-ref-count-v1.patch] using ref counting approach i.e. #B above Here we refactor existing {{IndexNode}} class to 2 * IndexNode - Interface which captures the "api" of existing IndexNode. This ensures that code which made use of IndexNode does not change. Note that all client code use {{IndexTracker}} to obtain {{IndexNode}} instances * IndexNodeManager - This class now has the logic present in earlier IndexNode class and is used by {{IndexTracker}} to obtain {{IndexNode}} instances and pass them to client code h3. Ref Counting This refactoring was required as we now need to maintain some state on per acquire call. IndexNodeManager and NRTIndex now rely on ref counting support in built in [IndexReader|https://lucene.apache.org/core/4_7_1/core/org/apache/lucene/index/IndexReader.html#incRef()] to ensure that reader instance gets closes when all references have been released. The counting is managed in following way # At time of creation of IndexReader the count = (0 -> 1) #* Normal reader instances created once when IndexNodeManager is created have the count 1 #* IndexReader for NRTIndex are also 1 when created in NRTIndex # IndexSearcher creation increments the count to 2 - IndexNodeManager creates new IndexSearcher and passes them either MultiReader or normal IndexReader. At that time (IndexNodeManager#createReader) the count is cremented once # Upon each acquire of IndexNode the count is further incremented (2->3) # Then upon each release of IndexNode the count is decremented by 1 (3 -> 2). This compliments #3 # Then for each refresh when a new IndexSearcher is created the count is again decremented (2 -> 1). This complements #2 # Then upon close #* Normal reader instance are closed in IndexNodeManager#close. The close call decrements the count to 0 (1 -> 0) #* IndexReader for NRTIndex - The decrement happens in NRTIndex#createReader call where if a new reader is created then decRef is called for previous reader This approach e
[jira] [Comment Edited] (OAK-6500) NRTIndex leaks file handles due to unclosed IndexReader
[ https://issues.apache.org/jira/browse/OAK-6500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104499#comment-16104499 ] Chetan Mehrotra edited comment on OAK-6500 at 7/28/17 6:46 AM: --- Added ignored test with 1803248. The test uses [UnixOperatingSystemMXBean#maxFileDescriptorCount|https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/UnixOperatingSystemMXBean.html#getMaxFileDescriptorCount()] to check how many file handles are open. Hence would only work on unix setups. The test configures nrt logic to use SimpleFSDirectory which avoids memory mapped file access and hence any open file gets reflected in file descriptor count {noformat} Open file count - At start 180 Open file count - Post creation of 100 nodes at /content/a is 297 Open file count - Post async run at /content/a is 300 Open file count - Post creation of 100 nodes at /content/b is 424 Open file count - Post async run at /content/b is 424 Open file count - Post creation of 100 nodes at /content/c is 545 Open file count - Post async run at /content/c is 548 Open file count - Post creation of 1 nodes at /content/d is 549 Open file count - Post async run at /content/d is 549 Open file count - Post creation of 1 nodes at /content/e is 550 Open file count - Post async run at /content/e is 550 Open file count - At end 550 {noformat} >From the testcase the count can be seen increasing over time was (Author: chetanm): Added ignored test with 1803248. The test uses [UnixOperatingSystemMXBean#maxFileDescriptorCount|https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/UnixOperatingSystemMXBean.html#getMaxFileDescriptorCount()] to check how many file handles are open. Hence would only work on unix setups. The test configures nrt logic to use SimpleFSDirectory which avoids memory mapped file access and hence any open file gets reflected in file descriptor count {noformat} Open file count - Post creation at /content/a is 301 Open file count - Post async run at /content/a is 304 Open file count - Post creation at /content/b is 421 Open file count - Post async run at /content/b is 424 Open file count - Post creation at /content/c is 541 Open file count - Post async run at /content/c is 544 Open file count - Post creation at /content/d is 667 Open file count - Post async run at /content/d is 668 Open file count - Post creation at /content/e is 785 Open file count - Post async run at /content/e is 788 {noformat} >From the testcase the count can be seen increasing over time > NRTIndex leaks file handles due to unclosed IndexReader > --- > > Key: OAK-6500 > URL: https://issues.apache.org/jira/browse/OAK-6500 > Project: Jackrabbit Oak > Issue Type: Bug > Components: lucene >Affects Versions: 1.6.0 >Reporter: Chetan Mehrotra >Assignee: Chetan Mehrotra >Priority: Critical > Fix For: 1.8, 1.7.5, 1.6.4 > > Attachments: OAK-6500-v1.patch > > > On some setups under stress it has been seen that NRTIndex leaks file handles > over time. > Checking with lsof indicates that more than 3 nrt folders per index are being > used. However per design there can be max 3 and after system is not in use > max 1 should be present. > {noformat} > $ lsof -p 9550 | grep '/nrt' | gawk 'match($0, > /.*crx-quickstart\/repository\/index\/(.*?)\/\_.*$/, m) { print m[1]; }' | > sort | uniq > cqPageLucene-1501065263331/nrt1501065335930 > cqPageLucene-1501065263331/nrt1501065374667 > cqPageLucene-1501065263331/nrt1501065392492 > cqPageLucene-1501065263331/nrt1501065440955 > cqPageLucene-1501065263331/nrt1501065473286 > cqPageLucene-1501065263331/nrt1501065507345 > slingeventJob-1501065263330/nrt1501065356975 > slingeventJob-1501065263330/nrt1501065373229 > slingeventJob-1501065263330/nrt1501065394142 > slingeventJob-1501065263330/nrt1501065440953 > slingeventJob-1501065263330/nrt1501065473282 > slingeventJob-1501065263330/nrt1501065507342 > versionStoreIndex-1501065263332/nrt1501065335925 > versionStoreIndex-1501065263332/nrt1501065366781 > versionStoreIndex-1501065263332/nrt1501065392490 > versionStoreIndex-1501065263332/nrt1501065441232 > versionStoreIndex-1501065263332/nrt1501065473285 > {noformat} > Further actually checking index folder indicates that those folder are > actually deleted. So some where the file handle is still referring them. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (OAK-6500) NRTIndex leaks file handles due to unclosed IndexReader
[ https://issues.apache.org/jira/browse/OAK-6500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104499#comment-16104499 ] Chetan Mehrotra edited comment on OAK-6500 at 7/28/17 6:16 AM: --- Added ignored test with 1803248. The test uses [UnixOperatingSystemMXBean#maxFileDescriptorCount|https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/UnixOperatingSystemMXBean.html#getMaxFileDescriptorCount()] to check how many file handles are open. Hence would only work on unix setups. The test configures nrt logic to use SimpleFSDirectory which avoids memory mapped file access and hence any open file gets reflected in file descriptor count {noformat} Open file count - Post creation at /content/a is 301 Open file count - Post async run at /content/a is 304 Open file count - Post creation at /content/b is 421 Open file count - Post async run at /content/b is 424 Open file count - Post creation at /content/c is 541 Open file count - Post async run at /content/c is 544 Open file count - Post creation at /content/d is 667 Open file count - Post async run at /content/d is 668 Open file count - Post creation at /content/e is 785 Open file count - Post async run at /content/e is 788 {noformat} >From the testcase the count can be seen increasing over time was (Author: chetanm): Added ignored test with 1803248. The test uses [UnixOperatingSystemMXBean#maxFileDescriptorCount|https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/UnixOperatingSystemMXBean.html#getMaxFileDescriptorCount()] to check how many file handles are open. Hence would only work on unix setups. The test configures nrt logic to use SimpleFSDirectory which avoids memory mapped file access and hence any open file gets reflected in file descriptor count > NRTIndex leaks file handles due to unclosed IndexReader > --- > > Key: OAK-6500 > URL: https://issues.apache.org/jira/browse/OAK-6500 > Project: Jackrabbit Oak > Issue Type: Bug > Components: lucene >Affects Versions: 1.6.0 >Reporter: Chetan Mehrotra >Assignee: Chetan Mehrotra >Priority: Critical > Fix For: 1.8, 1.7.5, 1.6.4 > > Attachments: OAK-6500-v1.patch > > > On some setups under stress it has been seen that NRTIndex leaks file handles > over time. > Checking with lsof indicates that more than 3 nrt folders per index are being > used. However per design there can be max 3 and after system is not in use > max 1 should be present. > {noformat} > $ lsof -p 9550 | grep '/nrt' | gawk 'match($0, > /.*crx-quickstart\/repository\/index\/(.*?)\/\_.*$/, m) { print m[1]; }' | > sort | uniq > cqPageLucene-1501065263331/nrt1501065335930 > cqPageLucene-1501065263331/nrt1501065374667 > cqPageLucene-1501065263331/nrt1501065392492 > cqPageLucene-1501065263331/nrt1501065440955 > cqPageLucene-1501065263331/nrt1501065473286 > cqPageLucene-1501065263331/nrt1501065507345 > slingeventJob-1501065263330/nrt1501065356975 > slingeventJob-1501065263330/nrt1501065373229 > slingeventJob-1501065263330/nrt1501065394142 > slingeventJob-1501065263330/nrt1501065440953 > slingeventJob-1501065263330/nrt1501065473282 > slingeventJob-1501065263330/nrt1501065507342 > versionStoreIndex-1501065263332/nrt1501065335925 > versionStoreIndex-1501065263332/nrt1501065366781 > versionStoreIndex-1501065263332/nrt1501065392490 > versionStoreIndex-1501065263332/nrt1501065441232 > versionStoreIndex-1501065263332/nrt1501065473285 > {noformat} > Further actually checking index folder indicates that those folder are > actually deleted. So some where the file handle is still referring them. -- This message was sent by Atlassian JIRA (v6.4.14#64029)