[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872140#action_12872140 ]

Uwe Schindler commented on LUCENE-2455:
---

Shouldn't we add a 3.1 index (created with the HEAD of the 3.x branch) to TestBackwardsCompatibility? That way we can verify that preflex indexes with the new CFS header also work.

Some house cleaning in addIndexes*
--

Key: LUCENE-2455
URL: https://issues.apache.org/jira/browse/LUCENE-2455
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Trivial
Fix For: 3.1, 4.0
Attachments: LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, LUCENE-2455_trunk.patch

Today, the use of addIndexes and addIndexesNoOptimize is confusing, especially regarding when to invoke each. Also, addIndexes calls optimize() at the beginning, but only on the target index. It also includes the following javadoc statement, which, from how I understand the code, is wrong: _After this completes, the index is optimized._ optimize() is called at the beginning, not at the end.

On the other hand, addIndexesNoOptimize does not call optimize(), and relies on the MergeScheduler and MergePolicy to handle the merges.

After a short discussion about that on the list (Thanks Mike for the clarifications!) I understand that there are really two core differences between the two:
* addIndexes supports IndexReader extensions
* addIndexesNoOptimize performs better

This issue proposes the following:
# Clear up the documentation of each, spelling out the pros/cons of calling them clearly in the javadocs.
# Rename addIndexesNoOptimize to addIndexes
# Remove optimize() call from addIndexes(IndexReader...)
# Document that clearly in both, with a recommendation to call optimize() beforehand on any of the Directories/Indexes if that is a concern.

That way, we maintain all the flexibility in the API: addIndexes(IndexReader...) allows for using IndexReader extensions, while addIndexes(Directory...) is considered more efficient, because it allows the merges to happen concurrently (depending on the MergeScheduler) and also factors in the MergePolicy. So unless you have an IndexReader extension, addIndexes(Directory...) is really the one you should be using. And you have the freedom to call optimize() before each if you care about it, or not if you don't. Either way, incurring the cost of optimize() is entirely in the user's hands.

BTW, addIndexes(IndexReader...) uses neither the MergeScheduler nor the MergePolicy, but rather calls SegmentMerger directly. This might be another place for improvement. I'll look into it, and if it's not too complicated, I may cover it in this issue as well. If you have any hints that can give me a good head start on that, please don't be shy :).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
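
To make the contract proposed in the issue description concrete, here is a toy Java model, offered only as a sketch. ToyIndex and its methods are hypothetical stand-ins invented for this illustration; they are not Lucene's IndexWriter, and segments are modeled as plain document counts. The point it tries to show is that addIndexes never optimizes implicitly, so the caller decides whether and when to pay for optimize().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an index: a list of segment sizes (document counts). */
class ToyIndex {
    final List<Integer> segments = new ArrayList<>();

    ToyIndex(int... segmentSizes) {
        for (int size : segmentSizes) {
            segments.add(size);
        }
    }

    /** Models optimize(): collapse all segments into a single one. */
    void optimize() {
        int total = 0;
        for (int size : segments) {
            total += size;
        }
        segments.clear();
        if (total > 0) {
            segments.add(total);
        }
    }

    /** Models the proposed addIndexes: append the incoming segments, never optimize implicitly. */
    void addIndexes(ToyIndex... others) {
        for (ToyIndex other : others) {
            segments.addAll(other.segments);
        }
    }
}

class AddIndexesSketch {
    public static void main(String[] args) {
        ToyIndex target = new ToyIndex(10, 20);
        ToyIndex source = new ToyIndex(5, 5, 5);

        // The caller chooses to optimize the incoming index first.
        source.optimize();
        target.addIndexes(source);
        System.out.println(target.segments); // prints [10, 20, 15]

        // Optimizing the target afterwards is again an explicit, separate decision.
        target.optimize();
        System.out.println(target.segments); // prints [45]
    }
}
```

In this model a caller who does not care about segment count simply skips both optimize() calls and pays nothing extra.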
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872149#action_12872149 ]

Shai Erera commented on LUCENE-2455:

Yes! I'll add them and update the tests. I will post a patch after I get more comments.
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872153#action_12872153 ]

Shai Erera commented on LUCENE-2455:

Hmm ... I've created the indexes using the 3x branch, copied them to trunk and updated TestBackwardsCompatibility to refer to them. All tests pass except for testNumericFields. It fails on both CFS and non-CFS indexes, so I'm not sure it's related to this issue at all. The failure is this:

{code}
junit.framework.AssertionFailedError: wrong number of hits expected:<1> but was:<0>
	at org.apache.lucene.index.TestBackwardsCompatibility.testNumericFields(TestBackwardsCompatibility.java:773)
{code}

Can you try to run it on your checkout?
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872182#action_12872182 ]

Shai Erera commented on LUCENE-2455:

Yes - after I updated my checkout and re-created the indexes, the test passes. So I will include them with this patch as well.

Attachments: index.31.cfs.zip, index.31.nocfs.zip, LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, LUCENE-2455_trunk.patch
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871622#action_12871622 ]

Shai Erera commented on LUCENE-2455:

Committed revision 948394 (3x). Will now port everything to trunk.

Attachments: LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, LUCENE-2455_3x.patch
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871630#action_12871630 ]

Uwe Schindler commented on LUCENE-2455:
---

Hi Shai, I only noticed this recently. You added a 3.0 index ZIP to the tests. This conflicts a little with trunk, where a 3.0 index ZIP is already available. I would prefer to keep the older-version ZIPs identical across releases, so it would be good if the numerics backwards test added in trunk could also be in the 3.x branch. Would this be possible? You just have to merge the code.

It also looks strange that the 3.0 backwards tests now contain 3.0 index ZIPs as well, but there is no code for them. Why have you added them to backwards? The 3.0 backwards tests should only modify this one addIndexes test, not add the ZIPs. Maybe simply delete, they are not used.

By the way the 3.0 index zip file generation code is in the 3.0 branch, have you edited it there? You should commit the code there so that one is able to regenerate the 3.0 ZIPs from the stable 3.0.x branch.
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871631#action_12871631 ]

Uwe Schindler commented on LUCENE-2455:
---

I looked at the code; it simply tests that old indexes can be added. Maybe you could just copy the trunk ZIPs for 3.0 to the 3x branch to keep them consistent. The files don't seem to be equal.
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871639#action_12871639 ]

Shai Erera commented on LUCENE-2455:

Ok, I added the indexes from trunk (didn't know they were there). I've changed CFS to write a version header in the file, so that's why I've added a 3.0 index - to make sure it can be read properly by 3.1. What I've added to TestBackwardsCompatibility are tests to ensure that addIndexes works on old indexes (which was good, because after the changes it didn't!).

bq. Maybe simply delete, they are not used.

The testAddIndexes tests were just added, and the 30 indexes are used. So I cannot delete them (see my comment above).

bq. By the way the 3.0 index zip file generation code is in the 3.0 branch, have you edited it there?

Nope, it exists in TestBackwardsCompatibility, commented out, with instructions to uncomment it. I've used that code.
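
The comment above mentions that CFS now writes a version header and that a 3.0 index is needed to check that headerless files still read correctly. A minimal Java sketch of that general technique follows. The byte layout here is an assumption made up for illustration (a negative marker int followed by the entry count) and is not claimed to be Lucene's actual compound file format; the class and method names are likewise hypothetical.

```java
import java.nio.ByteBuffer;

/** Hypothetical file layout for illustration only; not Lucene's actual CFS format. */
class VersionedHeaderSketch {
    /** Assumed marker: negative, so it can never be mistaken for a legacy entry count. */
    static final int FORMAT_WITH_VERSION = -1;

    /** Legacy layout (assumed): the file starts directly with a non-negative entry count. */
    static byte[] writeLegacy(int entryCount) {
        return ByteBuffer.allocate(4).putInt(entryCount).array();
    }

    /** New layout (assumed): a version marker first, then the entry count. */
    static byte[] writeVersioned(int entryCount) {
        return ByteBuffer.allocate(8).putInt(FORMAT_WITH_VERSION).putInt(entryCount).array();
    }

    /** Returns the entry count regardless of which layout produced the bytes. */
    static int readEntryCount(byte[] file) {
        ByteBuffer in = ByteBuffer.wrap(file);
        int first = in.getInt();
        if (first >= 0) {
            return first; // legacy file: the first int is already the count
        }
        if (first != FORMAT_WITH_VERSION) {
            throw new IllegalArgumentException("unknown format version: " + first);
        }
        return in.getInt(); // versioned file: the count follows the marker
    }

    public static void main(String[] args) {
        System.out.println(readEntryCount(writeLegacy(3)));    // prints 3
        System.out.println(readEntryCount(writeVersioned(3))); // prints 3
    }
}
```

A backwards-compatibility test in this style keeps a file written by the old code around and asserts that the new reader still accepts it, which is the role the 3.0 index ZIPs play in the discussion.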
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871641#action_12871641 ]

Shai Erera commented on LUCENE-2455:

While porting the code to trunk, I've noticed that acquireRead/Write, releaseRead/Write and upgradeReadToWrite are either not called anymore, or only called in relation to addIndexes. So I think these can be safely removed as well (from 3x and trunk)?
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871718#action_12871718 ]

Michael McCandless commented on LUCENE-2455:

bq. So I think these can be safely removed as well (from 3x and trunk)?

I think so!
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871728#action_12871728 ]

Shai Erera commented on LUCENE-2455:

Committed revision 948415 (copied the 3.0 indexes from trunk) and removed more unnecessary code from IndexWriter.
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871075#action_12871075 ]

Michael McCandless commented on LUCENE-2455:

Could you fix firstInt to have a very short life? Meaning, you read firstInt, very quickly use it to assign version and count, and never use it again. I.e., all subsequent checks when loading should be against version, not firstInt...

Also, can you maybe rename CFW.PRE_VERSION to CFW.FORMAT_PRE_VERSION (to match the other FORMAT_X constants)?

Otherwise looks great!
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871109#action_12871109 ]

Shai Erera commented on LUCENE-2455:

The only place I see firstInt used perhaps unnecessarily is in the for-loop. So I've changed the code to look like this:

{code}
int count, version;
if (firstInt < CompoundFileWriter.FORMAT_PRE_VERSION) {
  // Versioned header: firstInt is the version, the count follows.
  count = stream.readVInt();
  version = firstInt;
} else {
  // Pre-version header: firstInt is already the count.
  count = firstInt;
  version = CompoundFileWriter.FORMAT_PRE_VERSION;
}
{code}

And then I query for version == CompoundFileWriter.FORMAT_PRE_VERSION inside the for-loop. Is that what you meant?

There is a check before all that, ensuring that the firstInt that was read does not indicate an index corruption -- that should remain as-is, right?
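Editorial sketch: the header handling discussed in this comment can be tried in isolation. What follows is a minimal illustration under stated assumptions, not the committed CompoundFileReader code. The class name CfsHeaderSketch, the Header holder, the hand-rolled readVInt/writeVInt over a ByteBuffer, and the constant values (0 for the pre-version format, -1 for the first versioned format) are all made up for the example. It also assumes that a versioned header stores the count as a VInt right after the version, as the snippet above suggests.

```java
import java.nio.ByteBuffer;

class CfsHeaderSketch {
    // Assumed values, for illustration only.
    static final int FORMAT_PRE_VERSION = 0; // header starts directly with the file count
    static final int FORMAT_CURRENT = -1;    // first format that writes an explicit version

    /** Result of parsing the header: the format version and the number of entries. */
    static final class Header {
        final int version;
        final int count;
        Header(int version, int count) { this.version = version; this.count = count; }
    }

    /** Writes i as a variable-length int: 7 data bits per byte, high bit set means more bytes follow. */
    static void writeVInt(ByteBuffer out, int i) {
        while ((i & ~0x7F) != 0) {
            out.put((byte) ((i & 0x7F) | 0x80));
            i >>>= 7;
        }
        out.put((byte) i);
    }

    /** Reads a variable-length int written by writeVInt. */
    static int readVInt(ByteBuffer in) {
        byte b = in.get();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.get();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** firstInt is consumed here and never leaves this method; callers only see version and count. */
    static Header readHeader(ByteBuffer in) {
        int firstInt = readVInt(in);
        if (firstInt < FORMAT_CURRENT) {
            throw new IllegalStateException("unknown or corrupt format: " + firstInt);
        }
        final int version;
        final int count;
        if (firstInt < FORMAT_PRE_VERSION) {
            version = firstInt;          // versioned header, the count follows
            count = readVInt(in);
        } else {
            version = FORMAT_PRE_VERSION; // old header, firstInt is the count
            count = firstInt;
        }
        return new Header(version, count);
    }

    public static void main(String[] args) {
        ByteBuffer oldStyle = ByteBuffer.allocate(16);
        writeVInt(oldStyle, 3);               // pre-version file: count only
        oldStyle.flip();

        ByteBuffer newStyle = ByteBuffer.allocate(16);
        writeVInt(newStyle, FORMAT_CURRENT);  // versioned file: version, then count
        writeVInt(newStyle, 3);
        newStyle.flip();

        Header a = readHeader(oldStyle);
        Header b = readHeader(newStyle);
        System.out.println(a.version + " " + a.count);
        System.out.println(b.version + " " + b.count);
    }
}
```

Keeping firstInt local to readHeader is one way to give it the "very short life" asked for earlier in the thread: every later check compares version against a named constant.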
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871233#action_12871233 ]

Michael McCandless commented on LUCENE-2455:

Patch looks good Shai! Thanks.
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12870548#action_12870548 ]

Michael McCandless commented on LUCENE-2455:

Patch looks great! So awesome seeing all the -'s in IW.java!! Keep it up :)

And it's great that you added a 3.0 back-compat case to TestBackwardsCompatibility...

Some feedback:

* Can you change the code to read into an int firstInt instead of version? And make an explicit version constant (say PRE_VERSION), and then check whether version is PRE_VERSION in the code. I.e., any tests against version (e.g. comparing version with 0) should be against constants (version == PRE_VERSION), not against 0.
* CFW's comment should say to make it 1 lower than the current one, right? I.e., -2 is the next version?
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12870549#action_12870549 ]

Michael McCandless commented on LUCENE-2455:

bq. Backwards support should be much easier there, because we will provide an index migration tool anyway, and so CFW/CFR can always assume they're reading the latest version (at least in 4.0).

Hmm, I think we should do live migration for this (i.e. don't require a migration tool to fix your index)? This is trivial to do on the fly, right (i.e. as you've done in 3.x)?

bq. CFW should probably use CodecUtils in trunk - it cannot be used in 3x because of how CFW works today - writing a VInt first, while CodecUtils assumes an Int. And I don't think it's healthy to do so much changes on 3x.

Hmm, yeah, because of the live migration I think CodecUtils is not actually a fit here (trunk or 3x).
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12870688#action_12870688 ]

Shai Erera commented on LUCENE-2455:

I'm not sure about the live migration, Mike. First, because all the problems I've mentioned about CodecUtils in 3x will apply to live migration of 3.x indexes in 4.0 code. Second, if everyone who upgrades to 4.0 will need to run the migration tool, then why do any work to support online migration? What's the benefit? Can you think of a case where someone upgrades to 4.0 w/o migrating his indexes (unless he reindexes of course, in which case there is no problem)?

I just think it's weird that we support online migration together w/ a migration tool. If we migrate the indexes w/ the tool to include the new format of CFS, then the online migration code won't ever run, right? And not doing this in the tool seems just a waste. I mean, the user already migrates his indexes, so why incur the cost of an additional online migration?
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12870743#action_12870743 ]

Michael McCandless commented on LUCENE-2455:

bq. With that behind us, did someone start an API migration guide?

Not yet, I think? Go for it!
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12870761#action_12870761 ]

Shai Erera commented on LUCENE-2455:

I will document it in CHANGES under the API section. I think the migration guide format will need its own discussion, and I don't want to block this issue on it. Once we've agreed on the format (people have made a few suggestions), I don't mind helping w/ porting everything relevant from CHANGES to that guide.
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12869563#action_12869563 ]

Shai Erera commented on LUCENE-2455:

I've started to implement addIndexes(Directory...) as agreed - copy files from the incoming directories into the local directory, renaming them on the fly. This works really well with non-CFS segments: a new segment name is generated, the incoming files are renamed, and it all flies smoothly (didn't test w/ deletions yet) - even shared doc stores work great.

But with CFS it doesn't work well, because CFS writes the file names in the CFS file itself. So even if the segment is renamed to _5 (for example), the names written in the file are still _2.* (for example), and openInput fails to locate them.

To overcome this, I propose we do the following:

* Introduce a stripName method on IndexFileNames (3x and trunk) - it will return the file name w/o the _x part.
* CFR ctor - strip the names of the files it reads by calling IFN.stripName -- 3x only
* CFR.openInput - strip the name by calling IFN.stripName -- 3x and trunk
* Document that files should be created through IFN only -- 3x (for clarity) and trunk (otherwise may not be supported).
* Do not save the name in CFS -- trunk only. This removes the need to strip it off when it's read.

That will ensure that files are named following a certain convention which we can rely on in CFR. I don't think it's too much to ask for. CFS itself already knows the name - it's named like it. So there's no value in storing the names of the files it holds.

For 3x it should work well b/c we don't allow for custom index files. For trunk we'll ask to go through IFN to name files - so one can create mycustom.file through IFN, which will be called _x_mycustom.file.

What do you think?
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12868198#action_12868198 ]

Andrzej Bialecki commented on LUCENE-2455:

I understand - see the edited section in my comment: I think that extracting this non-SR code would be great. I would be in fact glad if there was an easier to control API that allows us to directly stream-process postings / stored / tvf-s / etc. in a way that results in a functioning index. Take for example LUCENE-1812 - the only reason it uses addIndexes(IndexReader) is that there was no easy way to modify postings in a way that would still result in a valid index, and there was no other API to add artificially created postings (i.e. not coming from a Directory) to a target index.
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12868221#action_12868221 ]

Andrzej Bialecki commented on LUCENE-2455:

bq. So can't the PrunningReader run on the side, converting the postings to whatever they're supposed to look like

Erhm ... Currently the only way in the user API to write out existing postings (no matter how created) is to use IndexWriter.addIndexes(IndexReader). We can read postings just fine, using various *Enum classes that we can obtain from IndexReader, but there are no comparable high-level output methods - Codecs and other flex classes are IMHO too low-level.

Also, with large indexes the amount of IO/CPU for writing out a Directory and reopening it is non-trivial - it's much more efficient to do this via streaming from the original, already open index.

Also, if we remove this method, then FilterIndexReader may as well go too, because it loses its utility.
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12868277#action_12868277 ]

Shai Erera commented on LUCENE-2455:

Ok let's keep addIndexes(IndexReader) around. This means though that we cannot simplify the PPP API. We'll still need DirPP.
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867870#action_12867870 ]

Michael McCandless commented on LUCENE-2455:

bq. Question: if we've just added the new SI to segmentInfos, why do we sync on this and check if it exists (when we create the compound file)? Is it because there could be a running merge which will merge it into a new segment before we reach that point?

Yes, exactly.

bq. What do you think? Is that what you had in mind about merging on the side and committing in the end?

Yup! This looks great, though I think you should move the docWriter.updateFlushedDocCount call into the sync block above it. We didn't have to do this before because we blocked all add/updateDocument calls.

Also, you shouldn't call docWriter.resumeAllThreads (you didn't pause them).

So this change is a great step forward in concurrency of addIndexes(IndexReader...)!
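The incRef/decRef question answered above can be illustrated with a small stand-alone sketch. This is not Lucene's IndexFileDeleter; the class RefCountedFilesSketch and its methods are hypothetical and only model the idea that a file is removed once its reference count drops to zero, so holding an extra reference while the compound file is being built keeps a concurrent merge from deleting the segment's files.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical reference-counting deleter, for illustration only.
final class RefCountedFilesSketch {

    private final Map<String, Integer> refCounts = new HashMap<>();
    private final Set<String> deleted = new HashSet<>();

    /** Adds one reference to each file. */
    synchronized void incRef(Collection<String> files) {
        for (String file : files) {
            refCounts.merge(file, 1, Integer::sum);
        }
    }

    /** Drops one reference from each file; a file is "deleted" when no references remain. */
    synchronized void decRef(Collection<String> files) {
        for (String file : files) {
            Integer count = refCounts.get(file);
            if (count == null) {
                throw new IllegalStateException("decRef on unreferenced file " + file);
            }
            if (count == 1) {
                refCounts.remove(file);
                deleted.add(file); // a real deleter would remove the file from the directory here
            } else {
                refCounts.put(file, count - 1);
            }
        }
    }

    synchronized boolean isDeleted(String file) {
        return deleted.contains(file);
    }
}
```

In this model the new segment holds one reference, the thread building the compound file takes a second one inside the synchronized block, and a merge that swallows the segment drops only the first. The files survive until the matching decRef in the finally block, which is the same incRef/try/finally/decRef shape as in the posted code.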
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867871#action_12867871 ]

Shai Erera commented on LUCENE-2455:

bq. Also, you shouldn't call docWriter.resumeAllThreads (you didn't pause them).

Oops, missed that :). Thanks! I'll replace addIndexes w/ this code and run tests to check how it flies.
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867498#action_12867498 ]

Shai Erera commented on LUCENE-2455:

While changing addIndexes(reader), I've noticed it first obtains the read lock and then calls startTransaction(true). In between it calls flush + optimize, which I've removed (as we no longer want to do that). When I ran the tests, TestIndexWriter.testAddIndexesWithThreads failed on the assert in startTransaction about numDocsInRam != 0. That's expected as I no longer call flush. The failure does not always occur.

In addIndexes(Dir) flush is called before startTransaction. But it makes sense to do it there, as the local segments are also merged. In the new addIndexes(reader) they won't be, and so I wonder if:
* I shouldn't call startTransaction at all, or
* I should, but also call flush before?
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867523#action_12867523 ]

Michael McCandless commented on LUCENE-2455:

Patch looks good Shai! Only a small typo in CHANGES (unles -> unless).
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867793#action_12867793 ]

Shai Erera commented on LUCENE-2455:

In fact, I've created newAddIndexes (just for the review) which works like this:

{code}
public void newAddIndexes(IndexReader... readers) throws CorruptIndexException, IOException {
  ensureOpen();
  try {
    String mergedName = newSegmentName();
    SegmentMerger merger = new SegmentMerger(this, mergedName, null);

    for (IndexReader reader : readers) // add new indexes
      merger.add(reader);

    int docCount = merger.merge(); // merge 'em

    SegmentInfo info = null;
    synchronized(this) {
      info = new SegmentInfo(mergedName, docCount, directory, false, true, -1, null, false, merger.hasProx());
      setDiagnostics(info, "addIndexes(IndexReader...)");
      segmentInfos.add(info);
    }

    // Notify DocumentsWriter that the flushed count just increased
    docWriter.updateFlushedDocCount(docCount);

    // Now create the compound file if needed
    if (mergePolicy instanceof LogMergePolicy && getUseCompoundFile()) {
      List<String> files = null;
      synchronized(this) {
        // Must incRef our files so that if another thread
        // is running merge/optimize, it doesn't delete our
        // segment's files before we have a chance to
        // finish making the compound file.
        if (segmentInfos.contains(info)) {
          files = info.files();
          deleter.incRef(files);
        }
      }

      if (files != null) {
        try {
          merger.createCompoundFile(mergedName + ".cfs");
          synchronized(this) {
            info.setUseCompoundFile(true);
          }
        } finally {
          deleter.decRef(files);
        }
      }
    }
  } catch (OutOfMemoryError oom) {
    handleOOM(oom, "addIndexes(IndexReader...)");
  } finally {
    if (docWriter != null) {
      docWriter.resumeAllThreads();
    }
  }
}
{code}

Question: if we've just added the new SI to segmentInfos, why do we sync on _this_ and check if it exists (when we create the compound file)? Is it because there could be a running merge which will merge it into a new segment before we reach that point?

What do you think? Is that what you had in mind about merging on the side and committing in the end?
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866524#action_12866524 ]

Michael McCandless commented on LUCENE-2455:

bq. But, why wouldn't they be able to use the Directory... version of the method?

Adding indexes using FilterIndexReader is useful -- e.g. look at how the multi-pass index splitter tool works.

bq. What I want is for the resolveExternals to be even faster, plain and shallow resolution.

For addIndexes(Directory), assuming the codecs are identical (the write codec equals the codec used to write the external segment), and assuming the doc stores of the external segment are private to it, I think we should be able to do a straight file-level copy, but renaming the segment in the process?
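The file-level copy with renaming suggested above could look roughly like the following sketch. It is an illustration under stated assumptions rather than Lucene code: a directory is modelled as a plain Map from file name to bytes, codecs are compared as strings, and the names SegmentCopySketch, canCopyDirectly, belongsTo and copySegment are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of copying one segment's files under a new segment name.
final class SegmentCopySketch {

    /** The two preconditions from the comment: identical codec and private doc stores. */
    static boolean canCopyDirectly(String writeCodec, String segmentCodec, boolean docStoresPrivate) {
        return writeCodec.equals(segmentCodec) && docStoresPrivate;
    }

    /** True for "_2.fdt" or "_2_1.del" when segment is "_2", but false for "_20.fdt". */
    static boolean belongsTo(String fileName, String segment) {
        if (!fileName.startsWith(segment) || fileName.length() == segment.length()) {
            return false;
        }
        char next = fileName.charAt(segment.length());
        return next == '.' || next == '_';
    }

    /** Copies every file of oldName from source, renaming it to newName on the fly. */
    static Map<String, byte[]> copySegment(Map<String, byte[]> source, String oldName, String newName) {
        Map<String, byte[]> copied = new TreeMap<>();
        for (Map.Entry<String, byte[]> entry : source.entrySet()) {
            String fileName = entry.getKey();
            if (belongsTo(fileName, oldName)) {
                String renamed = newName + fileName.substring(oldName.length());
                copied.put(renamed, entry.getValue().clone()); // bytes are copied untouched
            }
        }
        return copied;
    }
}
```

When canCopyDirectly returns false, the caller would fall back to a regular merge. The sketch only renames files; anything that embeds the old segment name inside a file, such as a compound file's entry table, would still need separate handling.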
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866539#action_12866539 ] Shai Erera commented on LUCENE-2455: bq. Adding indexes using FilterIndexReader is useful I'm not against that Mike. addIndexes should allow for both IndexReader and Directory. It's the registerIndexes (or whatever name we come up with) which should work with Directory only, and then, even if the app calls addIndexes with its own custom IR, it can still call registerIndexes w/ the Directory only, to do that fast copy/registration. Since no IR method will be involved in the process. So let's not confuse the two - addIndexes will exist and work as they are today. registerIndexes will be a new one. bq. assuming the codecs are identical (the write codec equals the codec used to write the external segment), and assuming the doc stores of the external segment are private to it Right. Thanks for pointing that out, as it will become an important NOTE in the documentation. This method (registerIndexes) is definitely for advanced users, that have to know *exactly* what's in the foreign indexes. For example, I need this because I'm building several indexes on several nodes and then I want to add them to a central/master one. I know they don't have deletions, and each is already optimized. Therefore traversing the posting lists (as fast as it would be) is completely unnecessary. bq. but renaming the segment in the process? Sure! I think we should really 'register' them in the Directory, as if they are the newly flushed segments. I'm sure you have a general idea on how this can be done? Assuming through SegmentInfos or something? 
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866118#action_12866118 ]

Michael McCandless commented on LUCENE-2455:

bq. Remove optimize() call from addIndexes(IndexReader...)

This still makes me nervous. Yeah it's bad that this method does optimize() now. But if we remove it, it's bad that this method can attempt to do a ridiculously immense merge, since it [naively] just stuffs everything in and does one merge. Ie, both are bad.

Maybe... we could do this: only merge the incoming IndexReaders, appending a new segment to the end of the index? Ie do no merging whatsoever of the current segments in the index. Yes, this can result in unbalanced segments (ie, a huge segment appears after the long tail of level 0 segments), but the merge policy can handle this -- it'll work out whatever merges are then necessary to get this segment onto the level that roughly matches its size.

bq. So unless you have an IR extension, addDirectories is really the one you should be using.

You mean addIndexes(Directory..)?

{quote}
BTW, addIndexes(IndexReader...) uses neither the MergeScheduler nor the MergePolicy, but rather calls SegmentMerger directly. This might be another place for improvement. I'll look into it, and if it's not too complicated, I may cover it by this issue as well. If you have any hints that can give me a good head start on that, please don't be shy :).
{quote}

This would be best of all :) But it's tricky, because our MP/MS assume they are working w/ a SegmentInfo. Maybe it could somehow be made to work -- eg IR does give us maxDoc and numDocs (so we can know the del doc count). But eg LogByteSizeMergePolicy goes and computes the total byte size of the segment (via SegmentInfo), which we cannot do from an IR.
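To make the "merge only the incoming readers and append one segment" idea above concrete, here is a toy Java model. It is not Lucene code: `ToyIndex`, `flush` and `level` are hypothetical names, a segment is reduced to its document count, and `level` is only a rough log-of-size stand-in, assumed here for illustration, for how a log-style merge policy might bucket segments.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of an index: each entry is the document count of one segment. */
final class ToyIndex {
    private final List<Integer> segmentDocCounts = new ArrayList<>();

    /** Simulates flushing a new segment with the given number of documents. */
    void flush(int docCount) {
        segmentDocCounts.add(docCount);
    }

    /**
     * Models the proposed behaviour: the incoming readers are merged only with
     * each other, and the result is appended as a single new segment. Existing
     * segments are left untouched.
     */
    void addIndexes(int... incomingNumDocs) {
        int merged = 0;
        for (int numDocs : incomingNumDocs) {
            merged += numDocs;
        }
        segmentDocCounts.add(merged);
    }

    /** Returns a copy of the per-segment document counts, in index order. */
    List<Integer> segments() {
        return new ArrayList<>(segmentDocCounts);
    }

    /** Rough size level: the largest k such that mergeFactor^k <= docCount. */
    static int level(int docCount, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        int level = 0;
        long threshold = mergeFactor;
        while (docCount >= threshold) {
            level++;
            threshold *= mergeFactor;
        }
        return level;
    }
}
```

In this model, two small flushed segments followed by `addIndexes(40000, 60000)` leave the index with a large segment at the tail whose level is far above its neighbours, which is the unbalanced shape the comment expects the merge policy to sort out afterwards.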
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866245#action_12866245 ]

Shai Erera commented on LUCENE-2455:

bq. You mean addIndexes(Directory..)?

Yes, copy-paste error.

bq. Maybe... we could do this: only merge the incoming IndexReaders, appending a new segment to the end of the index?

I like it. IMO, that's what the method should do anyway, for better performance and service to the users. If I'm adding indexes, that doesn't mean I want a whole merge process to kick off. If I want that, I can call maybeMerge or optimize afterwards.

Basically, what I would like to add (and I'm not sure it belongs to this issue) is a super fast addIndexes method, something like registerIndexes, which doesn't even traverse the posting lists, remove deleted docs etc. - it simply registers the new segments in the Directory. If needed - do a bulk copy of the files and update segments*. Simple as that. Maybe it does fit in this issue, as part of the general house cleaning?

I will look more closely into supporting MP + MS w/ addIndexes(readers). Can't promise anything as I learn the code as I go :).
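The "bulk copy of the files" step of the registerIndexes idea above might look roughly like the following sketch. It uses only `java.nio.file` and does not touch Lucene's Directory API; `BulkSegmentCopy` and `copySegmentFiles` are hypothetical names, and the assumption that a segment's files are exactly those named `<segmentName>.<extension>` is a simplification made here. Updating the segments* file is deliberately left out, and whether real segment files can be renamed this simply is one of the questions the thread leaves open.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class BulkSegmentCopy {
    private BulkSegmentCopy() {}

    /**
     * Copies every file named oldName.<ext> from source into target as
     * newName.<ext>, without reading or interpreting the file contents.
     * Returns the sorted list of file names created in target.
     */
    static List<String> copySegmentFiles(Path source, Path target,
                                         String oldName, String newName) throws IOException {
        List<String> copied = new ArrayList<>();
        // The glob matches only files that belong to the given segment name.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(source, oldName + ".*")) {
            for (Path file : files) {
                String fileName = file.getFileName().toString();
                String renamed = newName + fileName.substring(oldName.length());
                // Files.copy fails if the destination exists, so nothing is overwritten.
                Files.copy(file, target.resolve(renamed));
                copied.add(renamed);
            }
        }
        Collections.sort(copied);
        return copied;
    }
}
```

Because the copy never opens a reader over the source files, its cost in this sketch depends only on file sizes, not on the number of postings or deleted documents.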
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866302#action_12866302 ]

Michael McCandless commented on LUCENE-2455:

I agree, addIndexes should be minimal in the work it does... But bulk copy of the files isn't really possible for addIndexes(IR...) in general, since the readers can be arbitrary (eg FilterIndexReader).
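The point about arbitrary readers can be illustrated with a toy model: once a reader wraps another and changes what is visible, copying the underlying files would bypass that wrapper. Everything below is hypothetical (`ToyReader`, `StoredReader`, `HidingReader`, `ToyAdder`) and only loosely inspired by the FilterIndexReader case; it is not Lucene's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Toy stand-in for an index reader: it exposes only the documents callers may see. */
interface ToyReader {
    List<String> documents();
}

/** Returns exactly what is stored "on disk". */
final class StoredReader implements ToyReader {
    private final List<String> onDisk;

    StoredReader(List<String> onDisk) {
        this.onDisk = new ArrayList<>(onDisk);
    }

    @Override
    public List<String> documents() {
        return new ArrayList<>(onDisk);
    }
}

/** Wraps another reader and hides some of its documents. */
final class HidingReader implements ToyReader {
    private final ToyReader in;
    private final Predicate<String> hidden;

    HidingReader(ToyReader in, Predicate<String> hidden) {
        this.in = in;
        this.hidden = hidden;
    }

    @Override
    public List<String> documents() {
        List<String> visible = new ArrayList<>();
        for (String doc : in.documents()) {
            if (!hidden.test(doc)) {
                visible.add(doc);
            }
        }
        return visible;
    }
}

final class ToyAdder {
    private ToyAdder() {}

    /** Reader-based add: everything goes through the reader, so wrappers are honoured. */
    static List<String> addViaReader(List<String> target, ToyReader reader) {
        List<String> result = new ArrayList<>(target);
        result.addAll(reader.documents());
        return result;
    }
}
```

Adding through a `HidingReader` yields only the visible documents, whereas a file-level copy of what `StoredReader` holds would bring the hidden ones along too. That is why, in this model, a bulk copy can only be offered for plain on-disk indexes.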
[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
[ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866444#action_12866444 ]

Shai Erera commented on LUCENE-2455:

Ok. But since addIndexes(IR) is for IR extensions only, I think the number of people that will be limited by it is very low. And why wouldn't they be able to use the Directory... version of the method? Since it's a bulk copy, we don't need IR methods. Maybe just call dir.copyTo or something of that sort? The method will only be asked to copy files (in case they exist elsewhere).

I was thinking of introducing just a Directory version of such a method. Basically, if you use NoMP and call addIndexesNoOptimize today, you get half of what I want, as only resolveExternals will be called. What I want is for resolveExternals to be even faster: a plain and shallow resolution.