[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-27 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12872140#action_12872140
 ] 

Uwe Schindler commented on LUCENE-2455:
---

Shouldn't we add a 3.1 index (created with the HEAD of the 3.x branch) to 
TestBackwardsCompatibility, so we can verify that preflex indexes with the new 
CFS header also work?

 Some house cleaning in addIndexes*
 --

 Key: LUCENE-2455
 URL: https://issues.apache.org/jira/browse/LUCENE-2455
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Trivial
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, 
 LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, 
 LUCENE-2455_trunk.patch


 Today, the use of addIndexes and addIndexesNoOptimize is confusing - 
 especially on when to invoke each. Also, addIndexes calls optimize() in 
 the beginning, but only on the target index. It also includes the 
 following javadoc statement, which, from how I understand the code, is 
 wrong: _After this completes, the index is optimized._ -- optimize() is 
 called at the beginning and not at the end. 
 On the other hand, addIndexesNoOptimize does not call optimize(), and 
 relies on the MergeScheduler and MergePolicy to handle the merges. 
 After a short discussion about that on the list (Thanks Mike for the 
 clarifications!) I understand that there are really two core differences 
 between the two: 
 * addIndexes supports IndexReader extensions
 * addIndexesNoOptimize performs better
 This issue proposes the following:
 # Clear up the documentation of each, spelling out the pros/cons of 
   calling them clearly in the javadocs.
 # Rename addIndexesNoOptimize to addIndexes
 # Remove optimize() call from addIndexes(IndexReader...)
 # Document that clearly in both, with a recommendation to call optimize() 
   beforehand on any of the Directories/Indexes if it's a concern. 
 That way, we maintain all the flexibility in the API - 
 addIndexes(IndexReader...) allows for using IR extensions, 
 addIndexes(Directory...) is considered more efficient, because it allows 
 merges to happen concurrently (depending on the MergeScheduler) and also 
 factors in the MergePolicy. So unless you have an IR extension, 
 addIndexes(Directory...) is really the one you should be using. And you 
 have the freedom to call optimize() before 
 each if you care about it, or don't if you don't care. Either way, 
 incurring the cost of optimize() is entirely in the user's hands. 
 BTW, addIndexes(IndexReader...) uses neither the MergeScheduler nor the 
 MergePolicy, but rather calls SegmentMerger directly. This might be 
 another place for improvement. I'll look into it, and if it's not too 
 complicated, I may cover it by this issue as well. If you have any hints 
 that can give me a good head start on that, please don't be shy :). 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-27 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12872149#action_12872149
 ] 

Shai Erera commented on LUCENE-2455:


Yes! I'll add them and update the tests. I will post a patch after I get more 
comments.



[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-27 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12872153#action_12872153
 ] 

Shai Erera commented on LUCENE-2455:


Hmm ... I've created the indexes using the 3x branch, copied them to trunk and 
updated TestBackwardsCompatibility to refer to them. All tests pass except for 
testNumericFields. It fails on both CFS and non-CFS indexes, and so I'm not 
sure it's related to this issue at all. The failure is this:

{code}
junit.framework.AssertionFailedError: wrong number of hits expected:1 but was:0
at org.apache.lucene.index.TestBackwardsCompatibility.testNumericFields(TestBackwardsCompatibility.java:773)
{code}

Can you try to run it on your checkout?



[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-27 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12872182#action_12872182
 ] 

Shai Erera commented on LUCENE-2455:


Yes - after I updated my checkout and re-created the indexes, the test passes. 
So I will include them with this patch as well.

 Attachments: index.31.cfs.zip, index.31.nocfs.zip, 
 LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, 
 LUCENE-2455_3x.patch, LUCENE-2455_3x.patch, LUCENE-2455_trunk.patch



[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-26 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871622#action_12871622
 ] 

Shai Erera commented on LUCENE-2455:


Committed revision 948394 (3x).

I will now port everything to trunk.



[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-26 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871630#action_12871630
 ] 

Uwe Schindler commented on LUCENE-2455:
---

Hi Shai,

I only noticed this recently. You added a 3.0 index ZIP to the tests. This 
conflicts a little with trunk, where a 3.0 index ZIP is already available. 
I would prefer to keep the older-version ZIPs identical across releases, so 
it would be good if the numerics backwards test added in trunk could also be 
in the 3.x branch. Would this be possible? You just have to merge the code.

It also looks strange that the 3.0 backwards tests now contain 3.0 index ZIPs 
as well, but there is no code for them. Why have you added them to backwards? 
The 3.0 backwards tests should only modify this one addIndexes test, not add 
the ZIPs. Maybe simply delete them, they are not used.

By the way the 3.0 index zip file generation code is in the 3.0 branch, have 
you edited it there? You should commit the code there so one is able to 
regenerate the 3.0 ZIPs from the stable 3.0.x branch.



[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-26 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871631#action_12871631
 ] 

Uwe Schindler commented on LUCENE-2455:
---

I looked at the code; it simply tests that old indexes can be added. Maybe you 
could just copy the trunk ZIPs for 3.0 to the 3x branch to keep them 
consistent. The files don't seem to be equal.



[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-26 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871639#action_12871639
 ] 

Shai Erera commented on LUCENE-2455:


OK, I added the indexes from trunk (didn't know they were there). I've changed 
CFS to write a version header in the file, so that's why I added a 3.0 index: 
to make sure it can be read properly by 3.1. What I've added to 
TestBackwardsCompatibility are tests to ensure that addIndexes works on old 
indexes (which was good, because after the changes it didn't!).

bq. Maybe simply delete them, they are not used.

The testAddIndexes tests were just added, and the 3.0 indexes are used. So I 
cannot delete them (see my comment above).

bq. By the way the 3.0 index zip file generation code is in the 3.0 branch, 
have you edited it there?

Nope, it exists in TestBackwardsCompatibility, commented out, with 
instructions to uncomment it. I've used that code.
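
For readers unfamiliar with the version-header idea mentioned above, here is a minimal sketch of how a reader can accept both headerless and versioned files. It is not Lucene's actual CFS format or code: the class name, the FORMAT_V1 marker, and the int-based layout are all assumptions made for illustration. It relies only on the observation that a legacy file starts with a non-negative entry count, so a negative first value can be treated as a version marker.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch, NOT Lucene's real compound-file code or on-disk layout.
public class VersionedHeaderSketch {

    // Assumed marker: legacy files begin with an entry count (>= 0),
    // so a negative first int can only be a version marker.
    public static final int FORMAT_V1 = -1;

    // Old-style file: the entry count comes first, with no header.
    public static byte[] writeLegacy(int entryCount) {
        return write(false, entryCount);
    }

    // New-style file: version marker first, then the entry count.
    public static byte[] writeVersioned(int entryCount) {
        return write(true, entryCount);
    }

    private static byte[] write(boolean withHeader, int entryCount) {
        if (entryCount < 0) {
            throw new IllegalArgumentException("entryCount must be >= 0");
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            if (withHeader) {
                out.writeInt(FORMAT_V1);
            }
            out.writeInt(entryCount);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the entry count from either layout.
    public static int readEntryCount(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int first = in.readInt();
            if (first >= 0) {
                return first; // legacy file: the first int is the count itself
            }
            if (first != FORMAT_V1) {
                throw new IllegalStateException("unknown format: " + first);
            }
            return in.readInt(); // versioned file: the count follows the marker
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readEntryCount(writeLegacy(3)));    // prints 3
        System.out.println(readEntryCount(writeVersioned(3))); // prints 3
    }
}
```

Keeping one fixture per layout, in the same spirit as keeping a 3.0 index next to the newer ones, exercises both branches of the reader.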



[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-26 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871641#action_12871641
 ] 

Shai Erera commented on LUCENE-2455:


While porting the code to trunk, I've noticed that acquireRead/Write, 
releaseRead/Write, and upgradeReadToWrite are either not called anymore, or 
called only in relation to addIndexes. So I think these can be safely removed 
as well (from 3x and trunk)?



[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-26 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871718#action_12871718
 ] 

Michael McCandless commented on LUCENE-2455:


bq.  So I think these can be safely removed as well (from 3x and trunk)?

I think so!



[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-26 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871728#action_12871728
 ] 

Shai Erera commented on LUCENE-2455:


Committed revision 948415 (copied the 3.0 indexes from trunk) and removed more 
unnecessary code from IndexWriter.




[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-25 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871075#action_12871075
 ] 

Michael McCandless commented on LUCENE-2455:


Could you fix 'firstInt' to have a very short life?

Meaning, you read firstInt, and very quickly use that to assign to version & 
count, and no longer use it again.  Ie, all subsequent checks when loading 
should be against version, not firstInt...

Also, can you maybe rename CFW.PRE_VERSION -> CFW.FORMAT_PRE_VERSION?  (to 
match the other FORMAT_X).

Otherwise looks great!




[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-25 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871109#action_12871109
 ] 

Shai Erera commented on LUCENE-2455:


The only place I see firstInt is used perhaps unnecessarily is in the for-loop. 
So I've changed the code to look like this:

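{code}
int count, version;
if (firstInt < CompoundFileWriter.FORMAT_PRE_VERSION) {
  // Versioned file: firstInt is the format version, the entry count follows.
  count = stream.readVInt();
  version = firstInt;
} else {
  // Pre-version file: firstInt is already the entry count.
  count = firstInt;
  version = CompoundFileWriter.FORMAT_PRE_VERSION;
}
{code}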

And then I query for version == CompoundFileWriter.FORMAT_PRE_VERSION inside 
the for-loop. Is that what you meant?

There is a check before all that which ensures the firstInt that was read does 
not indicate index corruption -- that should remain as-is, right?
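
For readers following along without the patch, the header handling discussed here can be sketched in isolation. This is only an illustration, not the committed code: the class name, the IntSupplier stand-in for the input stream, the constant values (0 for the pre-version format, -1 for the current one) and the exact corruption check are all assumptions made for the example.

```java
import java.util.function.IntSupplier;

final class CfsHeaderSketch {
    // Assumed values, for illustration only: unversioned files start with a
    // non-negative entry count, versioned files start with a negative format number.
    static final int FORMAT_PRE_VERSION = 0;
    static final int FORMAT_CURRENT = -1;

    /** Holds the two values that the rest of the loading code should look at. */
    static final class Header {
        final int version;
        final int count;

        Header(int version, int count) {
            this.version = version;
            this.count = count;
        }
    }

    /** Reads the header; firstInt is consumed here and never leaves this method. */
    static Header readHeader(IntSupplier in) {
        int firstInt = in.getAsInt();
        // Assumed corruption check: anything below the newest known format is rejected.
        if (firstInt < FORMAT_CURRENT) {
            throw new IllegalStateException("unknown format: " + firstInt);
        }
        if (firstInt < FORMAT_PRE_VERSION) {
            // Versioned file: the entry count is stored after the version.
            return new Header(firstInt, in.getAsInt());
        }
        // Pre-version file: the first int is the entry count itself.
        return new Header(FORMAT_PRE_VERSION, firstInt);
    }

    /** Hypothetical stand-in for an input stream that yields the given ints in order. */
    static IntSupplier intsOf(int... values) {
        int[] pos = {0};
        return () -> values[pos[0]++];
    }

    public static void main(String[] args) {
        Header old = readHeader(intsOf(3));      // old layout: count only
        Header cur = readHeader(intsOf(-1, 3));  // new layout: version, then count
        System.out.println(old.version + " " + old.count);
        System.out.println(cur.version + " " + cur.count);
    }
}
```

Everything after readHeader only sees version and count, which is the "short life" for firstInt that Mike asked for.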




[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-25 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871233#action_12871233
 ] 

Michael McCandless commented on LUCENE-2455:


Patch looks good Shai!  Thanks.




[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12870548#action_12870548
 ] 

Michael McCandless commented on LUCENE-2455:


Patch looks great!  So awesome seeing all the -'s in IW.java!!  Keep it up :)

And it's great that you added 3.0 back compat case to
TestBackwardsCompatibility...

Some feedback:

  * Can you change the code to read into an int firstInt instead of
version?  And make an explicit version (say PRE_VERSION), and
then check if version is PRE_VERSION in the code.  Ie, any tests
against version (eg version < 0) should be against constants
(version == PRE_VERSION) not against 0.

  * CFW's comment should be "make it 1 lower than the current one",
right?  Ie, -2 is the next version?





[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12870549#action_12870549
 ] 

Michael McCandless commented on LUCENE-2455:


bq. Backwards support should be much easier there, because we will provide an 
index migration tool anyway, and so CFW/CFR can always assume they're reading 
the latest version (at least in 4.0).

Hmm I think we should do live migration for this (ie don't require a
migration tool to fix your index)?  This is trivial to do on the fly
right (ie as you've done in 3.x).

bq. CFW should probably use CodecUtils in trunk - it cannot be used in 3x 
because of how CFW works today - writing a VInt first, while CodecUtils assumes 
an Int. And I don't think it's healthy to do so much changes on 3x.

Hmm yeah because of the live migration I think CodecUtils is not
actually a fit here (trunk or 3x).
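
One way to picture the live migration idea is a reader that accepts both layouts and a writer that only ever emits the newest one, so old files are upgraded whenever they happen to be rewritten. The sketch below is a simplified illustration under several assumptions: it uses plain java.io streams with fixed-width ints instead of Lucene's VInt encoding, and the class name, method names and constant values are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class LiveMigrationSketch {
    // Assumed values, for illustration only.
    static final int FORMAT_PRE_VERSION = 0;
    static final int FORMAT_CURRENT = -1;

    /** Writes an entry table; legacy == true omits the version, like the old layout. */
    static byte[] write(String[] names, boolean legacy) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            if (!legacy) {
                out.writeInt(FORMAT_CURRENT);
            }
            out.writeInt(names.length);
            for (String name : names) {
                out.writeUTF(name);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads either layout on the fly; no separate migration step is needed. */
    static String[] read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int firstInt = in.readInt();
            // A negative first int is a version, so the count follows it.
            int count = firstInt < FORMAT_PRE_VERSION ? in.readInt() : firstInt;
            String[] names = new String[count];
            for (int i = 0; i < count; i++) {
                names[i] = in.readUTF();
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Rewriting a file is what upgrades it: read any layout, write the current one. */
    static byte[] upgrade(byte[] data) {
        return write(read(data), false);
    }

    public static void main(String[] args) {
        String[] names = {"a.fdt", "a.fdx"};
        byte[] legacy = write(names, true);
        System.out.println(Arrays.toString(read(legacy)));
        System.out.println(Arrays.equals(upgrade(legacy), write(names, false)));
    }
}
```

The cost of this approach is a single branch on the first int at read time, which is why it can coexist with an offline migration tool rather than replace it.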





[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-24 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12870688#action_12870688
 ] 

Shai Erera commented on LUCENE-2455:


I'm not sure about the live migration, Mike. First because all the problems 
I've mentioned about CodecUtils in 3x will apply to live migration of 3.x 
indexes in 4.0 code. Second, if everyone who upgrades to 4.0 needs to run 
the migration tool, then why do any work on supporting online migration? What's 
the benefit? Can you think of a case where someone upgrades to 4.0 w/o migrating 
his indexes (unless he reindexes of course, in which case there is no problem)?

I just think it's weird that we support online migration together w/ a 
migration tool. If we migrate the indexes w/ the tool to include the new format 
of CFS, then the online migration code won't ever run, right? And not doing 
this in the tool seems just a waste? I mean the user already migrates his 
indexes, so why incur the cost of an additional online migration?




[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12870743#action_12870743
 ] 

Michael McCandless commented on LUCENE-2455:


bq. With that behind us, did someone start an API migration guide?

Not yet, I think?  Go for it!




[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-24 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12870761#action_12870761
 ] 

Shai Erera commented on LUCENE-2455:


I will document it in CHANGES under the API section. I think the migration guide 
format will need its own discussion, and I don't want to block this issue. When 
we've agreed on the format (people have made a few suggestions), I don't mind 
helping w/ porting everything relevant from CHANGES to that guide.




[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-20 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12869563#action_12869563
 ] 

Shai Erera commented on LUCENE-2455:


I've started to implement addIndexes(Directory...) as agreed - copy files from 
the incoming ones into the local directory, while renaming them on the fly. 
This works really well with non-CFS segments: a new segment name is generated, 
the incoming files are renamed and this all flies smoothly (didn't test w/ 
deletions yet) - even shared doc stores work great.

But with CFS it doesn't work well because CFS writes the file names in the CFS 
file itself, and so even if the segment is renamed to _5 (for example), the 
names that are written in the file are _2.* (for example), and openInput fails 
to locate them. To overcome this, I propose we do the following:

* Introduce a stripName method on IndexFileNames (IFN; 3x and trunk) - it will 
return the file name w/o the _x part.
* CompoundFileReader (CFR) ctor - strip the names of the files it reads by 
calling IFN.stripName -- 3x only
* CFR.openInput - strip the requested name by calling IFN.stripName -- 3x and 
trunk
* Document that files should be created through IFN only -- 3x (for clarity) 
and trunk (otherwise may not be supported).
* Do not save the name in CFS -- trunk only. This will remove the need to strip 
it off when it's read.

That will ensure that files are named following a certain convention which we 
can rely on in CFR. I don't think it's too much to ask for. The CFS itself 
already knows the name - the CFS file is named after it. So there's no value in 
storing the names of the files it holds.

For 3x it should work well b/c we don't allow for custom index files. For trunk 
we'll ask to go through IFN to name files - so one can create mycustom.file 
through IFN which will be called _x_mycustom.file.

What do you think?
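
The renaming scheme proposed above can be sketched in a few lines. This is 
only an illustration: SegmentFileNames and its two methods are hypothetical 
names, not the actual IndexFileNames code, and the sketch assumes that a 
segment file name starts with '_' followed by the segment name, which ends at 
the next '_' or '.'.

```java
// Hypothetical sketch of the proposed stripName idea; this is NOT Lucene's
// real IndexFileNames implementation.
final class SegmentFileNames {

  private SegmentFileNames() {}

  /** Returns the file name without its leading _x segment part. */
  static String stripName(String fileName) {
    // Assumption: the segment part ends at the first '_' or '.' found after
    // the leading underscore, e.g. _2.fdt or _2_1.del.
    for (int i = 1; i < fileName.length(); i++) {
      char c = fileName.charAt(i);
      if (c == '_' || c == '.') {
        return fileName.substring(i);
      }
    }
    return fileName; // nothing recognizable to strip
  }

  /** Re-homes a file under a new segment name, e.g. _2.fdt becomes _5.fdt. */
  static String rename(String fileName, String newSegment) {
    return newSegment + stripName(fileName);
  }
}
```

With names stripped this way, a lookup inside a copied compound file no longer 
depends on the segment name the file was originally written under.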




[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-17 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12868198#action_12868198
 ] 

Andrzej Bialecki  commented on LUCENE-2455:
---

I understand - see the edited section in my comment: I think that extracting 
this non-SR code would be great. I would in fact be glad if there was an 
easier-to-control API that allowed us to directly stream-process postings / 
stored / tvf-s / etc. in a way that results in a functioning index. Take for 
example LUCENE-1812 - the only reason it uses addIndexes(IndexReader) is that 
there was no easy way to modify postings in a way that would still result in a 
valid index, and there was no other API to add artificially created postings 
(i.e. not coming from a Directory) to a target index.




[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-17 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12868221#action_12868221
 ] 

Andrzej Bialecki  commented on LUCENE-2455:
---

bq. So can't the PrunningReader run on the side, converting the postings to 
whatever they're supposed to look like 

Erhm ... Currently the only way in the user API to write out existing postings 
(no matter how created) is to use IndexWriter.addIndexes(IndexReader). We can 
read postings just fine, using various *Enum classes that we can obtain from 
IndexReader, but there are no comparable high-level output methods - Codecs 
and other flex classes are IMHO too low-level.

Also, with large indexes the amount of IO/CPU for writing out a Directory and 
reopening it is non-trivial - it's much more efficient to do this via streaming 
from the original, already open index.

Also, if we remove this method, then FilterIndexReader may as well go too, 
because it loses its utility.
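
The streaming argument above can be illustrated with a toy model. Nothing here 
is Lucene API: TermSource, PruningSource and ToyTarget are hypothetical 
stand-ins for IndexReader, a FilterIndexReader subclass and IndexWriter. The 
sketch only shows the shape of the idea, i.e. a wrapper that rewrites the 
stream lazily while the target consumes it, with no intermediate copy written 
out and reopened.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Toy model only; these types are hypothetical and not part of Lucene.
interface TermSource {
  Iterator<String> terms();
}

/** Wraps another source and lazily drops terms shorter than minLength. */
final class PruningSource implements TermSource {
  private final TermSource in;
  private final int minLength;

  PruningSource(TermSource in, int minLength) {
    this.in = in;
    this.minLength = minLength;
  }

  @Override
  public Iterator<String> terms() {
    final Iterator<String> it = in.terms();
    return new Iterator<String>() {
      private String next = advance();

      // Pulls from the wrapped source until a term passes the filter.
      private String advance() {
        while (it.hasNext()) {
          String t = it.next();
          if (t.length() >= minLength) {
            return t;
          }
        }
        return null;
      }

      @Override
      public boolean hasNext() {
        return next != null;
      }

      @Override
      public String next() {
        if (next == null) {
          throw new NoSuchElementException();
        }
        String result = next;
        next = advance();
        return result;
      }
    };
  }
}

/** Consumes any number of sources, one term at a time. */
final class ToyTarget {
  final List<String> terms = new ArrayList<>();

  void addIndexes(TermSource... sources) {
    for (TermSource source : sources) {
      for (Iterator<String> it = source.terms(); it.hasNext();) {
        terms.add(it.next());
      }
    }
  }
}
```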




[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-17 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12868277#action_12868277
 ] 

Shai Erera commented on LUCENE-2455:


Ok let's keep addIndexes(IndexReader) around. This means though that we cannot 
simplify the PPP API. We'll still need DirPP.




[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-15 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867870#action_12867870
 ] 

Michael McCandless commented on LUCENE-2455:


bq. Question: if we've just added the new SI to segmentInfos, why do we sync on 
this and check if it exists (when we create the compound file)? Is it because 
there could be a running merge which will merge it into a new segment before we 
reach that point?

Yes, exactly.

bq. What do you think? Is that what you had in mind about merging on the side 
and committing in the end?

Yup!  This looks great, though I think you should move the 
docWriter.updateFlushedDocCount call into the sync block above it.  We didn't 
have to do this before because we blocked all add/updateDocument calls.

Also, you shouldn't call docWriter.resumeAllThreads (you didn't pause them).

So this change is a great step forward in concurrency of 
addIndexes(IndexReader...)!
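
The "merge on the side, commit in the end" pattern discussed above can be 
sketched with a toy class. ToyWriter is hypothetical and not Lucene code: its 
fields merely stand in for segmentInfos and the flushed doc count. The sketch 
assumes the reason for moving the count update into the sync block is that 
other threads may now index concurrently, so the new segment and its doc count 
should become visible together.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; a hypothetical stand-in for IndexWriter's bookkeeping.
final class ToyWriter {
  private final List<String> segments = new ArrayList<>();
  private int flushedDocCount;

  /** The expensive part runs without holding the writer's lock. */
  int mergeOnTheSide(int... docCounts) {
    int total = 0;
    for (int count : docCounts) {
      total += count;
    }
    return total;
  }

  /** Publishes the segment and its doc count in a single critical section. */
  synchronized void commitMerged(String segmentName, int docCount) {
    segments.add(segmentName);
    flushedDocCount += docCount;
  }

  synchronized int segmentCount() {
    return segments.size();
  }

  synchronized int flushedDocCount() {
    return flushedDocCount;
  }
}
```

A caller would invoke mergeOnTheSide first and commitMerged afterwards, so the 
lock is held only for the short bookkeeping step.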




[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-15 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867871#action_12867871
 ] 

Shai Erera commented on LUCENE-2455:


bq. Also, you shouldn't call docWriter.resumeAllThreads (you didn't pause them).

Oops, missed that :). Thanks !

I'll replace addIndexes w/ this code and run tests to check how it flies.




[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867498#action_12867498
 ] 

Shai Erera commented on LUCENE-2455:


While changing addIndexes(reader), I've noticed it first obtains the read lock 
and then calls startTransaction(true). In between it calls flush + optimize, 
which I've removed (as we no longer want to do that). When I ran the tests, 
TestIndexWriter.testAddIndexesWithThreads failed on the assert in 
startTransaction about numDocsInRam != 0. That's expected as I no longer call 
flush. The failure does not always occur.

In addIndexes(Dir) flush is called before startTransaction. But it makes sense 
to do it there, as the local segments are also merged. In the new 
addIndexes(reader) they won't be, and so I wonder if:
* I shouldn't call startTransaction at all, or
* I should, but also call flush before?




[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867523#action_12867523
 ] 

Michael McCandless commented on LUCENE-2455:


Patch looks good Shai!  Only a small typo in CHANGES (unles -> unless).




[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867793#action_12867793
 ] 

Shai Erera commented on LUCENE-2455:


In fact, I've created newAddIndexes (just for the review), which works like this:
{code}
  public void newAddIndexes(IndexReader... readers)
      throws CorruptIndexException, IOException {

    ensureOpen();

    try {
      String mergedName = newSegmentName();
      SegmentMerger merger = new SegmentMerger(this, mergedName, null);

      for (IndexReader reader : readers)      // add new indexes
        merger.add(reader);

      int docCount = merger.merge();          // merge 'em

      SegmentInfo info = null;
      synchronized(this) {
        info = new SegmentInfo(mergedName, docCount, directory, false, true,
            -1, null, false, merger.hasProx());
        setDiagnostics(info, "addIndexes(IndexReader...)");
        segmentInfos.add(info);
      }

      // Notify DocumentsWriter that the flushed count just increased
      docWriter.updateFlushedDocCount(docCount);

      // Now create the compound file if needed
      if (mergePolicy instanceof LogMergePolicy && getUseCompoundFile()) {

        List<String> files = null;

        synchronized(this) {
          // Must incRef our files so that if another thread
          // is running merge/optimize, it doesn't delete our
          // segment's files before we have a chance to
          // finish making the compound file.
          if (segmentInfos.contains(info)) {
            files = info.files();
            deleter.incRef(files);
          }
        }

        if (files != null) {
          try {
            merger.createCompoundFile(mergedName + ".cfs");
            synchronized(this) {
              info.setUseCompoundFile(true);
            }
          } finally {
            deleter.decRef(files);
          }
        }
      }
    } catch (OutOfMemoryError oom) {
      handleOOM(oom, "addIndexes(IndexReader...)");
    } finally {
      if (docWriter != null) {
        docWriter.resumeAllThreads();
      }
    }
  }
{code}

Question: if we've just added the new SI to segmentInfos, why do we sync on 
_this_ and check if it exists (when we create the compound file)? Is it because 
there could be a running merge which will merge it into a new segment before we 
reach that point?

What do you think? Is that what you had in mind about merging on the side and 
committing in the end?
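
The incRef/decRef guard described in the code comment above can be pictured 
with a small reference-counting toy. ToyFileDeleter is hypothetical and is not 
Lucene's deleter; it only assumes that a file may be removed once its count 
drops to zero, which is why taking an extra reference keeps the files around 
while the compound file is being built.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only; a hypothetical stand-in for a ref-counting file deleter.
final class ToyFileDeleter {
  private final Map<String, Integer> refCounts = new HashMap<>();

  /** Takes one more reference on the file. */
  synchronized void incRef(String file) {
    Integer count = refCounts.get(file);
    refCounts.put(file, count == null ? 1 : count + 1);
  }

  /** Drops one reference; returns true if the file may now be deleted. */
  synchronized boolean decRef(String file) {
    Integer count = refCounts.get(file);
    if (count == null) {
      throw new IllegalStateException("unknown file: " + file);
    }
    if (count == 1) {
      refCounts.remove(file);
      return true;
    }
    refCounts.put(file, count - 1);
    return false;
  }

  synchronized boolean isReferenced(String file) {
    return refCounts.containsKey(file);
  }
}
```

In this model, if a concurrent merge drops the index's own reference while the 
compound file is still being written, the file survives until the builder 
releases its reference in the finally block.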


[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866524#action_12866524
 ] 

Michael McCandless commented on LUCENE-2455:


bq. But, why wouldn't they be able to use the Directory... version of the 
method?

Adding indexes using FilterIndexReader is useful -- e.g. look at how the 
multi-pass index splitter tool works.

bq. What I want is for the resolveExternals to be even faster, plain and 
shallow resolution.

For addIndexes(Directory), assuming the codecs are identical (the write codec 
equals the codec used to write the external segment), and assuming the doc 
stores of the external segment are private to it, I think we should be able to 
do a straight file-level copy, but renaming the segment in the process?
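
A file-level copy with a rename could look roughly like the sketch below. It
is a minimal illustration using plain java.nio, under the assumption that a
segment's files are all named <segment>.<extension>; SegmentFileCopy,
renameSegmentFile and copySegment are hypothetical names, not Lucene APIs, and
registering the copied segment in the index is not covered.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper; none of these names exist in Lucene. */
class SegmentFileCopy {

  /** Maps e.g. "_3.fdt" to "_7.fdt" when renaming segment "_3" to "_7". */
  static String renameSegmentFile(String fileName, String oldSegment, String newSegment) {
    String prefix = oldSegment + ".";
    if (!fileName.startsWith(prefix)) {
      throw new IllegalArgumentException(
          fileName + " does not belong to segment " + oldSegment);
    }
    return newSegment + fileName.substring(oldSegment.length());
  }

  /** Copies every "<oldSegment>.*" file from src to dst under the new segment name. */
  static List<String> copySegment(Path src, Path dst, String oldSegment, String newSegment)
      throws IOException {
    List<String> copied = new ArrayList<>();
    // Assumes segment names contain no glob metacharacters.
    try (DirectoryStream<Path> files = Files.newDirectoryStream(src, oldSegment + ".*")) {
      for (Path file : files) {
        String target =
            renameSegmentFile(file.getFileName().toString(), oldSegment, newSegment);
        Files.copy(file, dst.resolve(target)); // fails if the target name is taken
        copied.add(target);
      }
    }
    Collections.sort(copied);
    return copied;
  }

  public static void main(String[] args) throws IOException {
    Path src = Files.createTempDirectory("external-index");
    Path dst = Files.createTempDirectory("target-index");
    Files.write(src.resolve("_3.fdt"), new byte[] {1, 2, 3});
    Files.write(src.resolve("_3.tis"), new byte[] {4, 5});
    Files.write(src.resolve("_4.fdt"), new byte[] {6}); // other segment, left alone
    System.out.println(copySegment(src, dst, "_3", "_7")); // prints [_7.fdt, _7.tis]
  }
}
```

No postings are read or rewritten here, which is the point of the proposal.
Files that do not follow the plain <segment>.<extension> pattern, and doc
stores shared between segments, are outside what this sketch handles.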

[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866539#action_12866539
 ] 

Shai Erera commented on LUCENE-2455:


bq. Adding indexes using FilterIndexReader is useful 

I'm not against that, Mike. addIndexes should allow for both IndexReader and 
Directory. It's registerIndexes (or whatever name we come up with) that 
should work with Directory only. Then, even if the app calls addIndexes 
with its own custom IR, it can still call registerIndexes w/ the Directory 
only to do that fast copy/registration, since no IR method will be involved in 
the process.

So let's not confuse the two - addIndexes will exist and work as it does 
today. registerIndexes will be a new one.

bq. assuming the codecs are identical (the write codec equals the codec used 
to write the external segment), and assuming the doc stores of the external 
segment are private to it

Right. Thanks for pointing that out, as it will become an important NOTE in the 
documentation. This method (registerIndexes) is definitely for advanced users 
who have to know *exactly* what's in the foreign indexes. For example, I need 
this because I'm building several indexes on several nodes and then I want to 
add them to a central/master one. I know they don't have deletions, and each is 
already optimized. Therefore traversing the posting lists (as fast as it would 
be) is completely unnecessary.

bq. but renaming the segment in the process?

Sure! I think we should really 'register' them in the Directory, as if they are 
the newly flushed segments. I'm sure you have a general idea on how this can be 
done? Assuming through SegmentInfos or something?

[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866118#action_12866118
 ] 

Michael McCandless commented on LUCENE-2455:


bq. Remove optimize() call from addIndexes(IndexReader...)

This still makes me nervous.  Yeah, it's bad that this method does optimize() 
now.  But if we remove it, it's bad that this method can attempt to do a 
ridiculously immense merge, since it [naively] just stuffs everything in and 
does one merge.  Ie, both are bad.

Maybe... we could do this: only merge the incoming IndexReaders, appending 
a new segment to the end of the index?  Ie do no merging whatsoever of the 
current segments in the index.

Yes, this can result in unbalanced segments (ie, a huge segment appears after 
the long tail of level 0 segments), but, the merge policy can handle this -- 
it'll work out whatever merges are then necessary to get this segment onto the 
level that roughly matches its size. 
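
To make the "level that roughly matches its size" idea concrete, here is a
simplified model of size-based levels. It is a sketch under stated assumptions,
not LogMergePolicy's actual code: the class SegmentLevels, the use of document
counts as the size measure, and the integer level formula are all illustrative.

```java
/** Hypothetical model: level = number of times the size reaches a power of mergeFactor. */
class SegmentLevels {

  /** Returns floor(log_mergeFactor(sizeInDocs)) using integer arithmetic only. */
  static int level(long sizeInDocs, int mergeFactor) {
    if (sizeInDocs < 1 || mergeFactor < 2) {
      throw new IllegalArgumentException("need sizeInDocs >= 1 and mergeFactor >= 2");
    }
    int level = 0;
    long threshold = mergeFactor;
    while (sizeInDocs >= threshold) {
      level++;
      if (threshold > Long.MAX_VALUE / mergeFactor) {
        break; // next threshold would overflow; no larger level is representable
      }
      threshold *= mergeFactor;
    }
    return level;
  }

  public static void main(String[] args) {
    // A long tail of small segments followed by one huge appended segment.
    long[] segments = {5, 7, 3, 50, 1_000_000};
    for (long size : segments) {
      System.out.println(size + " docs -> level " + level(size, 10));
    }
  }
}
```

With a merge factor of 10, the small segments land on levels 0 and 1 while the
appended 1,000,000-document segment lands on level 6, so a level-based policy
can tell it apart from the tail and merge around it.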

bq. So unless you have an IR extension, addDirectories is really the one you 
should be using.

You mean addIndexes(Directory..)?

{quote}
BTW, addIndexes(IndexReader...) does not use neither the MergeScheduler 
nor MergePolicy, but rather call SegmentMerger directly. This might be 
another place for improvement. I'll look into it, and if it's not too 
complicated, I may cover it by this issue as well. If you have any hints 
that can give me a good head start on that, please don't be shy .
{quote}

This would be best of all :)  But it's tricky, because our MP/MS assume they 
are working w/ a SegmentInfo.  But, maybe it could somehow be made to work -- 
eg IR does give us maxDoc, numDocs (so we can know del doc count).  But eg 
LogByteSizeMergePolicy goes and computes total byte size of the segment (via 
SegmentInfo) which we cannot do from an IR.
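
The deleted-doc count mentioned above follows directly from the two counters a
reader exposes. A minimal sketch, assuming a hypothetical ReaderStats holder
(not a Lucene class) that a merge policy could consult when only document
counts, and no byte size, are available:

```java
/** Hypothetical holder for per-reader counters; not part of Lucene. */
final class ReaderStats {
  private final int maxDoc;  // documents including deleted ones
  private final int numDocs; // live documents only

  ReaderStats(int maxDoc, int numDocs) {
    if (numDocs < 0 || numDocs > maxDoc) {
      throw new IllegalArgumentException("require 0 <= numDocs <= maxDoc");
    }
    this.maxDoc = maxDoc;
    this.numDocs = numDocs;
  }

  /** Deleted documents are the difference between the two counters. */
  int deletedDocCount() {
    return maxDoc - numDocs;
  }

  /** Size proxy measured in live documents, for when byte size is unknown. */
  int sizeInDocs() {
    return numDocs;
  }

  public static void main(String[] args) {
    ReaderStats stats = new ReaderStats(1000, 940);
    System.out.println(stats.deletedDocCount()); // prints 60
    System.out.println(stats.sizeInDocs());      // prints 940
  }
}
```

A doc-count size is only a proxy, so a policy that sizes segments by bytes
would still need something this sketch cannot provide.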



[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-11 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866245#action_12866245
 ] 

Shai Erera commented on LUCENE-2455:


bq. You mean addIndexes(Directory..)?

Yes, copy-paste error.

bq. Maybe... we could do this: only merge the incoming IndexReaders, 
appending a new segment to the end of the index?

I like it. IMO, that's what the method should do anyway, for better performance 
and service to the users. If I'm adding indexes, that doesn't mean I want a 
whole merge process to kick off. If I want that, I can call maybeMerge or 
optimize afterwards.

Basically, what I would like to add (and I'm not sure it belongs to this issue) 
is a super fast addIndexes method, something like registerIndexes, which 
doesn't even traverse the posting lists, remove deleted docs, etc. - it simply 
registers the new segments in the Directory. If needed, do a bulk copy of 
the files and update segments*. Simple as that. Maybe it does fit in this 
issue, as part of the general house cleaning?

I will look more closely into supporting MP + MS w/ addIndexes(readers). Can't 
promise anything as I learn the code as I go :).

[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866302#action_12866302
 ] 

Michael McCandless commented on LUCENE-2455:


I agree, addIndexes should be minimal in the work it does...

But bulk copy of the files isn't really possible for addIndexes(IR...) in 
general, since the readers can be arbitrary (eg FilterIndexReader).

[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

2010-05-11 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866444#action_12866444
 ] 

Shai Erera commented on LUCENE-2455:


Ok. But since addIndexes(IR) is for IR extensions only, I think the number of 
people that will be limited by it is very low.

But, why wouldn't they be able to use the Directory... version of the method? 
Since it's a bulk copy, we don't need IR methods. Maybe just call dir.copyTo or 
something of that sort? The method will only be asked to copy files (in case 
they exist elsewhere). I was thinking of introducing just a Directory version of 
such a method.

Basically, if you use NoMP and call addIndexesNoOptimize today, you get half of 
what I want, as only resolveExternals will be called. What I want is for the 
resolveExternals to be even faster, plain and shallow resolution.
