[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035637#comment-13035637 ] Michael McCandless commented on LUCENE-3102: Thanks Shai -- this is awesome progress! > Few issues with CachingCollector > > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: core/search >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, > LUCENE-3102-nowrap.patch, LUCENE-3102.patch, LUCENE-3102.patch, > LUCENE-3102.patch > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the > document IDs cached may be out-of-order (depends on the Query) and thus > replay(Collector) will forward document IDs out-of-order to a Collector that > may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine > if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be > relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the > maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want > to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we > need that if CachingCollector ctor already takes a boolean "cacheScores"? I > think it's better defined explicitly than implicitly? > * Let's introduce a factory method for creating a specialized version if > scoring is requested / not (i.e., impl the TODO in line 189) > * I think it's a useful collector, which stands on its own and not specific > to grouping. Can we move it to core? > * How about using OpenBitSet instead of int[] for doc IDs? > ** If the number of hits is big, we'd gain some RAM back, and be able to > cache more entries > ** NOTE: OpenBitSet can only be used for in-order collection only. So we can > use that if the wrapped Collector does not support out-of-order > * Do you think we can modify this Collector to not necessarily wrap another > Collector? We have such Collector which stores (in-memory) all matching doc > IDs + scores (if required). Those are later fed into several processes that > operate on them (e.g. fetch more info from the index etc.). I am thinking, we > can make CachingCollector *optionally* wrap another Collector and then > someone can reuse it by setting RAM limit to unlimited (we should have a > constant for that) in order to simply collect all matching docs + scores. > * I think a set of dedicated unit tests for this class alone would be good. > That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035344#comment-13035344 ] Shai Erera commented on LUCENE-3102: bq. The committed CHANGES has typo (reply should be replay). Thanks, will include it in the next commit. bq. I'd rather have it cutover at some point This can only be done if out-of-order collection wasn't done so far, because otherwise, cutting to OBS will take cached doc IDs and scores out of sync. bq. we make a bit set impl that does this all under the hood (uses int[] when there are few docs, and "ugprades" to OBS once there are "enough" to justify it...) That's a good idea. I think we should leave the OBS stuff for another issue. See first how this performs and optimize only if needed. I'll take a look at TestGrouping. > Few issues with CachingCollector > > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: core/search >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, > LUCENE-3102.patch, LUCENE-3102.patch > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the > document IDs cached may be out-of-order (depends on the Query) and thus > replay(Collector) will forward document IDs out-of-order to a Collector that > may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine > if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be > relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the > maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want > to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we > need that if CachingCollector ctor already takes a boolean "cacheScores"? I > think it's better defined explicitly than implicitly? > * Let's introduce a factory method for creating a specialized version if > scoring is requested / not (i.e., impl the TODO in line 189) > * I think it's a useful collector, which stands on its own and not specific > to grouping. Can we move it to core? > * How about using OpenBitSet instead of int[] for doc IDs? > ** If the number of hits is big, we'd gain some RAM back, and be able to > cache more entries > ** NOTE: OpenBitSet can only be used for in-order collection only. So we can > use that if the wrapped Collector does not support out-of-order > * Do you think we can modify this Collector to not necessarily wrap another > Collector? We have such Collector which stores (in-memory) all matching doc > IDs + scores (if required). Those are later fed into several processes that > operate on them (e.g. fetch more info from the index etc.). I am thinking, we > can make CachingCollector *optionally* wrap another Collector and then > someone can reuse it by setting RAM limit to unlimited (we should have a > constant for that) in order to simply collect all matching docs + scores. > * I think a set of dedicated unit tests for this class alone would be good. > That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035285#comment-13035285 ] Michael McCandless commented on LUCENE-3102: Patch to allow no wrapped collector looks good! I wonder/hope hotspot can realize those method calls are no-ops... Maybe change TestGrouping to randomly use this ctor? Ie, randomly, you can use caching collector (not wrapped), then call its replay method twice (once against 1st pass, then against 2nd pass, collectors), and then assert results like normal. This is also a good verification that replay works twice... On the OBS, it makes me nervous to just always do this; I'd rather have it cutover at some point? Or perhaps it's an expert optional arg to create, whether it should back w/ OBS vs int[]? Or, ideally... we make a bit set impl that does this all under the hood (uses int[] when there are few docs, and "ugprades" to OBS once there are "enough" to justify it...), then we can just use that bit set here. > Few issues with CachingCollector > > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: core/search >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, > LUCENE-3102.patch, LUCENE-3102.patch > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the > document IDs cached may be out-of-order (depends on the Query) and thus > replay(Collector) will forward document IDs out-of-order to a Collector that > may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine > if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be > relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the > maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want > to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we > need that if CachingCollector ctor already takes a boolean "cacheScores"? I > think it's better defined explicitly than implicitly? > * Let's introduce a factory method for creating a specialized version if > scoring is requested / not (i.e., impl the TODO in line 189) > * I think it's a useful collector, which stands on its own and not specific > to grouping. Can we move it to core? > * How about using OpenBitSet instead of int[] for doc IDs? > ** If the number of hits is big, we'd gain some RAM back, and be able to > cache more entries > ** NOTE: OpenBitSet can only be used for in-order collection only. So we can > use that if the wrapped Collector does not support out-of-order > * Do you think we can modify this Collector to not necessarily wrap another > Collector? We have such Collector which stores (in-memory) all matching doc > IDs + scores (if required). Those are later fed into several processes that > operate on them (e.g. fetch more info from the index etc.). I am thinking, we > can make CachingCollector *optionally* wrap another Collector and then > someone can reuse it by setting RAM limit to unlimited (we should have a > constant for that) in order to simply collect all matching docs + scores. > * I think a set of dedicated unit tests for this class alone would be good. > That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035283#comment-13035283 ] Michael McCandless commented on LUCENE-3102: The committed CHANGES has typo (reply should be replay). > Few issues with CachingCollector > > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: core/search >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, > LUCENE-3102.patch, LUCENE-3102.patch > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the > document IDs cached may be out-of-order (depends on the Query) and thus > replay(Collector) will forward document IDs out-of-order to a Collector that > may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine > if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be > relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the > maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want > to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we > need that if CachingCollector ctor already takes a boolean "cacheScores"? I > think it's better defined explicitly than implicitly? > * Let's introduce a factory method for creating a specialized version if > scoring is requested / not (i.e., impl the TODO in line 189) > * I think it's a useful collector, which stands on its own and not specific > to grouping. Can we move it to core? > * How about using OpenBitSet instead of int[] for doc IDs? > ** If the number of hits is big, we'd gain some RAM back, and be able to > cache more entries > ** NOTE: OpenBitSet can only be used for in-order collection only. So we can > use that if the wrapped Collector does not support out-of-order > * Do you think we can modify this Collector to not necessarily wrap another > Collector? We have such Collector which stores (in-memory) all matching doc > IDs + scores (if required). Those are later fed into several processes that > operate on them (e.g. fetch more info from the index etc.). I am thinking, we > can make CachingCollector *optionally* wrap another Collector and then > someone can reuse it by setting RAM limit to unlimited (we should have a > constant for that) in order to simply collect all matching docs + scores. > * I think a set of dedicated unit tests for this class alone would be good. > That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035223#comment-13035223 ] Shai Erera commented on LUCENE-3102: There are two things left to do: (1) Use bit set instead of int[] for docIDs. If we do this, then it means the Collector cannot support out-of-order collections (which is not a big deal IMO). It also means for large indexes, we might consume more RAM than int[]. (2) Allow this Collector to stand on its own, w/o necessarily wrapping another Collector. There are several ways we can achieve that: * Take a 'null' Collector and check other != null. Adds an 'if' but not a big deal IMO. Also, acceptDocsOutOfOrder will have to either return false (or true), or we take that as a parameter. * Take a 'null' Collector and set this.other to a private static instance of a NoOpCollector. We'll still be delegating calls to it, but hopefully it won't be expensive. Same issue w/ out-of-order * Create two specialized variants of CachingCollector. Personally I'm not too much in favor of the last option - too much code dup for not much gain. The option I like the most is the 2nd (introducing a NoOpCollector). We can even introduce it as a public static member of CachingCollector and let users decide if they want to use it or not. For ease of use, we can still allow 'null' to be passed to create(). What do you think? > Few issues with CachingCollector > > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: core/search >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3102-factory.patch, LUCENE-3102.patch, > LUCENE-3102.patch > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the > document IDs cached may be out-of-order (depends on the Query) and thus > replay(Collector) will forward document IDs out-of-order to a Collector that > may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine > if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be > relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the > maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want > to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we > need that if CachingCollector ctor already takes a boolean "cacheScores"? I > think it's better defined explicitly than implicitly? > * Let's introduce a factory method for creating a specialized version if > scoring is requested / not (i.e., impl the TODO in line 189) > * I think it's a useful collector, which stands on its own and not specific > to grouping. Can we move it to core? > * How about using OpenBitSet instead of int[] for doc IDs? > ** If the number of hits is big, we'd gain some RAM back, and be able to > cache more entries > ** NOTE: OpenBitSet can only be used for in-order collection only. So we can > use that if the wrapped Collector does not support out-of-order > * Do you think we can modify this Collector to not necessarily wrap another > Collector? We have such Collector which stores (in-memory) all matching doc > IDs + scores (if required). Those are later fed into several processes that > operate on them (e.g. fetch more info from the index etc.). I am thinking, we > can make CachingCollector *optionally* wrap another Collector and then > someone can reuse it by setting RAM limit to unlimited (we should have a > constant for that) in order to simply collect all matching docs + scores. > * I think a set of dedicated unit tests for this class alone would be good. > That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035188#comment-13035188 ] Shai Erera commented on LUCENE-3102: Committed revision 1104680 (3x). Committed revision 1104683 (trunk). > Few issues with CachingCollector > > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: core/search >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3102-factory.patch, LUCENE-3102.patch, > LUCENE-3102.patch > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the > document IDs cached may be out-of-order (depends on the Query) and thus > replay(Collector) will forward document IDs out-of-order to a Collector that > may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine > if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be > relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the > maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want > to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we > need that if CachingCollector ctor already takes a boolean "cacheScores"? I > think it's better defined explicitly than implicitly? > * Let's introduce a factory method for creating a specialized version if > scoring is requested / not (i.e., impl the TODO in line 189) > * I think it's a useful collector, which stands on its own and not specific > to grouping. Can we move it to core? > * How about using OpenBitSet instead of int[] for doc IDs? > ** If the number of hits is big, we'd gain some RAM back, and be able to > cache more entries > ** NOTE: OpenBitSet can only be used for in-order collection only. So we can > use that if the wrapped Collector does not support out-of-order > * Do you think we can modify this Collector to not necessarily wrap another > Collector? We have such Collector which stores (in-memory) all matching doc > IDs + scores (if required). Those are later fed into several processes that > operate on them (e.g. fetch more info from the index etc.). I am thinking, we > can make CachingCollector *optionally* wrap another Collector and then > someone can reuse it by setting RAM limit to unlimited (we should have a > constant for that) in order to simply collect all matching docs + scores. > * I think a set of dedicated unit tests for this class alone would be good. > That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034841#comment-13034841 ] Michael McCandless commented on LUCENE-3102: Patch looks great! But, can we rename curupto -> curUpto (and same for curbase)? Ie, so it matches the other camelCaseVariables we have here... Thank you! > Few issues with CachingCollector > > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: core/search >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3102-factory.patch, LUCENE-3102.patch, > LUCENE-3102.patch > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the > document IDs cached may be out-of-order (depends on the Query) and thus > replay(Collector) will forward document IDs out-of-order to a Collector that > may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine > if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be > relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the > maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want > to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we > need that if CachingCollector ctor already takes a boolean "cacheScores"? I > think it's better defined explicitly than implicitly? > * Let's introduce a factory method for creating a specialized version if > scoring is requested / not (i.e., impl the TODO in line 189) > * I think it's a useful collector, which stands on its own and not specific > to grouping. Can we move it to core? > * How about using OpenBitSet instead of int[] for doc IDs? > ** If the number of hits is big, we'd gain some RAM back, and be able to > cache more entries > ** NOTE: OpenBitSet can only be used for in-order collection only. So we can > use that if the wrapped Collector does not support out-of-order > * Do you think we can modify this Collector to not necessarily wrap another > Collector? We have such Collector which stores (in-memory) all matching doc > IDs + scores (if required). Those are later fed into several processes that > operate on them (e.g. fetch more info from the index etc.). I am thinking, we > can make CachingCollector *optionally* wrap another Collector and then > someone can reuse it by setting RAM limit to unlimited (we should have a > constant for that) in order to simply collect all matching docs + scores. > * I think a set of dedicated unit tests for this class alone would be good. > That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034283#comment-13034283 ] Shai Erera commented on LUCENE-3102: Committed revision 1103870 (3x). Committed revision 1103872 (trunk). What's committed: * Move CachingCollector to core * Fix bugs * Add TestCachingCollector * Some refactoring Moving on to next proposed changes. > Few issues with CachingCollector > > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: modules/grouping >Reporter: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3102.patch, LUCENE-3102.patch > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the > document IDs cached may be out-of-order (depends on the Query) and thus > replay(Collector) will forward document IDs out-of-order to a Collector that > may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine > if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be > relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the > maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want > to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we > need that if CachingCollector ctor already takes a boolean "cacheScores"? I > think it's better defined explicitly than implicitly? > * Let's introduce a factory method for creating a specialized version if > scoring is requested / not (i.e., impl the TODO in line 189) > * I think it's a useful collector, which stands on its own and not specific > to grouping. Can we move it to core? > * How about using OpenBitSet instead of int[] for doc IDs? > ** If the number of hits is big, we'd gain some RAM back, and be able to > cache more entries > ** NOTE: OpenBitSet can only be used for in-order collection only. So we can > use that if the wrapped Collector does not support out-of-order > * Do you think we can modify this Collector to not necessarily wrap another > Collector? We have such Collector which stores (in-memory) all matching doc > IDs + scores (if required). Those are later fed into several processes that > operate on them (e.g. fetch more info from the index etc.). I am thinking, we > can make CachingCollector *optionally* wrap another Collector and then > someone can reuse it by setting RAM limit to unlimited (we should have a > constant for that) in order to simply collect all matching docs + scores. > * I think a set of dedicated unit tests for this class alone would be good. > That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034091#comment-13034091 ] Michael McCandless commented on LUCENE-3102: Patch looks great Shai -- +1 to commit!! Yes that is very sneaky about the private fields in inner/outer classes -- it's good you added a comment explaining it! > Few issues with CachingCollector > > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Reporter: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3102.patch, LUCENE-3102.patch > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the > document IDs cached may be out-of-order (depends on the Query) and thus > replay(Collector) will forward document IDs out-of-order to a Collector that > may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine > if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be > relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the > maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want > to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we > need that if CachingCollector ctor already takes a boolean "cacheScores"? I > think it's better defined explicitly than implicitly? > * Let's introduce a factory method for creating a specialized version if > scoring is requested / not (i.e., impl the TODO in line 189) > * I think it's a useful collector, which stands on its own and not specific > to grouping. Can we move it to core? > * How about using OpenBitSet instead of int[] for doc IDs? > ** If the number of hits is big, we'd gain some RAM back, and be able to > cache more entries > ** NOTE: OpenBitSet can only be used for in-order collection only. So we can > use that if the wrapped Collector does not support out-of-order > * Do you think we can modify this Collector to not necessarily wrap another > Collector? We have such Collector which stores (in-memory) all matching doc > IDs + scores (if required). Those are later fed into several processes that > operate on them (e.g. fetch more info from the index etc.). I am thinking, we > can make CachingCollector *optionally* wrap another Collector and then > someone can reuse it by setting RAM limit to unlimited (we should have a > constant for that) in order to simply collect all matching docs + scores. > * I think a set of dedicated unit tests for this class alone would be good. > That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033992#comment-13033992 ] Michael McCandless commented on LUCENE-3102: Patch looks awesome Shai! Only thing is: I would be careful about directly setting those private fields of the cachedScorer; I think (not sure) this incurs an "access" check on each assignment. Maybe make them package protected? Or use a setter? bq. Question – perhaps we can commit these changes incrementally? +1 - progress not perfection! These changes are great. {quote} Mike, there is another reason to separate Collector.needsScores() from cacheScores – it is possible someone will pass a Collector which needs scores, however won't want to have CachingCollector 'cache' them. In which case, the wrapped Collector should be delegated setScorer instead of cachedScorer. {quote} Ahh good point, because the 2nd pass collector may not need the scores. So on the first pass we'd have to forward the .score() request to the wrapped collector but not cache it. bq. I will leave Collector.needsScores() for a different issue though? +1 Thanks for iterating on this Shai! > Few issues with CachingCollector > > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Reporter: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3102.patch > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the > document IDs cached may be out-of-order (depends on the Query) and thus > replay(Collector) will forward document IDs out-of-order to a Collector that > may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine > if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be > relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the > maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want > to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we > need that if CachingCollector ctor already takes a boolean "cacheScores"? I > think it's better defined explicitly than implicitly? > * Let's introduce a factory method for creating a specialized version if > scoring is requested / not (i.e., impl the TODO in line 189) > * I think it's a useful collector, which stands on its own and not specific > to grouping. Can we move it to core? > * How about using OpenBitSet instead of int[] for doc IDs? > ** If the number of hits is big, we'd gain some RAM back, and be able to > cache more entries > ** NOTE: OpenBitSet can only be used for in-order collection only. So we can > use that if the wrapped Collector does not support out-of-order > * Do you think we can modify this Collector to not necessarily wrap another > Collector? We have such Collector which stores (in-memory) all matching doc > IDs + scores (if required). Those are later fed into several processes that > operate on them (e.g. fetch more info from the index etc.). I am thinking, we > can make CachingCollector *optionally* wrap another Collector and then > someone can reuse it by setting RAM limit to unlimited (we should have a > constant for that) in order to simply collect all matching docs + scores. > * I think a set of dedicated unit tests for this class alone would be good. > That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033788#comment-13033788 ] Shai Erera commented on LUCENE-3102: bq. I will start w/ "svn mv" this to core Or, we can iterate on all the changes here, then do the svn move as part of the commit. Both work for me. > Few issues with CachingCollector > > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Reporter: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the > document IDs cached may be out-of-order (depends on the Query) and thus > replay(Collector) will forward document IDs out-of-order to a Collector that > may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine > if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be > relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the > maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want > to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we > need that if CachingCollector ctor already takes a boolean "cacheScores"? I > think it's better defined explicitly than implicitly? > * Let's introduce a factory method for creating a specialized version if > scoring is requested / not (i.e., impl the TODO in line 189) > * I think it's a useful collector, which stands on its own and not specific > to grouping. Can we move it to core? > * How about using OpenBitSet instead of int[] for doc IDs? > ** If the number of hits is big, we'd gain some RAM back, and be able to > cache more entries > ** NOTE: OpenBitSet can only be used for in-order collection only. So we can > use that if the wrapped Collector does not support out-of-order > * Do you think we can modify this Collector to not necessarily wrap another > Collector? We have such Collector which stores (in-memory) all matching doc > IDs + scores (if required). Those are later fed into several processes that > operate on them (e.g. fetch more info from the index etc.). I am thinking, we > can make CachingCollector *optionally* wrap another Collector and then > someone can reuse it by setting RAM limit to unlimited (we should have a > constant for that) in order to simply collect all matching docs + scores. > * I think a set of dedicated unit tests for this class alone would be good. > That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033787#comment-13033787 ] Shai Erera commented on LUCENE-3102: bq. Hmm, I think it does clear cachedScores? (But not cachedSegs). Sorry, I meant curScores. bq. +1 (on move to core) I will start w/ "svn mv" this to core, so that later patches on this issue will be applied easily. Moving to core has nothing to do w/ resolving the other issues. bq. we could also "upgrade" to a bit set part way through, since it's so clear where the cutoff is you're right, but cutting off to OBS is dangerous, b/c by doing that we: # Suddenly halt search when we create and populate OBS # Lose the ability to support out-of-order docs (in fact, depending on the mode and how the query was executed so far, we might not even be able to do the cut-off at all). So I prefer that we make that decision up front, perhaps through another parameter to the factory method. bq. but eg you could mess up (pass cacheScores = false but then pass a collector that calls .score()) Oh I see, so this TODO is about the use of cachedScorer (vs. just delegating setScorer to other). I agree. BTW, this version of cachedScorer is very optimized and clean, but we do have ScoreCachingWrappingScorer which achieves the same goal, only w/ 1-2 more 'if'. Perhaps we should reuse it? But then again, for the purpose of this Collector, cachedScorer is the most optimized it can be. bq. ie, I don't want to make it easy to be unlimited? It seems too dangerous... I'd rather your code has to spell out 10*1024 so you realize you're saying 10 GB (for example). What if you run w/ 16GB Heap? ;) But ok, I don't mind, we can spell it out clearly in the jdocs. bq. Are you planning to work up a patch for these...? I think so. I'll try to squeeze it in my schedule in the next couple of days. If I see I don't get to it, I'll update the issue. > Few issues with CachingCollector > > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Reporter: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the > document IDs cached may be out-of-order (depends on the Query) and thus > replay(Collector) will forward document IDs out-of-order to a Collector that > may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine > if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be > relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the > maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want > to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we > need that if CachingCollector ctor already takes a boolean "cacheScores"? I > think it's better defined explicitly than implicitly? > * Let's introduce a factory method for creating a specialized version if > scoring is requested / not (i.e., impl the TODO in line 189) > * I think it's a useful collector, which stands on its own and not specific > to grouping. Can we move it to core? > * How about using OpenBitSet instead of int[] for doc IDs? > ** If the number of hits is big, we'd gain some RAM back, and be able to > cache more entries > ** NOTE: OpenBitSet can only be used for in-order collection only. So we can > use that if the wrapped Collector does not support out-of-order > * Do you think we can modify this Collector to not necessarily wrap another > Collector? We have such Collector which stores (in-memory) all matching doc > IDs + scores (if required). Those are later fed into several processes that > operate on them (e.g. fetch more info from the index etc.). I am thinking, we > can make CachingCollector *optionally* wrap another Collector and then > someone can reuse it by setting RAM limit to unlimited (we should have a > constant for that) in order to simply collect all matching docs + scores. > * I think a set of dedicated unit tests for this class alone would be good. > That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033784#comment-13033784 ] Michael McCandless commented on LUCENE-3102: Great catches Shai -- thanks for the thorough review! bq. Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it. Ahh... maybe we throw IllegalArgExc if the replay'd collector requires in-order but the first pass collector didn't? bq. It does not clear cachedScores + cachedSegs upon exceeding RAM limits Hmm, I think it does clear cachedScores? (But not cachedSegs). bq. I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity That sounds great! bq. This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them? Oh you mean for the last "chunk" we could alloc right up to the limit? Good! bq. The TODO in line 64 (having Collector specify needsScores()) – why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly? Yes, I think we should keep the explicit boolean (cacheScores), but eg you could mess up (pass cacheScores = false but then pass a collector that calls .score()) -- that's why I added to TODO. Ie, it'd be nice if we could "verify" that the collector agrees we don't need scores. I think there were other places in Lucene where knowing this up front could help us... can't remember the details. bq. Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189) +1 bq. I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core? +1 {quote} How about using OpenBitSet instead of int[] for doc IDs? If the number of hits is big, we'd gain some RAM back, and be able to cache more entries NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order {quote} Hmm but if the number of hits is small we spend un-needed RAM/CPU, but, then that tradeoff is maybe OK? I'm just worried about indices w/ lots of docs... we could also "upgrade" to a bit set part way through, since it's so clear where the cutoff is. bq. Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector optionally wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores. I'd actually rather not have the constant -- ie, I don't want to make it easy to be unlimited? It seems too dangerous... I'd rather your code has to spell out 10*1024 so you realize you're saying 10 GB (for example). bq. I think a set of dedicated unit tests for this class alone would be good. +1 Awesome feedback!! Are you planning to work up a patch for these...? > Few issues with CachingCollector > > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Reporter: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the > document IDs cached may be out-of-order (depends on the Query) and thus > replay(Collector) will forward document IDs out-of-order to a Collector that > may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine > if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be > relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the > maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want > to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we > need that if CachingCollector ctor already takes a boolean "cacheScores"