[jira] [Commented] (LUCENE-1421) Ability to group search results by field
[ https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034691#comment-13034691 ] Michael McCandless commented on LUCENE-1421: I adding grouping queries to the nightly benchmarks (http://people.apache.org/~mikemccand/lucenebench) -- see TermGroup100/10K/1M. The F annotation is the day grouping queries first ran. Those queries are the same queries running as TermQuery, just with grouping turned on on 3 randomly generated fields, with 100, 10,000 and 1 million unique values. So we can gauge the perf hit by comparing to TermQuery each night. I use the CachingCollector. First off, I'm impressed that the perf hit for grouping is not too bad: ||Query||QPS||Slowdown|| |TermQuery (baseline)|30.72|0| |TermGroup100|13.59|2.26| |TermQuery10K|13.2|2.34| |TermQuery1M|12.15|2.53| I had expected we'd pay a bigger perf hit! Second, there more unique groups you have, the slower grouping gets, but that multiplier really isn't so bad -- the 1M unique groups case is only 10.6% slower than the 100 unique groups case. Remember, though, that these groups are randomly generated full-unicode strings, so real data could very well produce different results... Third, and this is insanity, the addition of grouping caused other unexpected changes. Most horribly, SpanNearQuery slowed down by ~12.2% (http://people.apache.org/~mikemccand/lucenebench/SpanNear.html), while other queries seem to get a bit faster. I think this is [frustratingly!] due to hotspot making different decisions about which code to optimize/inline. Similarly strange, when I added sorting (TermQuery sorting by title and date/time, E annotation in all graphs), I saw the variance in the unsorted TermQuery performance drop substantially. I'm pretty sure this wide variance was due to hotspot's erratic decision making, but somehow the addition of sorting, while not change TermQuery's mean QPS, caused hotspot to at least be somewhat more consistent in how it compiled the code. Maybe as we add more and more diverse queries to the benchmark we'll see hotspot behave more reasonably Ability to group search results by field Key: LUCENE-1421 URL: https://issues.apache.org/jira/browse/LUCENE-1421 Project: Lucene - Java Issue Type: New Feature Components: core/search Reporter: Artyom Sokolov Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-1421.patch, LUCENE-1421.patch, lucene-grouping.patch It would be awesome to group search results by specified field. Some functionality was provided for Apache Solr but I think it should be done in Core Lucene. There could be some useful information like total hits about collapsed data like total count and so on. Thanks, Artyom -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1421) Ability to group search results by field
[ https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034714#comment-13034714 ] Martijn van Groningen commented on LUCENE-1421: --- bq. I adding grouping queries to the nightly benchmarks Nice! Are the regular sort and group sort different in these test cases? Do think when new features are added that these also need be added to this test suite? Or is this perfomance test suite just for the basic features? Ability to group search results by field Key: LUCENE-1421 URL: https://issues.apache.org/jira/browse/LUCENE-1421 Project: Lucene - Java Issue Type: New Feature Components: core/search Reporter: Artyom Sokolov Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-1421.patch, LUCENE-1421.patch, lucene-grouping.patch It would be awesome to group search results by specified field. Some functionality was provided for Apache Solr but I think it should be done in Core Lucene. There could be some useful information like total hits about collapsed data like total count and so on. Thanks, Artyom -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1421) Ability to group search results by field
[ https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034828#comment-13034828 ] Michael McCandless commented on LUCENE-1421: I'm only testing groupSort and sort by relevance now in the nightly bench. I'll add sort-by-title, groupSort-by-relevance cases too, so we test that. Hmm, though: this content set is alphabetized by title I believe, so it's not really a good test. (I suspect that's why the TermQuery sorting by title is faster bq. Do think when new features are added that these also need be added to this test suite? Or is this perfomance test suite just for the basic features? Well, in general I'd love to have wider coverage in the nightly perf test... really it's only a start now. But there's no hard rule we have to add new functions into the nightly bench... Ability to group search results by field Key: LUCENE-1421 URL: https://issues.apache.org/jira/browse/LUCENE-1421 Project: Lucene - Java Issue Type: New Feature Components: core/search Reporter: Artyom Sokolov Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-1421.patch, LUCENE-1421.patch, lucene-grouping.patch It would be awesome to group search results by specified field. Some functionality was provided for Apache Solr but I think it should be done in Core Lucene. There could be some useful information like total hits about collapsed data like total count and so on. Thanks, Artyom -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1421) Ability to group search results by field
[ https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033506#comment-13033506 ] Martijn van Groningen commented on LUCENE-1421: --- Michael I see you have committed it to the trunk. Nice work! Only one quest why is the SearchGroup class now package protected? For me the documentation in overview.html suggest that I can just use it in any package. As for porting this code to the 3x branch I see that this branch doesn't have modules. Does it mean that it will be a Lucene contrib? Ability to group search results by field Key: LUCENE-1421 URL: https://issues.apache.org/jira/browse/LUCENE-1421 Project: Lucene - Java Issue Type: New Feature Components: Search Reporter: Artyom Sokolov Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-1421.patch, LUCENE-1421.patch, lucene-grouping.patch It would be awesome to group search results by specified field. Some functionality was provided for Apache Solr but I think it should be done in Core Lucene. There could be some useful information like total hits about collapsed data like total count and so on. Thanks, Artyom -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1421) Ability to group search results by field
[ https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13032350#comment-13032350 ] Martijn van Groningen commented on LUCENE-1421: --- bq. But what if docid=2 had age=17 instead? How would we determine what value the group (for hgid=1) should have for the age field? That would depend on the group sort, right? If your group sort is age asc the lowest document would be chosen. bq. Or... would the group count +1 to age=10 and +1 to age=17 in that case? You mean like a sum of all ages per group? That is interesting, but sounds more like a function to me. This can be computed with a separated group collector. Wouldn't make sense to me, to have this with a regular field facet. Ability to group search results by field Key: LUCENE-1421 URL: https://issues.apache.org/jira/browse/LUCENE-1421 Project: Lucene - Java Issue Type: New Feature Components: Search Reporter: Artyom Sokolov Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-1421.patch, LUCENE-1421.patch, lucene-grouping.patch It would be awesome to group search results by specified field. Some functionality was provided for Apache Solr but I think it should be done in Core Lucene. There could be some useful information like total hits about collapsed data like total count and so on. Thanks, Artyom -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1421) Ability to group search results by field
[ https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13032380#comment-13032380 ] Michael McCandless commented on LUCENE-1421: {quote} bq. But what if docid=2 had age=17 instead? How would we determine what value the group (for hgid=1) should have for the age field? That would depend on the group sort, right? If your group sort is age asc the lowest document would be chosen. {quote} OK, I see. So the group is represented by the doc within it that sorts highest according to the group sort, and any faceting on the groups means faceting on that top doc's values, per group. Neat. {quote} bq. Or... would the group count +1 to age=10 and +1 to age=17 in that case? You mean like a sum of all ages per group? That is interesting, but sounds more like a function to me. This can be computed with a separated group collector. Wouldn't make sense to me, to have this with a regular field facet. {quote} Well, not sum, but multi-valued? (Ie, as if this group were represented by a doc that takes the union of all age values of docs within it). This way, if the user then does a drill-down by a specific age, the number of groups then returned would match the facet count of that age in the first query. I agree we need to hash out these semantics :) Bill could you open a separate Lucene issue, to work out the semantics impl for post-grouping-faceting? Unfortunately, it's blocked by factoring out the facet module (LUCENE-3079). Ability to group search results by field Key: LUCENE-1421 URL: https://issues.apache.org/jira/browse/LUCENE-1421 Project: Lucene - Java Issue Type: New Feature Components: Search Reporter: Artyom Sokolov Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-1421.patch, LUCENE-1421.patch, lucene-grouping.patch It would be awesome to group search results by specified field. Some functionality was provided for Apache Solr but I think it should be done in Core Lucene. There could be some useful information like total hits about collapsed data like total count and so on. Thanks, Artyom -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1421) Ability to group search results by field
[ https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13032082#comment-13032082 ] Martijn van Groningen commented on LUCENE-1421: --- Nice work Michael! I also think that the two pass mechanism is definitely the preferred way to go. I think we also need a strategy mechanism (or at least an GroupCollector class hierarchy) inside this module. The mechanism should select the right group collector(s) for a certain request. Some users maybe only care about the top group document, so I second pass won't be necessary. Another example with faceting in mind. When group based faceting is necessary. The top N groups don't suffice. You'll need all group docs (I currently don't see a other way). These groups docs are then used to create a grouped Solr DocSet. But this should be a completely different implementation. Ability to group search results by field Key: LUCENE-1421 URL: https://issues.apache.org/jira/browse/LUCENE-1421 Project: Lucene - Java Issue Type: New Feature Components: Search Reporter: Artyom Sokolov Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-1421.patch, lucene-grouping.patch It would be awesome to group search results by specified field. Some functionality was provided for Apache Solr but I think it should be done in Core Lucene. There could be some useful information like total hits about collapsed data like total count and so on. Thanks, Artyom -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1421) Ability to group search results by field
[ https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13032145#comment-13032145 ] Michael McCandless commented on LUCENE-1421: {quote} I think we also need a strategy mechanism (or at least an GroupCollector class hierarchy) inside this module. The mechanism should select the right group collector(s) for a certain request. Some users maybe only care about the top group document, so I second pass won't be necessary. Another example with faceting in mind. When group based faceting is necessary. The top N groups don't suffice. You'll need all group docs (I currently don't see a other way). These groups docs are then used to create a grouped Solr DocSet. But this should be a completely different implementation. {quote} I agree, there's much more we could do here! Specialized collection for the maxDocsPerGroup=1 case, and for the I want all groups case, would be nice. For the not many unique values in the group field case we could do a single-pass collector, I think. Grouping by a multi-valued field should be possible (we now have DocTermOrds in Lucene, but it doesn't load the term byte[] data), as well as support for sharding, ie, by merging top groups and docs w/in each group (but I think we need an addition to FieldComparator API for this). I think we should commit this starting point, today, and then iterate from there... Martijn, thank you for persisting for so long on SOLR-236! We are finally getting grouping functionality accessible from Lucene and Solr... Ability to group search results by field Key: LUCENE-1421 URL: https://issues.apache.org/jira/browse/LUCENE-1421 Project: Lucene - Java Issue Type: New Feature Components: Search Reporter: Artyom Sokolov Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-1421.patch, LUCENE-1421.patch, lucene-grouping.patch It would be awesome to group search results by specified field. Some functionality was provided for Apache Solr but I think it should be done in Core Lucene. There could be some useful information like total hits about collapsed data like total count and so on. Thanks, Artyom -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1421) Ability to group search results by field
[ https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13032238#comment-13032238 ] Bill Bell commented on LUCENE-1421: --- Say we have 4 documents: docid=1 hgid=1 age=10 docid=2 hgid=1 age=10 docid=3 hgid=2 age=12 docid=4 hgid=4 age=11 If we group by hgid, we would get: hgid=1 docid=1 hgid=1 age=10 docid=2 hgid=1 age=10 hgid=3 docid=3 hgid=2 age=12 hgid=4 docid=4 hgid=4 age=11 If I set Facet Counts = POST age: 10 (1 document) age: 11 (1 document) age: 12 (1 document) If I set Facet Counts = PRE age: 10 (2 document) age: 11 (1 document) age: 12 (1 document) The only way grouping works in Solr now is Facet Counts = PRE. Thanks. Ability to group search results by field Key: LUCENE-1421 URL: https://issues.apache.org/jira/browse/LUCENE-1421 Project: Lucene - Java Issue Type: New Feature Components: Search Reporter: Artyom Sokolov Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-1421.patch, LUCENE-1421.patch, lucene-grouping.patch It would be awesome to group search results by specified field. Some functionality was provided for Apache Solr but I think it should be done in Core Lucene. There could be some useful information like total hits about collapsed data like total count and so on. Thanks, Artyom -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1421) Ability to group search results by field
[ https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13031266#comment-13031266 ] Michael McCandless commented on LUCENE-1421: bq. I think that grouping code should be part of Lucene instead of Solr. +1 This is a very popular issue (currently tied for 2nd place in votes). Unfortunately, I think the single-pass collector attached here doesn't scale very well to large maxDoc and/or large number of unique groups. Also, it pulls a DocTermsIndex on the top-level reader (costly in an NRT/reopen setting since it's not per-segment). So I decided to factor out parts of Solr's current two-pass approach into a shared grouping module. The downside of the two-pass approach is you run the query twice, automatically halving your QPS. (It's even worse because the grouping itself is somewhat computing intensive too). To try to help mitigate this, I also added a new CachingCollector, which just holds hits (docID and optionally score) up to a max allowed RAM consumption, and can then replay them for the 2nd pass. In includes a max RAM setting so that if too many hits are found, it stops caching (and you must then re-execute the query). But one nice side effect of the two-phased approach is that sharding is in theory straightforward (I think?). Ie, all shards would do the first phase, concurrently, to get the top N groups. Then you merge-sort the resulting top groups, then run second phase (finding docs w/in the top groups) on all shards, then merge results from the same group across all shards. Ability to group search results by field Key: LUCENE-1421 URL: https://issues.apache.org/jira/browse/LUCENE-1421 Project: Lucene - Java Issue Type: New Feature Components: Search Reporter: Artyom Sokolov Priority: Minor Attachments: lucene-grouping.patch It would be awesome to group search results by specified field. Some functionality was provided for Apache Solr but I think it should be done in Core Lucene. There could be some useful information like total hits about collapsed data like total count and so on. Thanks, Artyom -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1421) Ability to group search results by field
[ https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13031537#comment-13031537 ] Bill Bell commented on LUCENE-1421: --- The issue I have with the group=true feature in Solr is that the facets are not calculated post grouping. So I cannot show the (count) in the facet list for a field. If we can get the facets to return counts POST grouping that would be ideal. Bill Ability to group search results by field Key: LUCENE-1421 URL: https://issues.apache.org/jira/browse/LUCENE-1421 Project: Lucene - Java Issue Type: New Feature Components: Search Reporter: Artyom Sokolov Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-1421.patch, lucene-grouping.patch It would be awesome to group search results by specified field. Some functionality was provided for Apache Solr but I think it should be done in Core Lucene. There could be some useful information like total hits about collapsed data like total count and so on. Thanks, Artyom -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org