[jira] [Commented] (LUCENE-1421) Ability to group search results by field

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034691#comment-13034691
 ] 

Michael McCandless commented on LUCENE-1421:


I added grouping queries to the nightly benchmarks
(http://people.apache.org/~mikemccand/lucenebench) -- see
TermGroup100/10K/1M.  The F annotation is the day grouping queries
first ran.

Those queries are the same queries running as TermQuery, just with
grouping turned on on 3 randomly generated fields, with 100, 10,000
and 1 million unique values.  So we can gauge the perf hit by
comparing to TermQuery each night.

I use the CachingCollector.

First off, I'm impressed that the perf hit for grouping is not too
bad:

||Query||QPS||Slowdown||
|TermQuery (baseline)|30.72|0|
|TermGroup100|13.59|2.26|
|TermGroup10K|13.2|2.34|
|TermGroup1M|12.15|2.53|

I had expected we'd pay a bigger perf hit!

Second, the more unique groups you have, the slower grouping gets,
but that multiplier really isn't so bad -- the 1M unique groups case
is only 10.6% slower than the 100 unique groups case.
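
For anyone who wants to re-derive these figures, here is a small sketch. It assumes the Slowdown column means baseline QPS divided by grouped QPS, and that the 10.6% figure is the relative QPS drop from the 100-group case to the 1M-group case. The class and method names are made up for illustration and are not part of the benchmark code.

```java
/** Recomputes the slowdown figures from the QPS numbers quoted in the table. */
final class GroupingSlowdown {
    /** Slowdown factor: baseline QPS divided by the QPS of the slower query. */
    static double slowdownFactor(double baselineQps, double qps) {
        return baselineQps / qps;
    }

    /** Fractional QPS drop of slowerQps relative to fasterQps. */
    static double relativeDrop(double fasterQps, double slowerQps) {
        return (fasterQps - slowerQps) / fasterQps;
    }

    public static void main(String[] args) {
        double baseline = 30.72; // TermQuery QPS from the table
        String[] names = {"TermGroup100", "TermGroup10K", "TermGroup1M"};
        double[] qps = {13.59, 13.2, 12.15};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%s: %.2fx slower than TermQuery%n",
                    names[i], slowdownFactor(baseline, qps[i]));
        }
        System.out.printf("1M vs 100 groups: %.1f%% lower QPS%n",
                100 * relativeDrop(13.59, 12.15));
    }
}
```

Recomputed from the rounded QPS values, the factors come out at roughly 2.26, 2.33 and 2.53, and the 1M case at roughly 10.6% lower QPS than the 100 case. The small difference on the 10K row is presumably a rounding effect.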

Remember, though, that these groups are randomly generated
full-unicode strings, so real data could very well produce different
results...

Third, and this is insanity, the addition of grouping caused other
unexpected changes.  Most horribly, SpanNearQuery slowed down
by ~12.2%
(http://people.apache.org/~mikemccand/lucenebench/SpanNear.html),
while other queries seem to get a bit faster.  I think this is
[frustratingly!] due to hotspot making different decisions about which
code to optimize/inline.

Similarly strange, when I added sorting (TermQuery sorting by title
and date/time, E annotation in all graphs), I saw the variance in
the unsorted TermQuery performance drop substantially.  I'm pretty
sure this wide variance was due to hotspot's erratic decision making,
but somehow the addition of sorting, while not changing TermQuery's mean
QPS, caused hotspot to at least be somewhat more consistent in how it
compiled the code.  Maybe as we add more and more diverse queries to
the benchmark we'll see hotspot behave more reasonably.


 Ability to group search results by field
 

 Key: LUCENE-1421
 URL: https://issues.apache.org/jira/browse/LUCENE-1421
 Project: Lucene - Java
  Issue Type: New Feature
  Components: core/search
Reporter: Artyom Sokolov
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-1421.patch, LUCENE-1421.patch, 
 lucene-grouping.patch


 It would be awesome to group search results by specified field. Some 
 functionality was provided for Apache Solr but I think it should be done in 
 Core Lucene. There could be some useful information like total hits about 
 collapsed data like total count and so on.
 Thanks,
 Artyom

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-1421) Ability to group search results by field

2011-05-17 Thread Martijn van Groningen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034714#comment-13034714
 ] 

Martijn van Groningen commented on LUCENE-1421:
---

bq. I added grouping queries to the nightly benchmarks
Nice!

Are the regular sort and group sort different in these test cases?

Do you think that when new features are added, they also need to be added to this test 
suite? Or is this performance test suite just for the basic features?




[jira] [Commented] (LUCENE-1421) Ability to group search results by field

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034828#comment-13034828
 ] 

Michael McCandless commented on LUCENE-1421:


I'm only testing groupSort and sort by relevance now in the nightly bench.

I'll add sort-by-title, groupSort-by-relevance cases too, so we test that.  
Hmm, though: this content set is alphabetized by title I believe, so it's not 
really a good test.  (I suspect that's why the TermQuery sorting by title is 
faster 

bq. Do you think that when new features are added, they also need to be added to this 
test suite? Or is this performance test suite just for the basic features?

Well, in general I'd love to have wider coverage in the nightly perf test...  
really it's only a start now.  But there's no hard rule we have to add new 
functions into the nightly bench...




[jira] [Commented] (LUCENE-1421) Ability to group search results by field

2011-05-14 Thread Martijn van Groningen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033506#comment-13033506
 ] 

Martijn van Groningen commented on LUCENE-1421:
---

Michael, I see you have committed it to trunk. Nice work!
Just one question: why is the SearchGroup class now package protected? To me, the 
documentation in overview.html suggests that I can use it from any package.

As for porting this code to the 3x branch I see that this branch doesn't have 
modules. Does it mean that it will be a Lucene contrib?




[jira] [Commented] (LUCENE-1421) Ability to group search results by field

2011-05-12 Thread Martijn van Groningen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032350#comment-13032350
 ] 

Martijn van Groningen commented on LUCENE-1421:
---

bq. But what if docid=2 had age=17 instead? How would we determine what value 
the group (for hgid=1) should have for the age field?
That would depend on the group sort, right? If your group sort is age asc the 
lowest document would be chosen.

bq. Or... would the group count +1 to age=10 and +1 to age=17 in that case?
You mean like a sum of all ages per group? That is interesting, but sounds more 
like a function to me. This can be computed with a separate group collector. 
It wouldn't make sense to me to have this with a regular field facet. 




[jira] [Commented] (LUCENE-1421) Ability to group search results by field

2011-05-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032380#comment-13032380
 ] 

Michael McCandless commented on LUCENE-1421:


{quote}
bq. But what if docid=2 had age=17 instead? How would we determine what value 
the group (for hgid=1) should have for the age field?

That would depend on the group sort, right? If your group sort is age asc the 
lowest document would be chosen.
{quote}

OK, I see.  So the group is represented by the doc within it that
sorts highest according to the group sort, and any faceting on the
groups means faceting on that top doc's values, per group.  Neat.
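
A minimal sketch of that interpretation, in plain Java rather than the grouping module's API. The Doc class, the field names and the sample data (Bill's four docs, with docid=2 changed to age=17) are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TopDocFacets {
    /** Hypothetical document: docid, group key (hgid) and a single-valued age field. */
    static final class Doc {
        final int docid, hgid, age;
        Doc(int docid, int hgid, int age) { this.docid = docid; this.hgid = hgid; this.age = age; }
    }

    /** Group sort "age asc": the doc with the lowest age represents its group. */
    static final Comparator<Doc> BY_AGE_ASC = Comparator.comparingInt(d -> d.age);

    /** Picks, per group, the doc that sorts first under the given group sort. */
    static Map<Integer, Doc> representatives(List<Doc> hits, Comparator<Doc> groupSort) {
        Map<Integer, Doc> best = new LinkedHashMap<>();
        for (Doc d : hits) {
            Doc current = best.get(d.hgid);
            if (current == null || groupSort.compare(d, current) < 0) {
                best.put(d.hgid, d);
            }
        }
        return best;
    }

    /** Facet counts on age, counting each group once via its representative doc. */
    static Map<Integer, Integer> ageFacets(Map<Integer, Doc> reps) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (Doc d : reps.values()) {
            counts.merge(d.age, 1, Integer::sum);
        }
        return counts;
    }

    static List<Doc> sampleDocs() {
        return Arrays.asList(new Doc(1, 1, 10), new Doc(2, 1, 17), new Doc(3, 2, 12), new Doc(4, 4, 11));
    }

    public static void main(String[] args) {
        // With age asc, hgid=1 is represented by docid=1, so age=17 gets no count.
        System.out.println(ageFacets(representatives(sampleDocs(), BY_AGE_ASC)));
    }
}
```

With the group sort reversed (age desc), hgid=1 would be represented by docid=2 and the count would move from age=10 to age=17.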

{quote}
bq. Or... would the group count +1 to age=10 and +1 to age=17 in that case?

You mean like a sum of all ages per group? That is interesting, but sounds more 
like a function to me. This can be computed with a separate group collector. 
It wouldn't make sense to me to have this with a regular field facet.
{quote}

Well, not sum, but multi-valued?  (Ie, as if this group were
represented by a doc that takes the union of all age values of docs
within it).

This way, if the user then does a drill-down by a specific age, the
number of groups then returned would match the facet count of that age
in the first query.
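
Roughly, as a plain-Java sketch (not the grouping or facet module API). The docs are rows of {docid, hgid, age}, with docid=2 given age=17 as in the question; all names here are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class UnionFacets {
    /** Rows of {docid, hgid, age}; docid=2 has age=17 instead of 10. */
    static final int[][] DOCS = { {1, 1, 10}, {2, 1, 17}, {3, 2, 12}, {4, 4, 11} };

    /** Maps each group to the union of the age values found in its docs. */
    static Map<Integer, Set<Integer>> agesPerGroup(int[][] docs) {
        Map<Integer, Set<Integer>> result = new TreeMap<>();
        for (int[] d : docs) {
            result.computeIfAbsent(d[1], k -> new TreeSet<>()).add(d[2]);
        }
        return result;
    }

    /** Facet count per age = number of groups whose union contains that age. */
    static Map<Integer, Integer> facetCounts(int[][] docs) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (Set<Integer> ages : agesPerGroup(docs).values()) {
            for (int age : ages) {
                counts.merge(age, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Drill-down: number of groups returned when only docs with the given age match. */
    static int groupsAfterDrillDown(int[][] docs, int age) {
        Set<Integer> groups = new HashSet<>();
        for (int[] d : docs) {
            if (d[2] == age) {
                groups.add(d[1]);
            }
        }
        return groups.size();
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = facetCounts(DOCS);
        System.out.println(counts);
        // Each facet count should equal the number of groups the drill-down returns.
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            System.out.println("age=" + e.getKey() + " drill-down groups: "
                    + groupsAfterDrillDown(DOCS, e.getKey()));
        }
    }
}
```

Here hgid=1 counts once toward age=10 and once toward age=17, and drilling down on either age returns exactly one group.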

I agree we need to hash out these semantics :)

Bill, could you open a separate Lucene issue to work out the semantics
& impl for post-grouping faceting?  Unfortunately, it's blocked by
factoring out the facet module (LUCENE-3079).





[jira] [Commented] (LUCENE-1421) Ability to group search results by field

2011-05-11 Thread Martijn van Groningen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032082#comment-13032082
 ] 

Martijn van Groningen commented on LUCENE-1421:
---

Nice work Michael! I also think that the two pass mechanism is definitely the 
preferred way to go. 

I think we also need a strategy mechanism (or at least a GroupCollector class 
hierarchy) inside this module. The mechanism should select the right group 
collector(s) for a certain request. Some users may only care about the top 
group document, so a second pass won't be necessary. Another example, with 
faceting in mind: when group-based faceting is necessary, the top N groups 
don't suffice. You'll need all group docs (I currently don't see another way). 
These group docs are then used to create a grouped Solr DocSet. But this 
should be a completely different implementation. 




[jira] [Commented] (LUCENE-1421) Ability to group search results by field

2011-05-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032145#comment-13032145
 ] 

Michael McCandless commented on LUCENE-1421:


{quote}
I think we also need a strategy mechanism (or at least a GroupCollector class 
hierarchy) inside this module. The mechanism should select the right group 
collector(s) for a certain request. Some users may only care about the top 
group document, so a second pass won't be necessary. Another example, with 
faceting in mind: when group-based faceting is necessary, the top N groups 
don't suffice. You'll need all group docs (I currently don't see another way). 
These group docs are then used to create a grouped Solr DocSet. But this 
should be a completely different implementation.
{quote}

I agree, there's much more we could do here!  Specialized collection for the 
maxDocsPerGroup=1 case, and for the "I want all groups" case, would be nice.  
For the "not many unique values in the group field" case we could do a 
single-pass collector, I think.

Grouping by a multi-valued field should be possible (we now have DocTermOrds in 
Lucene, but it doesn't load the term byte[] data), as well as support for 
sharding, ie, by merging top groups and docs w/in each group (but I think we 
need an addition to FieldComparator API for this).

I think we should commit this starting point, today, and then iterate from 
there...

Martijn, thank you for persisting for so long on SOLR-236!  We are
finally getting grouping functionality accessible from Lucene and
Solr...





[jira] [Commented] (LUCENE-1421) Ability to group search results by field

2011-05-11 Thread Bill Bell (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032238#comment-13032238
 ] 

Bill Bell commented on LUCENE-1421:
---

Say we have 4 documents:

docid=1
hgid=1
age=10

docid=2
hgid=1
age=10

docid=3
hgid=2
age=12

docid=4
hgid=4
age=11

If we group by hgid, we would get:

hgid=1
  docid=1
    hgid=1
    age=10
  docid=2
    hgid=1
    age=10

hgid=2
  docid=3
    hgid=2
    age=12

hgid=4
  docid=4
    hgid=4
    age=11

If I set Facet Counts = POST

age: 10 (1 document)
age: 11 (1 document)
age: 12 (1 document)

If I set Facet Counts = PRE

age: 10 (2 documents)
age: 11 (1 document)
age: 12 (1 document)

The only way grouping works in Solr now is Facet Counts = PRE.
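
The two counting modes can be sketched in plain Java as follows. This is only an illustration of the example above, not Solr's implementation: the class and method names are made up, and POST is assumed to mean that each (group, age) pair counts once. For this data every doc in a group has the same age, so other POST semantics would give the same numbers.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class PrePostFacets {
    /** The four documents from the example, as rows of {docid, hgid, age}. */
    static final int[][] DOCS = { {1, 1, 10}, {2, 1, 10}, {3, 2, 12}, {4, 4, 11} };

    /** PRE: every matching document counts toward its age value. */
    static Map<Integer, Integer> preGroupingCounts(int[][] docs) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int[] d : docs) {
            counts.merge(d[2], 1, Integer::sum);
        }
        return counts;
    }

    /** POST (assumed semantics): each (group, age) pair counts once, however many docs share it. */
    static Map<Integer, Integer> postGroupingCounts(int[][] docs) {
        Map<Integer, Integer> counts = new TreeMap<>();
        Set<String> seen = new HashSet<>();
        for (int[] d : docs) {
            if (seen.add(d[1] + ":" + d[2])) {
                counts.merge(d[2], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("PRE:  " + preGroupingCounts(DOCS));  // {10=2, 11=1, 12=1}
        System.out.println("POST: " + postGroupingCounts(DOCS)); // {10=1, 11=1, 12=1}
    }
}
```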

Thanks.




[jira] [Commented] (LUCENE-1421) Ability to group search results by field

2011-05-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031266#comment-13031266
 ] 

Michael McCandless commented on LUCENE-1421:


bq. I think that grouping code should be part of Lucene instead of Solr.

+1

This is a very popular issue (currently tied for 2nd place in votes).

Unfortunately, I think the single-pass collector attached here doesn't
scale very well to large maxDoc and/or large number of unique groups.
Also, it pulls a DocTermsIndex on the top-level reader (costly in an
NRT/reopen setting since it's not per-segment).

So I decided to factor out parts of Solr's current two-pass approach
into a shared grouping module.

The downside of the two-pass approach is you run the query twice,
automatically halving your QPS.  (It's even worse because the grouping
itself is somewhat compute-intensive too).  To try to help mitigate
this, I also added a new CachingCollector, which just holds hits
(docID and optionally score) up to a max allowed RAM consumption, and
can then replay them for the 2nd pass.  It includes a max RAM
setting so that if too many hits are found, it stops caching (and you
must then re-execute the query).
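
To make the caching idea concrete, here is a minimal sketch in plain Java. It is not the CachingCollector from the patch: HitCache, collect, replay and the 4-bytes-per-hit accounting are assumptions for illustration, and scores are not cached.

```java
import java.util.Arrays;
import java.util.function.IntConsumer;

/** Hypothetical stand-in for the caching idea; not Lucene's CachingCollector API. */
final class HitCache {
    private int[] docIds = new int[16];
    private int size = 0;
    private final int maxHits;      // derived from the RAM budget
    private boolean cached = true;  // flips to false once the budget is exceeded

    HitCache(long maxRamBytes) {
        // Assumption: one int (4 bytes) per cached hit.
        this.maxHits = (int) Math.min(Integer.MAX_VALUE - 8, maxRamBytes / Integer.BYTES);
    }

    /** Called once per hit during the first pass. */
    void collect(int docId) {
        if (!cached) {
            return;
        }
        if (size == maxHits) {
            // Over budget: drop what was cached and stop caching.
            cached = false;
            docIds = null;
            return;
        }
        if (size == docIds.length) {
            int newLength = (int) Math.min((long) maxHits, 2L * docIds.length);
            docIds = Arrays.copyOf(docIds, newLength);
        }
        docIds[size++] = docId;
    }

    boolean isCached() {
        return cached;
    }

    /** Replays the cached hits, in collection order, into the second-pass consumer. */
    void replay(IntConsumer secondPass) {
        if (!cached) {
            throw new IllegalStateException("cache was abandoned; re-execute the query");
        }
        for (int i = 0; i < size; i++) {
            secondPass.accept(docIds[i]);
        }
    }

    public static void main(String[] args) {
        HitCache cache = new HitCache(1024);
        for (int doc = 0; doc < 10; doc++) {
            cache.collect(doc);
        }
        if (cache.isCached()) {
            cache.replay(doc -> System.out.println("second pass sees doc " + doc));
        }
    }
}
```

The caller checks isCached() after the first pass: if it is true, the second pass replays from memory, otherwise the query has to be run again.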

But one nice side effect of the two-phased approach is that sharding
is in theory straightforward (I think?).  Ie, all shards would do the
first phase, concurrently, to get the top N groups.  Then you
merge-sort the resulting top groups, then run second phase (finding
docs w/in the top groups) on all shards, then merge results from the
same group across all shards.
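
A rough sketch of that flow, in plain Java and under several assumptions: groups are ranked by their best-scoring doc, shards are simple in-memory lists processed one after another, and Hit, firstPhase, topDocs and search are hypothetical names rather than anything in the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ShardedGrouping {
    /** Hypothetical hit: doc id, group value and relevance score. */
    static final class Hit {
        final int docId;
        final String group;
        final float score;
        Hit(int docId, String group, float score) {
            this.docId = docId; this.group = group; this.score = score;
        }
    }

    /** Top N groups ranked by their best score; keys iterate best group first. */
    static Map<String, Float> topGroups(Map<String, Float> bestScorePerGroup, int topN) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(bestScorePerGroup.entrySet());
        entries.sort(Map.Entry.<String, Float>comparingByValue().reversed());
        Map<String, Float> top = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : entries.subList(0, Math.min(topN, entries.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }

    /** Phase 1 on one shard: best score per group, cut to the shard's top N groups. */
    static Map<String, Float> firstPhase(List<Hit> shard, int topN) {
        Map<String, Float> best = new HashMap<>();
        for (Hit h : shard) {
            best.merge(h.group, h.score, (a, b) -> Math.max(a, b));
        }
        return topGroups(best, topN);
    }

    /** Top K docs by score within each given group; used per shard and for the final merge. */
    static Map<String, List<Hit>> topDocs(List<Hit> hits, Collection<String> groups, int topK) {
        Map<String, List<Hit>> result = new LinkedHashMap<>();
        for (String g : groups) {
            result.put(g, new ArrayList<>());
        }
        for (Hit h : hits) {
            List<Hit> docs = result.get(h.group);
            if (docs != null) {
                docs.add(h);
            }
        }
        for (List<Hit> docs : result.values()) {
            docs.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            if (docs.size() > topK) {
                docs.subList(topK, docs.size()).clear();
            }
        }
        return result;
    }

    /** Runs both phases over all shards (sequentially here; real shards would run concurrently). */
    static Map<String, List<Hit>> search(List<List<Hit>> shards, int topN, int topK) {
        Map<String, Float> best = new HashMap<>();
        for (List<Hit> shard : shards) {
            firstPhase(shard, topN).forEach((g, s) -> best.merge(g, s, (a, b) -> Math.max(a, b)));
        }
        Collection<String> groups = topGroups(best, topN).keySet();
        List<Hit> candidates = new ArrayList<>();
        for (List<Hit> shard : shards) {
            for (List<Hit> docs : topDocs(shard, groups, topK).values()) {
                candidates.addAll(docs);
            }
        }
        return topDocs(candidates, groups, topK);
    }

    static List<List<Hit>> sampleShards() {
        List<Hit> a = Arrays.asList(new Hit(1, "x", 0.9f), new Hit(2, "y", 0.5f), new Hit(3, "x", 0.4f));
        List<Hit> b = Arrays.asList(new Hit(4, "y", 0.8f), new Hit(5, "z", 0.7f), new Hit(6, "x", 0.6f));
        return Arrays.asList(a, b);
    }

    public static void main(String[] args) {
        search(sampleShards(), 2, 2).forEach((group, docs) -> {
            for (Hit h : docs) {
                System.out.println(group + " -> doc " + h.docId + " (score " + h.score + ")");
            }
        });
    }
}
```

With the sample data and topN=2, topK=2, the merged top groups are x and y: group x keeps docs 1 and 6, which come from different shards, and group y keeps docs 4 and 2.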





[jira] [Commented] (LUCENE-1421) Ability to group search results by field

2011-05-10 Thread Bill Bell (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031537#comment-13031537
 ] 

Bill Bell commented on LUCENE-1421:
---

The issue I have with the group=true feature in Solr is that the facets are not 
calculated post grouping.
So I cannot show the (count) in the facet list for a field.

If we can get the facets to return counts POST grouping that would be ideal.

Bill

