[jira] [Commented] (SOLR-8096) Major faceting performance regressions

Uwe Schindler (JIRA) Fri, 25 Sep 2015 02:50:14 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907870#comment-14907870
 ]


Uwe Schindler commented on SOLR-8096:
-------------------------------------

bq. Use of the highly optimized faceting that Solr had for multi-valued fields 
over relatively static indexes was secretly removed as part of LUCENE-5666, 
causing severe performance regressions.

Hi, the removal was not "secret". Removal of FieldCache from Lucene (and its 
replacement by UninvertingReader) was discussed on the issue tracker, although 
interest from Solr people was low. I think this is the main issue here. 
Sometimes it would be good to have Solr committers taking part in discussions 
on Lucene issues. If you want to make Solr better, you should also help in 
making Lucene better!

The old field cache was also put into a separate module (with the new 
DocValues-emulating API), because we (Lucene committers) knew that Solr still 
uses it. Sure, we could have used UninvertingReader on top of 
SlowCompositeReaderWrapper, but this would have brought other slowness! So the 
committers decided to step forward and remove the top-level faceting (which 
was long overdue).

It was announced in several talks about Lucene 5 that FieldCache was removed 
and that all faceting in Solr was implicitly changed to use only per-segment 
field caches (e.g., see my talks at FOSDEM 2015, JAX 2015, or Berlin 
Buzzwords, around one of the last slides). Maybe an entry about this should 
also have been added to Solr's CHANGES.txt, but the first line of the 
CHANGES.txt entry for this change already mentions that faceting in Solr is 
involved. Any Solr committer could have looked into the code and brought up 
complaints about those changes in the issue tracker, even after this commit 
had been done:

{quote}
* LUCENE-5666: Change uninverted access (sorting, faceting, grouping, etc)
  to use the DocValues API instead of FieldCache. For FieldCache functionality,
  use UninvertingReader in lucene/misc (or implement your own FilterReader).
  UninvertingReader is more efficient: supports multi-valued numeric fields,
  detects when a multi-valued field is single-valued, reuses caches
  of compatible types (e.g. SORTED also supports BINARY and SORTED_SET access
  without insanity).  "Insanity" is no longer possible unless you explicitly
  want it. Rename FieldCache* and DocTermOrds* classes in the search package
  to DocValues*. Move SortedSetSortField to core and add SortedSetFieldSource
  to queries/, which takes the same selectors. Add helper methods to
  DocValues.java that are better suited for search code (never return null,
  etc).  (Mike McCandless, Robert Muir)
{quote}


bq. The people who did this are elasticsearch employees. That is one way to 
deal with Solr's faster faceting!

This is speculation and really bad behaviour on an open source issue tracker. 
We should discuss technical matters here, not make assumptions about what 
people intend to do. This statement was posted by a person ([~mmurphy3141]) 
whom I have never met in person, and who has seldom taken part in Lucene/Solr 
discussions at all. So I don't think we should count on that. It is also bad 
behaviour to accuse committers of sabotage on Twitter: 
https://twitter.com/mmurphy3141/status/647254551356162048; please don't do 
this. I would ask that this tweet be removed, thanks.

I was informed about the changes mentioned here and I strongly agree with the 
committers behind LUCENE-5666. I was always in favour of removing those 
top-level faceting algorithms. So they still have my strong +1. Among my Solr 
customers I have seen nobody who complained about slow top-level faceting 
(because I told them a long time ago to stop using those outdated top-level 
algorithms if they have dynamic indexes).

The right thing for Solr people to do would be to remove that top-level stuff 
completely. It no longer fits the reader structure (composite and 
atomic/leaf readers) introduced in Lucene 3 (with API cleanups to better 
reflect that structure in Lucene 4). Lucene 3 has already been retired for 
several years! So there has been plenty of time to move Solr's faceting away 
from top-level. People with static indexes can still force merge their index 
and will have the same performance with the new algorithms.
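
As a rough illustration of the per-segment idea (a toy Python sketch, not 
Lucene or Solr code; every name in it is hypothetical and it ignores 
ordinals, caches, and deleted documents): each segment is faceted on its 
own and the per-segment counts are then merged. With a force-merged, 
single-segment index the merge step has only one input, which is the 
intuition behind the remark about static indexes.

```python
from collections import Counter

def facet_segment(segment_docs, matching_ids):
    """Count field values over the matching docs of a single segment."""
    counts = Counter()
    for doc_id in matching_ids:
        # Each doc is a list of values (a multi-valued field); may be empty.
        counts.update(segment_docs[doc_id])
    return counts

def facet_index(segments, matches_per_segment):
    """Facet every segment independently, then merge the partial counts."""
    total = Counter()
    for docs, matching in zip(segments, matches_per_segment):
        total.update(facet_segment(docs, matching))
    return total

# Hypothetical toy index with two segments of multi-valued docs.
segments = [
    [["red", "blue"], ["red"]],
    [["blue"], [], ["green", "red"]],
]
matches = [[0, 1], [0, 2]]          # segment-local ids of matching docs
multi = facet_index(segments, matches)

# The same docs after a simulated "force merge" into one segment.
merged = [segments[0] + segments[1]]
merged_matches = [[0, 1, 2, 4]]     # same docs, renumbered in the one segment
single = facet_index(merged, merged_matches)
```

Both calls produce the same counts; only the number of partial results 
that have to be combined differs.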

Please keep in mind that it took about half a year before anyone noticed a 
problem like this, which makes me think that only a few people are using 
those mostly-static indexes. 

*We should work on this issue to fix the issue, not accuse people, thanks!*

> Major faceting performance regressions
> --------------------------------------
>
>                 Key: SOLR-8096
>                 URL: https://issues.apache.org/jira/browse/SOLR-8096
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 5.0, 5.1, 5.2, 5.3, Trunk
>            Reporter: Yonik Seeley
>            Priority: Critical
>
> Use of the highly optimized faceting that Solr had for multi-valued fields 
> over relatively static indexes was *secretly removed* as part of LUCENE-5666, 
> causing severe performance regressions.
> Here are some quick benchmarks to gauge the damage, on a 5M document index, 
> with each field having between 0 and 5 values per document.  *Higher numbers 
> represent worse 5x performance*.
> Solr 5.4_dev faceting time as a percent of Solr 4.10.3 faceting time
> (columns: percent of index being faceted)
> ||num_unique_values||10%||50%||90%||
> |10|351.17%|1587.08%|3057.28%|
> |100|158.10%|203.61%|1421.93%|
> |1000|143.78%|168.01%|1325.87%|
> |10000|137.98%|175.31%|1233.97%|
> |100000|142.98%|159.42%|1252.45%|
> |1000000|255.15%|165.17%|1236.75%|
> For example, a field with 1000 unique values in the whole index, faceting 
> with 5x took 143% of the 4x time, when ~10% of the docs in the index were 
> faceted.
> One user who brought the performance problem to our attention: 
> http://markmail.org/message/ekmqh4ocbkwxv3we
> "faceting is unusable slow since upgrade to 5.3.0" (from 4.10.3)
> The disabling of the UnInvertedField algorithm was previously discovered in 
> SOLR-7190, but we didn't know just how bad the problem was at that time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
