Pivot performance
Hi all,

I was running an experiment which involved counting terms by day, so I was using pivot facets to get the counts. However, as the number of time and term values increased, the performance became very poor. So I knocked up a quick test, using a collection of 1 million documents with different numbers of random values, to compare three ways of getting the counts:

1) Combined = combine the time and term in a single field.
2) Facet = for each term, set the query to the term and then get the time facet.
3) Pivot = get the pivot facet.

The results show that, as the number of values (i.e. number of terms * number of times) increases, everything is fine until around 100,000 values, and then it goes pear-shaped for pivots, taking nearly 4 minutes for 1 million values; the facet-based approach gives much more robust performance.

Processing time in ms:

 Values | Combined | Facet |  Pivot
      9 |      144 |   391 |     62
    100 |      170 |   501 |     52
    961 |      789 |  1014 |    153
      1 |      907 |  1966 |    940
  99856 |     1647 |  3832 |   1960
 499849 |     5445 |  7573 | 136423
 999867 |     9926 |  8690 | 233725

In the end I used the facet rather than the pivot approach, but I'd like to know why pivots have such a catastrophic performance crash. Is this expected behaviour of pivots, or am I doing something wrong?

N
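For concreteness, the three strategies above can be sketched as Solr request parameters. This is a minimal sketch, not the poster's actual code; the field names (term_s, day_s, term_day_s) are hypothetical, since the original post does not give the schema.

```python
def combined_params():
    # 1) Combined: facet on a single field holding "term|day" values.
    return {"q": "*:*", "rows": 0, "facet": "true",
            "facet.field": "term_day_s", "facet.limit": -1}

def facet_params(term):
    # 2) Facet: one request per term, faceting on the day field.
    return {"q": "term_s:%s" % term, "rows": 0, "facet": "true",
            "facet.field": "day_s", "facet.limit": -1}

def pivot_params():
    # 3) Pivot: a single request returning nested term -> day counts.
    return {"q": "*:*", "rows": 0, "facet": "true",
            "facet.pivot": "term_s,day_s", "facet.limit": -1}
```

Note that the facet approach needs one request per term, whereas the pivot approach returns the whole term-by-day matrix in a single request, which is why its performance matters here.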
Re: Pivot performance
I found a post (http://lucene.472066.n3.nabble.com/Solr-4-3-Pivot-Performance-Issue-td4074617.html) commenting that the pivot performance issue appeared after version 4.0.0. So I ran my test on version 4.0.0 and found that the pivoting did not suffer the performance crash, and generally produced much better results.

 Values | Combined | Facet | Pivot
      9 |      180 |   300 |    34
    100 |      163 |   521 |    30
    961 |      729 |   666 |    72
      1 |      709 |  1006 |   659
  99856 |     1896 |  2214 |   719
 499849 |     2989 |  4863 |  1719
 999872 |     5552 |  8113 |  3856

Therefore I think something has definitely gone awry.

N

On 13 Nov 2014, at 13:49, Neil Ireson n.ire...@sheffield.ac.uk wrote:
[quoted text elided]
Re: Pivot performance
I thought, for completeness, I'd try to find which version change caused the issue. In fact the performance was fine up to and including 4.9.0, so the problem seems to have appeared only in the latest version.

N

On 13 Nov 2014, at 14:46, Neil Ireson n.ire...@sheffield.ac.uk wrote:
[quoted text elided]
Re: NPE when using pivots
Firstly, apologies, as I originally sent this to the dev list by mistake...

Bit of a weird one, and I'm not sure if this counts as a bug. I erroneously created a filter which correctly filters a term into the index (e.g. clouds -> cloud); however, if that term is filtered again it returns no value (e.g. cloud -> null). This was an invisible issue until I used pivoting. If I used the caches then I got:

SEVERE: java.lang.NullPointerException
        at java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
        at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988)
        at org.apache.solr.util.ConcurrentLRUCache.get(ConcurrentLRUCache.java:89)
        at org.apache.solr.search.FastLRUCache.get(FastLRUCache.java:130)
        at org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:1232)
        at org.apache.solr.handler.component.PivotFacetProcessor.getSubset(PivotFacetProcessor.java:244)

However, if I turned the caches off I got a more descriptive error:

SEVERE: java.lang.IllegalArgumentException: Query and filter cannot be null.
        at org.apache.lucene.search.FilteredQuery.<init>(FilteredQuery.java:68)
        at org.apache.lucene.search.FilteredQuery.<init>(FilteredQuery.java:54)
        at org.apache.lucene.search.IndexSearcher.wrapFilter(IndexSearcher.java:228)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)
        at org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:1197)
        at org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:1241)
        at org.apache.solr.handler.component.PivotFacetProcessor.getSubset(PivotFacetProcessor.java:244)

The issue seems to be in the PivotFacetProcessor.getSubset method, in the following two lines:

        Query query = ft.getFieldQuery(null, field, pivotValue);
        return searcher.getDocSet(query, base);

In my case ft.getFieldQuery returns null, which causes the exception.
As I said, my filter was incorrect; however, there may be some weird, but valid, case where a filter does change the term when it is passed through multiple times (something like a decrementing number or date filter). In that case the PivotFacetProcessor would not cause an NPE but would produce incorrect values. So...

1) It might be nice to throw a similar exception when the caches are used.
2) I'm not sure why the pivotValue, which is taken from the index, is re-parsed through the filters. Surely, if the value is taken from the index, it would be more efficient (and correct) to just create a TermQuery.

N
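The failure mode described above is non-idempotent analysis: the filter maps "clouds" to "cloud" on the way into the index, but applying it a second time to the already indexed value yields nothing, which is what pivot faceting then passes to getFieldQuery. A toy Python sketch (not Solr code; the filter logic is hypothetical and only illustrates the shape of the bug):

```python
def buggy_filter(token):
    # Hypothetical non-idempotent filter: the first pass strips a
    # trailing "s" (clouds -> cloud); a second pass over the already
    # stemmed value produces no token at all (cloud -> None).
    if token.endswith("s"):
        return token[:-1]
    return None

indexed = buggy_filter("clouds")   # "cloud" is what lands in the index
requeried = buggy_filter(indexed)  # pivot re-analyses the indexed value -> None
```

Because the second pass returns None, the query built from the pivot value is null, matching the stack traces above.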
Possible parent/child query bug
Not sure if this is a bug but, for me, it was unexpected behaviour.

http://localhost:8090/solr/select?q={!child+of=doc_type:parent}*:*

returns all the child docs, as expected; however

http://localhost:8090/solr/select?q={!child+of=doc_type:parent}

returns all the parent docs. This seems wrong to me, especially as the following query also returns all the parent docs, which would make the two queries equivalent:

http://localhost:8090/solr/select?q={!parent+which=doc_type:parent}
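For anyone reproducing this, the URLs above can be built with the query properly percent-encoded; a minimal sketch, with the host and port taken from the post:

```python
from urllib.parse import urlencode

BASE = "http://localhost:8090/solr/select"

def select_url(q):
    # urlencode percent-encodes the local-params syntax ({, !, =, :)
    # so the whole block-join query survives as the q parameter.
    return BASE + "?" + urlencode({"q": q})

# All child docs of parents marked doc_type:parent; the nested query
# after the local params (*:* here) is the part the post shows being omitted.
url = select_url("{!child of=doc_type:parent}*:*")
```

The distinction under discussion is precisely whether that trailing nested query is present or not.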
Re: Possible parent/child query bug
Some further odd behaviour. For my index,

http://localhost:8090/solr/select?q={!child+of=doc_type:parent}*:*

returns numFound="22984", when there are only 2910 documents in the index (748 parents, 2162 children).

On 22 Nov 2013, at 12:28, Neil Ireson n.ire...@sheffield.ac.uk wrote:
[quoted text elided]
Re: Possible parent/child query bug
Hi Mikhail,

You are right. If the "child of" query matches both parent and child docs, it returns the child documents but a spurious numFound. For the "parent which" query, if it matches both parent and child docs it returns a handy error message: "child query must only match non-parent docs"...

On 22 Nov 2013, at 14:03, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Neil, quick hint. Can't you run Solr (jetty) with -ea? My feeling is that the nested query (where you put *:*) should be orthogonal to the children; that's confirmed by an assert. That's true for {!parent} at least.

On Fri, Nov 22, 2013 at 5:40 PM, Neil Ireson n.ire...@sheffield.ac.uk wrote:
[quoted text elided]

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
error opening index solr 4.0 with lukeall-4.0.0-ALPHA.jar
In case it is of use, I have just uploaded an updated and mavenised version of the Luke code to the Luke discussion list; see https://groups.google.com/d/topic/luke-discuss/MNT_teDxVno/discussion. It seems to work with the latest (4.0.0 / 4.1-SNAPSHOT) versions of Lucene.

N
Feature Request: Return count for documents which are possible to select
Hi all,

Whilst Solr is a great resource (a big thank you to the developers), it presents me with a couple of issues. The need for hierarchical facets is, I would say, a fairly crucial missing piece, but that has already been pointed out (http://issues.apache.org/jira/browse/SOLR-64).

The other issue relates to providing (count) feedback for disjoint selections. When a facet value is selected, this constrains the documents, and Solr returns the counts for all the other facet values. Thus the user can see all the possible valid selections (i.e. those with a count > 0) and the number of documents which will be returned if that value is selected. However, one of the valid selections is to select another value in the same facet, creating a disjoint selection and increasing the number of returned documents. There is currently no way for the user to know which of those values are valid to select, as the counts only relate to the currently selected documents, and not to documents which are also still possible to select.

I hope this is clear; it's not the easiest issue to explain (or perhaps I just do it badly). Anyway, other faceted browsers, such as the Simile Project's Exhibit, do return counts showing the effect of disjoint selections, which is more useful for the user.

N

PS I'm unsure whether this should be posted to the developer's list so I posted here first.
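For later readers: Solr's tag/exclude local params (available from Solr 1.4 onwards) give multi-select behaviour close to what is asked for here. The filter for the selected facet value is tagged, and the facet counts for that same field are computed with the tag excluded, so counts for the other (disjoint) values stay visible. A minimal sketch of the request parameters; the field name colour_s is hypothetical:

```python
params = {
    "q": "*:*",
    # The user's current selection, tagged so it can be excluded below.
    "fq": "{!tag=COLOUR}colour_s:red",
    "facet": "true",
    # Facet on the same field with the tagged filter excluded, so other
    # colour values still report the counts they would give if selected.
    "facet.field": "{!ex=COLOUR}colour_s",
}
```

This does not address hierarchical facets, only the disjoint-selection counts.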