Pivot performance

2014-11-13 Thread Neil Ireson
Hi all,

I was running an experiment which involved counting terms by day, so I was 
using pivot facets to get the counts. However, as the number of time and term 
values increased, the performance became very poor. So I knocked up a quick 
test, using a collection of 1 million documents with varying numbers of 
random values, to compare different ways of getting the counts.

1) Combined = combining the time and term in a single field.
2) Facet = for each term set the query to the term and then get the time facet 
3) Pivot = get the pivot facet.
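
For readers who want to reproduce the comparison, the three approaches listed above could be expressed roughly as the request parameters below. This is only a sketch: the field names (term, day, day_term) are hypothetical because the post does not name its fields, and rows=0 and facet.limit=-1 are assumptions about how the counts were requested. The parameters q, rows, facet, facet.field, facet.limit and facet.pivot are standard Solr request parameters.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Builds request parameters for the three counting strategies (sketch). */
final class CountingRequests {
    // Hypothetical field names; the original post does not name its fields.
    static final String TERM_FIELD = "term";
    static final String DAY_FIELD = "day";
    static final String COMBINED_FIELD = "day_term";

    /** 1) Combined: a single facet over a field holding day and term together. */
    static Map<String, String> combined() {
        Map<String, String> p = base("*:*");
        p.put("facet.field", COMBINED_FIELD);
        return p;
    }

    /** 2) Facet: one request per term, faceting on the day field. */
    static Map<String, String> facetForTerm(String term) {
        Map<String, String> p = base(TERM_FIELD + ":" + term);
        p.put("facet.field", DAY_FIELD);
        return p;
    }

    /** 3) Pivot: a single request nesting day counts under each term. */
    static Map<String, String> pivot() {
        Map<String, String> p = base("*:*");
        p.put("facet.pivot", TERM_FIELD + "," + DAY_FIELD);
        return p;
    }

    /** Parameters shared by all three strategies (assumed, not from the post). */
    private static Map<String, String> base(String q) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("q", q);
        p.put("rows", "0");          // only counts are needed, not documents
        p.put("facet", "true");
        p.put("facet.limit", "-1");  // ask for every value, not just the top ones
        return p;
    }

    /** Joins parameters as key=value pairs; values are left unencoded for readability. */
    static String toQueryString(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        params.forEach((k, v) -> joiner.add(k + "=" + v));
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(toQueryString(combined()));
        System.out.println(toQueryString(facetForTerm("cloud")));
        System.out.println(toQueryString(pivot()));
    }
}
```

Note that approach 2 issues one request per term, so the list of terms has to be known (or fetched) first.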

The results show that, as the number of values (i.e. number of terms * number 
of times) increases, everything is fine until around 100,000 values, and then 
it goes pear-shaped for pivots, which take nearly 4 minutes for 1 million 
values. The facet-based approach produces much more robust performance.

Processing time in ms:

Values | Combined | Facet |  Pivot
     9 |      144 |   391 |     62
   100 |      170 |   501 |     52
   961 |      789 |  1014 |    153
     1 |      907 |  1966 |    940
 99856 |     1647 |  3832 |   1960
499849 |     5445 |  7573 | 136423
999867 |     9926 |  8690 | 233725


In the end I used the facet rather than pivot approach but I’d like to know why 
pivots have such a catastrophic performance crash? Is this an expected 
behaviour of pivots or am I doing something wrong?

N



Re: Pivot performance

2014-11-13 Thread Neil Ireson
I found a post 
(http://lucene.472066.n3.nabble.com/Solr-4-3-Pivot-Performance-Issue-td4074617.html) 
commenting that the pivot performance issue happened after version 4.0.0. So I 
ran my test on version 4.0.0 and found that the pivoting did not suffer the 
performance crash, and generally produced much better results.

Processing time in ms (version 4.0.0):

Values | Combined | Facet |  Pivot
     9 |      180 |   300 |     34
   100 |      163 |   521 |     30
   961 |      729 |   666 |     72
     1 |      709 |  1006 |    659
 99856 |     1896 |  2214 |    719
499849 |     2989 |  4863 |   1719
999872 |     5552 |  8113 |   3856

Therefore I think something has definitely gone awry.

N





Re: Pivot performance

2014-11-13 Thread Neil Ireson
For completeness I tried to find which version change caused the issue. In 
fact the performance was fine up to and including 4.9.0, so the problem seems 
to have appeared only in the latest version.

N


 



Re: NPE when using pivots

2014-10-22 Thread Neil Ireson
Firstly apologies as I originally sent this to the dev list by mistake...


Bit of a weird one and not sure if this counts as a bug.

I erroneously created a filter which correctly filters a term into the index 
(e.g. clouds -> cloud); however, if that term is filtered again it returns no 
value (e.g. cloud -> null). This was an invisible issue until I used pivoting. 
If I used the caches then I got:

SEVERE: java.lang.NullPointerException
    at java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
    at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988)
    at org.apache.solr.util.ConcurrentLRUCache.get(ConcurrentLRUCache.java:89)
    at org.apache.solr.search.FastLRUCache.get(FastLRUCache.java:130)
    at org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:1232)
    at org.apache.solr.handler.component.PivotFacetProcessor.getSubset(PivotFacetProcessor.java:244)

However if I turned the caches off I got a more descriptive error:

SEVERE: java.lang.IllegalArgumentException: Query and filter cannot be null.
    at org.apache.lucene.search.FilteredQuery.<init>(FilteredQuery.java:68)
    at org.apache.lucene.search.FilteredQuery.<init>(FilteredQuery.java:54)
    at org.apache.lucene.search.IndexSearcher.wrapFilter(IndexSearcher.java:228)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)
    at org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:1197)
    at org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:1241)
    at org.apache.solr.handler.component.PivotFacetProcessor.getSubset(PivotFacetProcessor.java:244)

The issue seems to be in the PivotFacetProcessor.getSubset method in the 
following two lines

 Query query = ft.getFieldQuery(null, field, pivotValue);
 return searcher.getDocSet(query, base);

In my case ft.getFieldQuery returns a null, which causes the Exception.

As I said, my filter was incorrect. However, there may be some weird, but 
valid, case where the filter does change the term if it’s passed through 
multiple times (something like a filter that decrements a number or date). In 
that case the PivotFacetProcessor would not throw an NPE but would produce 
incorrect values.

So...

1) It might be nice to throw a similar Exception if the caches are used
2) I’m not sure why the pivotValue, which is taken from the index, is re-parsed 
through the filters. Surely if the value is taken from the index it would be 
more efficient (and correct) to just create a TermQuery.
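
To make the two suggestions concrete, here is a small self-contained model of the problem. It is not Solr code: analyze() is a stand-in for my faulty filter chain, and the two query-building methods are hypothetical helpers that return a plain "field:term" string instead of a real Query object. It only illustrates why re-analysing an indexed value can lose the term, and what a descriptive failure could look like regardless of cache settings.

```java
import java.util.Optional;

/** Toy model of a non-idempotent filter applied to an already indexed value. */
final class ReanalysisDemo {
    /**
     * Stand-in for the faulty analysis chain (not Solr code):
     * "clouds" becomes "cloud", but "cloud" itself is dropped.
     */
    static Optional<String> analyze(String token) {
        if (token.equals("clouds")) {
            return Optional.of("cloud");
        }
        if (token.equals("cloud")) {
            return Optional.empty();
        }
        return Optional.of(token);
    }

    /** Mimics re-parsing the value through the filters, failing loudly if nothing is left. */
    static String queryByReanalysis(String field, String value) {
        Optional<String> term = analyze(value);
        if (!term.isPresent()) {
            throw new IllegalArgumentException(
                "Analysis of value '" + value + "' for field '" + field
                + "' produced no query");
        }
        return field + ":" + term.get();
    }

    /** Mimics the suggested alternative: use the indexed value verbatim. */
    static String queryByIndexedTerm(String field, String indexedValue) {
        return field + ":" + indexedValue;
    }

    public static void main(String[] args) {
        // What ends up in the index after the first pass through the filter.
        String indexed = analyze("clouds").get();
        System.out.println(queryByIndexedTerm("text", indexed));
        try {
            System.out.println(queryByReanalysis("text", indexed));
        } catch (IllegalArgumentException e) {
            System.out.println("Re-analysis failed: " + e.getMessage());
        }
    }
}
```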

N

Possible parent/child query bug

2013-11-22 Thread Neil Ireson

Not sure if this is a bug but, for me, it was unexpected behaviour.

http://localhost:8090/solr/select?q={!child+of=doc_type:parent}*:* 

returns all the child docs, as expected, however

http://localhost:8090/solr/select?q={!child+of=doc_type:parent}

returns all the parent docs. 

This seems wrong to me, especially as the following query also returns all the 
parent docs, which would make the two queries equivalent:

http://localhost:8090/solr/select?q={!parent+which=doc_type:parent}
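
If anyone wants to send the three requests above from code rather than a browser, the local-params syntax needs encoding. A minimal sketch using only the JDK follows; the class and method names are my own, while the host, port and the doc_type:parent filter are the ones from the URLs above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Builds encoded select URLs for the block join queries shown above (sketch). */
final class BlockJoinUrls {
    static final String SELECT = "http://localhost:8090/solr/select?q=";

    /** {!child of=<filter matching all parents>}<nested query> */
    static String childOf(String parentFilter, String nestedQuery) {
        return "{!child of=" + parentFilter + "}" + nestedQuery;
    }

    /** {!parent which=<filter matching all parents>}<nested query> */
    static String parentWhich(String parentFilter, String nestedQuery) {
        return "{!parent which=" + parentFilter + "}" + nestedQuery;
    }

    /** Form-encodes the query so that braces, '!' and spaces survive transport. */
    static String selectUrl(String query) {
        try {
            return SELECT + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(selectUrl(childOf("doc_type:parent", "*:*")));
        System.out.println(selectUrl(childOf("doc_type:parent", "")));
        System.out.println(selectUrl(parentWhich("doc_type:parent", "")));
    }
}
```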





Re: Possible parent/child query bug

2013-11-22 Thread Neil Ireson
Some further odd behaviour. For my index

http://localhost:8090/solr/select?q={!child+of=doc_type:parent}*:* 

returns numFound=“22984”, when there are only 2910 documents in the index 
(748 parents, 2162 children).




 



Re: Possible parent/child query bug

2013-11-22 Thread Neil Ireson
Hi Mikhail,

You are right. 

If the “child of” query matches both parent and child docs, it returns the 
child documents but a spurious numFound.

For the “parent which” query, if it matches both parent and child docs it 
returns a handy error message: “child query must only match non-parent docs...”




On 22 Nov 2013, at 14:03, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

 Neil,
 quick hint. Can't you run Solr (jetty) with -ea? My feeling is that the
 nested query (which you put as *:*) should be orthogonal to children;
 that's confirmed by an assert. That's true for {!parent} at least.
 
 
 -- 
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics
 
 http://www.griddynamics.com
 mkhlud...@griddynamics.com



error opening index solr 4.0 with lukeall-4.0.0-ALPHA.jar

2012-12-07 Thread Neil Ireson
In case it is of use, I have just uploaded an updated and mavenised 
version of the Luke code to the Luke discussion list, see 
https://groups.google.com/d/topic/luke-discuss/MNT_teDxVno/discussion.

It seems to work with the latest (4.0.0 and 4.1-SNAPSHOT) versions of 
Lucene.


N


Feature Request: Return count for documents which are possible to select

2008-12-15 Thread Neil Ireson

Hi all,

Whilst Solr is a great resource (a big thank you to the developers) it 
presents me with a couple of issues.


The need for hierarchical facets I would say is a fairly crucial missing 
piece but has already been pointed out 
(http://issues.apache.org/jira/browse/SOLR-64).


The other issue relates to providing (count) feedback for disjoint 
selections. When a facet value is selected this constrains the documents 
and Solr returns the counts for all the other facet values. Thus the 
user can see all the possible valid selections (i.e. those having a 
count > 0) and the number of documents which will be returned if that 
value is selected. However, one of the valid selections is to select 
another value in the same facet, creating a disjoint selection and 
increasing the number of returned documents. There is currently no way 
for the user to know which values are valid to select, as the count only 
relates to the currently selected documents and not to documents which 
are also still possible to select.
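
To illustrate the kind of count I mean, here is a small in-memory sketch. It is not a Solr feature or API: the Doc class, the colour and shape facets and the sample data are all made up for the example. The idea is to count values of one facet while ignoring the current selection in that same facet, applying only the constraints from the other facets.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Counts facet values as if the selection within that facet were not applied. */
final class DisjointFacetCounts {
    /** Minimal document with one value in each of two hypothetical facets. */
    static final class Doc {
        final String colour;
        final String shape;

        Doc(String colour, String shape) {
            this.colour = colour;
            this.shape = shape;
        }
    }

    /** Made-up sample data. */
    static List<Doc> sample() {
        return Arrays.asList(
            new Doc("red", "circle"),
            new Doc("red", "square"),
            new Doc("blue", "circle"),
            new Doc("green", "square"),
            new Doc("blue", "circle"));
    }

    /**
     * Counts documents per colour using only the shape constraint. Any colour
     * already selected is deliberately ignored, so every colour with a count
     * above zero is still a valid extra selection. An empty shape selection
     * means "no constraint".
     */
    static Map<String, Integer> colourCounts(List<Doc> docs, Set<String> selectedShapes) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Doc d : docs) {
            if (selectedShapes.isEmpty() || selectedShapes.contains(d.shape)) {
                counts.merge(d.colour, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Shape "circle" is selected; the colour selection is left out on purpose.
        System.out.println(colourCounts(sample(), Collections.singleton("circle")));
    }
}
```

Because each document here has a single colour, the count for an unselected colour is exactly the number of documents that adding it to the selection would bring in. With multi-valued facets the counts would overlap and only give an upper bound.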


I hope this is clear, it's not the easiest issue to explain (or perhaps 
I just do it badly). Anyway other Faceted Browsers, such as the Simile 
Project's Exhibit, do return counts showing the effect of disjoint 
selections which is more useful for the user.



N



PS I'm unsure whether this should be posted to the developer's list so I 
posted here first.