[jira] [Commented] (SOLR-4824) Fuzzy / Faceting results are changed after ingestion of documents past a certain number

2013-10-11 Thread Lakshmi Venkataswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792742#comment-13792742
 ] 

Lakshmi Venkataswamy commented on SOLR-4824:


I have tested version 4.5.0 and observed the same behavior, so we are 
staying with 3.6 in production for now.

> Fuzzy / Faceting results are changed after ingestion of documents past a 
> certain number 
> 
>
> Key: SOLR-4824
> URL: https://issues.apache.org/jira/browse/SOLR-4824
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.2, 4.3
> Environment: Ubuntu 12.04 LTS 12.04.2 
> jre1.7.0_17
> jboss-as-7.1.1.Final
>Reporter: Lakshmi Venkataswamy
>
> In upgrading from Solr 3.6 to 4.2/4.3 and comparing results on fuzzy queries, 
> I found that after a certain number of documents were ingested, the fuzzy 
> query returned drastically fewer results.  We ingest approximately 18,000 
> documents per day, and after ingesting approximately 40 days of documents, the 
> next incremental day of documents causes the same fuzzy search to return far 
> fewer results.
> The query :  
> http://10.100.1.xx:8080/solr/corex/select?q=cc:worde~1&facet=on&facet.field=date&fl=date&facet.sort
> produces the following result before the threshold is crossed
> 
> status: 0, QTime: 2349
> params: facet=on, fl=date, q=cc:worde~1, facet.field=date
> result: numFound="362803" start="0"
> name="facet_fields" (counts for the date field, one per line):
> 2866
> 11372
> 11514
> 12015
> 11746
> 10853
> 11053
> 11815
> 11427
> 11475
> 11461
> 12058
> 11335
> 12039
> 12064
> 12234
> 12545
> 11766
> 12197
> 11414
> 11633
> 12863
> 12378
> 11947
> 11822
> 11882
> 10474
> 11051
> 11776
> 11957
> 11260
> 8511
>  name="facet_ranges"/>
> Once the threshold of 40 days of ingested documents is crossed, the results 
> drop as shown below for the same query
> 
> status: 0, QTime: 2
> params: facet=on, fl=date, q=cc:worde~1, facet.field=date
> name="facet_fields" (counts for the date field, one per line):
> 0
> 41
> 21
> 24
> 19
> 9
> 11
> 17
> 14
> 24
> 43
> 14
> 52
> 57
> 25
> 17
> 34
> 11
> 16
> 121
> 33
> 26
> 59
> 27
> 10
> 9
> 6
> 16
> 11
> 15
> 21
> 109
> 11
> 7
> 10
> 8
> 13
> 75
> 77
> 31
> 35
> 22
> 18
> 11
> 68
> 40
>  name="facet_ranges"/>
> I have also tested this with different months of data and have seen the same 
> issue at around the same number of documents.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4824) Fuzzy / Faceting results are changed after ingestion of documents past a certain number

2013-05-23 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665103#comment-13665103
 ] 

Jack Krupansky commented on SOLR-4824:
--

Does any committer want to clarify whether this is truly a "Bug" as opposed to 
an "Improvement"? I think it's the latter. Now, the question is how to resolve it.

Although it might be nice to optionally specify a maxExpansions on every query 
term, maybe it would be sufficient to specify it as a query request parameter, 
maxExpansions=n or maxFuzzyExpansions=n. Maybe it would also be nice to specify 
a solrconfig setting, n. Or, maybe just 
set the default higher in Solr vs. Lucene, say to 100 or 250 or 500 or even 
1000, plus the query request parameter.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (SOLR-4824) Fuzzy / Faceting results are changed after ingestion of documents past a certain number

2013-05-22 Thread Lakshmi Venkataswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664179#comment-13664179
 ] 

Lakshmi Venkataswamy commented on SOLR-4824:


That makes sense.  So as a test I tried to restrict the search query to only 30 
days of data after I had ingested the additional 11 days.  This should have 
returned the same number 362,803 as before but it did not.  I got 1263 results. 

I also noticed something else.  We have a production system that is using Solr 
3.5.  I also have test systems on Solr 3.6.2 and Solr 4.3 using a smaller 
subset of production data.  The physical size of the index is very different in 
4.3 for the same data, number of fields, configuration, etc.:

Solr 3.5     averages 150 KB / document
Solr 3.6.2   averages 148 KB / document
Solr 4.3     averages  75 KB / document





[jira] [Commented] (SOLR-4824) Fuzzy / Faceting results are changed after ingestion of documents past a certain number

2013-05-21 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663483#comment-13663483
 ] 

Hoss Man commented on SOLR-4824:


I'm not very familiar with the FuzzyQuery code in question, but I believe what 
Jack is referring to is a limit on the number of _terms_ that a fuzzy query will 
consider when it scans the indexed terms (via an automaton, I think?) looking 
for terms within a given edit distance of the input.

So it's not the growing number of documents that causes the results to 
change; it's the growing number of terms that are "close" to the 
term used in the fuzzy query.

{panel:title=Hoss's Uninformed example/speculation}
Assume for a moment that you have a small index, where there are fewer than 50 
terms in the "text" field, and you ask for a fuzzy query matching "abcdefg". 
The list of "close" terms might be...

* abcdeff
* abcdegg
* abcdegf
* zbcdefg

...and there may be a total of 100 documents matching those 4 terms -- 1/2 of 
those matches may be because of the last term ("zbcdefg").

If you index a handful of additional documents, but those documents contain 
1000+ new terms in the "text" field which are very "close" to the input term, 
then the next time you do the same fuzzy query, the expanded query might 
become...

* abcdeff
* abcdegg
* ...48 more terms that start with "abcd..."

And "zbcdefg" will be excluded from the expanded query, because the expansion 
code will stop looking for additional terms as soon as it finds 50 that are 
"close".

So now you will get results based on this new expansion, which may be fewer 
documents than were previously found.
{panel}
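
The scenario in the panel above can be reproduced with a small toy model. This is only a sketch of the speculated behavior, not Lucene's actual expansion code: it assumes terms are scanned in sorted order and that scanning stops once 50 "close" terms are found. The names `within_one_edit`, `expand`, and `matching_docs`, the extra term "abcdefh", and all document ids are made up for illustration.

```python
import string

QUERY = "abcdefg"
MAX_EXPANSIONS = 50  # the default cap mentioned in this thread


def within_one_edit(a, b):
    """True if a and b differ by at most one insert, delete, or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution at position i
    return a[i:] == b[i + 1:]          # one extra character in b at position i


def expand(index, query, max_expansions=MAX_EXPANSIONS):
    """Assumed behavior: walk terms in sorted order, stop at the cap."""
    expanded = []
    for term in sorted(index):
        if within_one_edit(term, query):
            expanded.append(term)
            if len(expanded) == max_expansions:
                break
    return expanded


def matching_docs(index, terms):
    """Union of the document ids that contain any of the given terms."""
    docs = set()
    for term in terms:
        docs |= index[term]
    return docs


# Small index: term -> ids of documents containing it (hypothetical data).
small = {
    "abcdeff": set(range(0, 20)),
    "abcdegg": set(range(20, 40)),
    "abcdefh": set(range(40, 50)),
    "zbcdefg": set(range(50, 100)),  # half of all matches come from this term
    "unrelated": {7},
}

# Three new documents that introduce many new terms close to the query.
grown = {term: set(docs) for term, docs in small.items()}
new_docs = {100, 101, 102}
for pos in (3, 4, 5):
    for ch in string.ascii_lowercase:
        term = QUERY[:pos] + ch + QUERY[pos + 1:]
        if term != QUERY:
            grown.setdefault(term, set()).update(new_docs)

before = expand(small, QUERY)
after = expand(grown, QUERY)
print(len(matching_docs(small, before)))  # 100
print(len(matching_docs(grown, after)))   # 53
print("zbcdefg" in after)                 # False
```

In this model, adding three documents drops the match count from 100 to 53, because "zbcdefg" sorts after the 50 terms that fill the cap. The same kind of cutoff could explain a drop rather than a plateau in the reported results.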







[jira] [Commented] (SOLR-4824) Fuzzy / Faceting results are changed after ingestion of documents past a certain number

2013-05-15 Thread Lakshmi Venkataswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658732#comment-13658732
 ] 

Lakshmi Venkataswamy commented on SOLR-4824:


Not sure I understand.  When I have 30 days of data I get 362,803 results.  
When I add another 11 days worth of data the same search returns 1,338 results. 
 Even if there is a maximum limit would I not see a capping of the results as 
opposed to a drastic drop ?  





[jira] [Commented] (SOLR-4824) Fuzzy / Faceting results are changed after ingestion of documents past a certain number

2013-05-15 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658718#comment-13658718
 ] 

Jack Krupansky commented on SOLR-4824:
--

Lucene FuzzyQuery has a parameter named "maxExpansions", which defaults to 50, 
which I believe is the largest number of candidate terms the fuzzy query will 
"rewrite" to, so that once you have that many matching terms, I don't think any 
more will be found. Robert or one of the other Lucene experts can confirm.

At the Lucene level this can be changed, with the FuzzyQuery(Term term, int 
maxEdits, int prefixLength, int maxExpansions, boolean transpositions) 
constructor, but the Solr query parser uses the FuzzyQuery(Term term, int 
maxEdits, int prefixLength) constructor, so there is no provision for 
overriding that limit of 50.

Also note that even in Lucene maxExpansions is limited to maxBooleanClauses, 
which would be 1024 unless you override that in solrconfig. Not that that would 
do you any good unless you had a query parser that let you set maxExpansions.

Still, that is a reasonable enhancement request.
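
As a rough illustration of how the two limits described above interact, here is a minimal sketch that takes the numbers in this comment at face value (a default of 50 expansions, capped by a boolean clause limit of 1024). The function name and constants are hypothetical and are not part of the Lucene or Solr API.

```python
DEFAULT_MAX_EXPANSIONS = 50   # FuzzyQuery default cited in the comment above
DEFAULT_MAX_CLAUSES = 1024    # boolean clause limit cited in the comment above


def effective_expansions(requested=DEFAULT_MAX_EXPANSIONS,
                         max_clauses=DEFAULT_MAX_CLAUSES):
    """Number of terms a fuzzy query could expand to under both limits."""
    if requested < 0 or max_clauses < 0:
        raise ValueError("limits must be non-negative")
    return min(requested, max_clauses)


print(effective_expansions())      # 50: what the three-argument constructor gives
print(effective_expansions(500))   # 500: a larger value passed explicitly
print(effective_expansions(5000))  # 1024: clipped by the clause limit
```

Under these assumptions, raising the expansion limit past 1024 would have no effect unless the clause limit is raised as well.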

