[jira] [Commented] (SOLR-4824) Fuzzy / Faceting results are changed after ingestion of documents past a certain number
[ https://issues.apache.org/jira/browse/SOLR-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792742#comment-13792742 ]

Lakshmi Venkataswamy commented on SOLR-4824:

I have tested version 4.5.0 and observed the same behavior, so we are staying with 3.6 in production for now.

> Fuzzy / Faceting results are changed after ingestion of documents past a certain number
>
> Key: SOLR-4824
> URL: https://issues.apache.org/jira/browse/SOLR-4824
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.2, 4.3
> Environment: Ubuntu 12.04 LTS 12.04.2
>              jre1.7.0_17
>              jboss-as-7.1.1.Final
> Reporter: Lakshmi Venkataswamy
>
> In upgrading from SOLR 3.6 to 4.2/4.3 and comparing results on fuzzy queries,
> I found that after a certain number of documents were ingested the fuzzy
> query had a drastically lower number of results. We have approximately 18,000
> documents per day and after ingesting approximately 40 days of documents, the
> next incremental day of documents results in a lower number of results of a
> fuzzy search.
>
> The query:
> http://10.100.1.xx:8080/solr/corex/select?q=cc:worde~1&facet=on&facet.field=date&fl=date&facet.sort
> produces the following result before the threshold is crossed
>
> 02349 name="facet">ondate
> cc:worde~1 name="facet.field">date numFound="362803" start="0">
> name="facet_fields">
> 2866
> 11372
> 11514
> 12015
> 11746
> 10853
> 11053
> 11815
> 11427
> 11475
> 11461
> 12058
> 11335
> 12039
> 12064
> 12234
> 12545
> 11766
> 12197
> 11414
> 11633
> 12863
> 12378
> 11947
> 11822
> 11882
> 10474
> 11051
> 11776
> 11957
> 11260
> 8511
> name="facet_ranges"/>
>
> Once the threshold of 40 days of ingested documents is crossed, the results drop
> as shown below for the same query
>
> 02 name="facet">ondate name="q">cc:worde~1date
> name="facet_fields">
> 0
> 41
> 21
> 24
> 19
> 9
> 11
> 17
> 14
> 24
> 43
> 14
> 52
> 57
> 25
> 17
> 34
> 11
> 16
> 121
> 33
> 26
> 59
> 27
> 10
> 9
> 6
> 16
> 11
> 15
> 21
> 109
> 11
> 7
> 10
> 8
> 13
> 75
> 77
> 31
> 35
> 22
> 18
> 11
> 68
> 40
> name="facet_ranges"/>
>
> I have also tested this with different months of data and have seen the same
> issue around the number of documents.

--
This message was sent by Atlassian JIRA
(v6.1#6144)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4824) Fuzzy / Faceting results are changed after ingestion of documents past a certain number
[ https://issues.apache.org/jira/browse/SOLR-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665103#comment-13665103 ]

Jack Krupansky commented on SOLR-4824:

Would any committer like to clarify whether this is truly a "Bug" as opposed to an "Improvement"? I think it's the latter.

Now, the question is how to resolve it. Although it might be nice to optionally specify a maxExpansions on every query term, maybe it would be sufficient to specify it as a query request parameter, maxExpansions=n or maxFuzzyExpansions=n.

Maybe it would also be nice to specify a solrconfig setting, n.

Or, maybe just set the default higher in Solr vs. Lucene, say to 100 or 250 or 500 or even 1000, plus the query request parameter.
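The precedence suggested in the comment above (per-request parameter first, then a solrconfig-level default, then the built-in default) could look roughly like the sketch below. This is only an illustration in Python, not Solr code: the function name `resolve_max_fuzzy_expansions` and the `config_default` argument are hypothetical, and `maxFuzzyExpansions` is simply the parameter name floated in the comment, not an option the thread says Solr supports.

```python
# Hypothetical sketch: how a per-request fuzzy expansion limit might be resolved.
# None of these names exist in Solr; they only illustrate the proposed precedence.

BUILTIN_DEFAULT = 50  # the Lucene default mentioned elsewhere in this thread


def resolve_max_fuzzy_expansions(request_params, config_default=None,
                                 builtin_default=BUILTIN_DEFAULT):
    """Pick the expansion limit: request parameter > config default > built-in."""
    raw = request_params.get("maxFuzzyExpansions")
    if raw is not None:
        value = int(raw)          # request parameters arrive as strings
    elif config_default is not None:
        value = config_default    # assumed solrconfig-level setting
    else:
        value = builtin_default
    if value < 1:
        raise ValueError("expansion limit must be a positive integer")
    return value


print(resolve_max_fuzzy_expansions({}))                                   # built-in default
print(resolve_max_fuzzy_expansions({}, config_default=250))               # config default
print(resolve_max_fuzzy_expansions({"maxFuzzyExpansions": "500"}, 250))   # request wins
```

Letting the request parameter win keeps the common case cheap while still allowing a single query to ask for a wider expansion.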
[jira] [Commented] (SOLR-4824) Fuzzy / Faceting results are changed after ingestion of documents past a certain number
[ https://issues.apache.org/jira/browse/SOLR-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664179#comment-13664179 ]

Lakshmi Venkataswamy commented on SOLR-4824:

That makes sense. So as a test I tried to restrict the search query to only 30 days of data after I had ingested the additional 11 days. This should have returned the same number as before, 362,803, but it did not. I got 1,263 results.

I also noticed something else. We have a production system that is using Solr 3.5. I also have test systems on Solr 3.6.2 and Solr 4.3 using a smaller subset of production data. The physical size of the index is very different in 4.3 for the same data, number of fields, configuration, etc.

Solr 3.5 averages 150 KB / document
Solr 3.6.2 averages 148 KB / document
Solr 4.3 averages 75 KB / document
[jira] [Commented] (SOLR-4824) Fuzzy / Faceting results are changed after ingestion of documents past a certain number
[ https://issues.apache.org/jira/browse/SOLR-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663483#comment-13663483 ]

Hoss Man commented on SOLR-4824:

I'm not very familiar with the FuzzyQuery code in question, but I believe what Jack is referring to is a limit on the number of _terms_ that a fuzzy query will consider when it scans the indexed terms (via an automaton, I think?) looking for terms within a given edit distance of the input.

So it's not the increasing number of documents that causes the results to change; it's the increasing number of terms that are "close" to the term used in the fuzzy query.

{panel:title=Hoss's Uninformed example/speculation}
Assume for a moment that you have a small index, where there are fewer than 50 terms in the "text" field, and you ask for a fuzzy query matching "abcdefg". The list of "close" terms might be...

* abcdeff
* abcdegg
* abcdegf
* zbcdefg

...and there may be a total of 100 documents matching those 4 terms -- 1/2 of those matches may be because of the last term ("zbcdefg").

If you index a handful of additional documents, but those documents contain 1000+ new terms in the "text" field which are very "close" to the input term, then the next time you do the same fuzzy query, the expanded query might become...

* abcdeff
* abcdegg
* ...48 more terms that start with "abcd..."

And "zbcdefg" will be excluded from the expanded query, because the expansion code will stop looking for additional terms as soon as it finds 50 that are "close".

So now you will get results based on this new expansion, which may be fewer documents than were previously found.
{panel}
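The scenario in the panel above can be replayed with a small simulation. This is a toy model in Python that follows the comment's own (self-described speculative) account, where terms are scanned in sorted order and scanning stops once the cap is reached; it is not Lucene's implementation, and how Lucene actually ranks and selects candidate terms is not something the sketch claims. The helper names, the document counts per term, and the use of an edit distance of 2 are all assumptions chosen to match the example.

```python
import string


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def expand_fuzzy(query_term, doc_freq, max_edits=2, max_expansions=50):
    """Collect close terms in sorted order, stopping at max_expansions (toy model)."""
    expanded = []
    for term in sorted(doc_freq):
        if edit_distance(query_term, term) <= max_edits:
            expanded.append(term)
            if len(expanded) == max_expansions:
                break
    return expanded


def count_matches(query_term, doc_freq, **kwargs):
    # Simplification: each document is assumed to contain exactly one close term,
    # so the result count is the sum of the per-term document counts.
    return sum(doc_freq[t] for t in expand_fuzzy(query_term, doc_freq, **kwargs))


# Small index: 4 close terms, 100 matching documents, half of them from "zbcdefg".
small_index = {"abcdeff": 20, "abcdegg": 20, "abcdegf": 10, "zbcdefg": 50}
before = count_matches("abcdefg", small_index)

# A few new documents introduce many new close terms, one document each.
grown_index = dict(small_index)
for c in string.ascii_lowercase:
    grown_index.setdefault("abcde" + c + "g", 1)
    grown_index.setdefault("abcdef" + c, 1)
after = count_matches("abcdefg", grown_index)

print("before:", before, "after:", after)
print("zbcdefg still expanded:", "zbcdefg" in expand_fuzzy("abcdefg", grown_index))
```

In this model the index grows but the result count shrinks, because the 50 slots fill up with rare new terms before the scan reaches "zbcdefg". That is a drop rather than a cap, which is the effect reported in the issue.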
[jira] [Commented] (SOLR-4824) Fuzzy / Faceting results are changed after ingestion of documents past a certain number
[ https://issues.apache.org/jira/browse/SOLR-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658732#comment-13658732 ]

Lakshmi Venkataswamy commented on SOLR-4824:

Not sure I understand. When I have 30 days of data I get 362,803 results. When I add another 11 days' worth of data the same search returns 1,338 results. Even if there is a maximum limit, would I not see a capping of the results as opposed to a drastic drop?
[jira] [Commented] (SOLR-4824) Fuzzy / Faceting results are changed after ingestion of documents past a certain number
[ https://issues.apache.org/jira/browse/SOLR-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658718#comment-13658718 ]

Jack Krupansky commented on SOLR-4824:

Lucene FuzzyQuery has a parameter named "maxExpansions", which defaults to 50. I believe this is the largest number of candidate terms the fuzzy query will "rewrite" to, so once you have that many matching terms, I don't think any more will be found. Robert or one of the other Lucene experts can confirm.

At the Lucene level this can be changed with the FuzzyQuery(Term term, int maxEdits, int prefixLength, int maxExpansions, boolean transpositions) constructor, but the Solr query parser uses the FuzzyQuery(Term term, int maxEdits, int prefixLength) constructor, so there is no provision for overriding that limit of 50.

Also note that even in Lucene maxExpansions is limited to maxBooleanClauses, which would be 1024 unless you override that in solrconfig. Not that that would do you any good unless you had a query parser that let you set maxExpansions.

Still, that is a reasonable enhancement request.
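The interaction between the two limits described in the comment above can be written down as a one-line clamp. The sketch below is plain Python modelling that description, not a call into Lucene; the function name is hypothetical, and the two constants are just the defaults quoted in the comment (50 expansions, 1024 boolean clauses).

```python
# Toy model of the limits described above; not Lucene code.
DEFAULT_MAX_EXPANSIONS = 50         # FuzzyQuery default cited in the comment
DEFAULT_MAX_BOOLEAN_CLAUSES = 1024  # boolean clause limit cited in the comment


def effective_expansion_limit(max_expansions=DEFAULT_MAX_EXPANSIONS,
                              max_boolean_clauses=DEFAULT_MAX_BOOLEAN_CLAUSES):
    """Number of terms a fuzzy query may expand to under both limits."""
    if max_expansions < 0:
        raise ValueError("max_expansions must not be negative")
    # Whatever is requested, the expansion cannot exceed the clause limit.
    return min(max_expansions, max_boolean_clauses)


print(effective_expansion_limit())            # three-argument constructor path
print(effective_expansion_limit(5000))        # clamped by the clause limit
print(effective_expansion_limit(5000, 8192))  # clause limit raised in config
```

Under this model, raising the clause limit alone changes nothing for a query built with the default of 50, which is why the comment notes that the override is only useful once the parser exposes maxExpansions.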