Re: strange behavior of solr query parser
Hi Phil Staley,

Thanks for your reply, but I'm afraid that's a different problem. Our problem can be reproduced on at least Solr 7.3.0 (the oldest version we have), and we suspect it has existed since SOLR-9786:
https://github.com/apache/lucene-solr/commit/bf9db95f218f49bac8e7971eb953a9fd9d13a2f0#diff-269ae02e56283ced3ce781cce21b3147R563

Sincerely,
Hongtai

From: "Staley, Phil R - DCF"
Reply-To: "d...@lucene.apache.org"
Date: Monday, March 2, 2020 22:38
To: solr_user lucene_apache, "d...@lucene.apache.org"
Subject: Re: strange behavior of solr query parser

> I believe we are experiencing the same thing. [...]
Re: strange behavior of solr query parser
I believe we are experiencing the same thing. We recently upgraded our Drupal 8 sites to Solr 8.3.1 and are now getting reports that certain patterns of search terms result in an error that reads, "The website encountered an unexpected error. Please try again later." Below is a list of example terms that always produce this error and a similar list that works fine. The problem pattern seems to be a search term of 2 or 3 characters, followed by a space, followed by additional text.

To confirm that the problem is version 8 of Solr, I updated our local and UAT sites with the latest Drupal updates, which did include an update to the Search API Solr module, and tested the terms below under Solr 7.7.2, 8.3.1, and 8.4.1. Under 7.7.2 everything works fine; under either version 8, the problem returns. Thoughts?

Search terms that result in error:
* w-2 agency directory
* agency w-2 directory
* w-2 agency
* w-2 directory
* w2 agency directory
* w2 agency
* w2 directory

Search terms that do not result in error:
* w-22 agency directory
* agency directory w-2
* agency w-2directory
* agencyw-2 directory
* w-2
* w2
* agency directory
* agency
* directory
* -2 agency directory
* 2 agency directory
* w-2agency directory
* w2agency directory

From: Hongtai Xue
Sent: Monday, March 2, 2020 3:45 AM
To: solr_user lucene_apache
Cc: d...@lucene.apache.org
Subject: strange behavior of solr query parser

> Hi, Our team found a strange behavior of the Solr query parser. [...]
strange behavior of solr query parser
Hi,

Our team found a strange behavior of the Solr query parser. In some specific cases, conditional clauses on an unindexed field are ignored.

For a query like

q=A:1 OR B:1 OR A:2 OR B:2

if field B is not indexed (but has docValues="true"), "B:1" is lost. But if you write the query as

q=A:1 OR A:2 OR B:1 OR B:2

it works perfectly. The only difference between the two queries is the order in which they are written: one is ABAB, the other is AABB.

■ Reproduction steps and example explanation

You can easily reproduce this problem on a Solr collection with the _default configset and the exampledocs/books.csv data.

1. Create a _default collection:
bin/solr create -c books -s 2 -rf 2

2. Post books.csv:
bin/post -c books example/exampledocs/books.csv

3. Run the following query:
http://localhost:8983/solr/books/select?q=%2B%28name_str%3AFoundation+OR+cat%3Abook+OR+name_str%3AJhereg+OR+cat%3Acd%29&debug=query

I printed the query-parsing debug information; you can see that "name_str:Foundation" is lost.

Query: "name_str:Foundation OR cat:book OR name_str:Jhereg OR cat:cd"
(Please note "Jhereg" is "4a 68 65 72 65 67" and "Foundation" is "46 6f 75 6e 64 61 74 69 6f 6e" in hex.)

"debug":{
"rawquerystring":"+(name_str:Foundation OR cat:book OR name_str:Jhereg OR cat:cd)",
"querystring":"+(name_str:Foundation OR cat:book OR name_str:Jhereg OR cat:cd)",
"parsedquery":"+(cat:book cat:cd (name_str:[[4a 68 65 72 65 67] TO [4a 68 65 72 65 67]]))",
"parsedquery_toString":"+(cat:book cat:cd name_str:[[4a 68 65 72 65 67] TO [4a 68 65 72 65 67]])",
"QParser":"LuceneQParser"}}

But for the query "name_str:Foundation OR name_str:Jhereg OR cat:book OR cat:cd", everything is OK; "name_str:Foundation" is not lost.

"debug":{
"rawquerystring":"+(name_str:Foundation OR name_str:Jhereg OR cat:book OR cat:cd)",
"querystring":"+(name_str:Foundation OR name_str:Jhereg OR cat:book OR cat:cd)",
"parsedquery":"+(cat:book cat:cd ((name_str:[[46 6f 75 6e 64 61 74 69 6f 6e] TO [46 6f 75 6e 64 61 74 69 6f 6e]]) (name_str:[[4a 68 65 72 65 67] TO [4a 68 65 72 65 67]])))",
"parsedquery_toString":"+(cat:book cat:cd (name_str:[[46 6f 75 6e 64 61 74 69 6f 6e] TO [46 6f 75 6e 64 61 74 69 6f 6e]] name_str:[[4a 68 65 72 65 67] TO [4a 68 65 72 65 67]]))",
"QParser":"LuceneQParser"}}

http://localhost:8983/solr/books/select?q=%2B%28name_str%3AFoundation+OR+name_str%3AJhereg+OR+cat%3Abook+OR+cat%3Acd%29&debug=query

We did a little research, and we wonder if this is a bug in SolrQueryParser. More specifically, we think the if statement here might be wrong:
https://github.com/apache/lucene-solr/blob/branch_8_4/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L711

Could you please tell us whether this is a bug, or just a wrong query statement?

Thanks,
Hongtai Xue
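The interleaved (ABAB) versus grouped (AABB) difference can be illustrated with a toy model. This is NOT the actual SolrQueryParserBase code, just a hypothetical sketch of how a parser that batches adjacent clauses on a docValues-only field into one grouped sub-query could drop an earlier run when a different field interrupts the batch:

```python
def parse(clauses):
    """Toy model (not real Solr code): field B is docValues-only, so its
    terms are batched into one grouped sub-query instead of being emitted
    as ordinary term queries."""
    output = []
    raw_terms = []      # pending terms for the docValues-only field B
    raw_field = None    # which field the pending batch belongs to
    for field, term in clauses:
        if field == "B":
            if raw_field != "B":
                raw_terms = []   # BUG in this model: restarting the batch
                raw_field = "B"  # silently discards the earlier B-run
            raw_terms.append(term)
        else:
            raw_field = None
            output.append(f"{field}:{term}")
    if raw_terms:
        output.append("B:(" + " ".join(raw_terms) + ")")
    return output

# Interleaved clauses: the first B term disappears, as in the report.
print(parse([("A", "1"), ("B", "1"), ("A", "2"), ("B", "2")]))
# Grouped clauses: both B terms survive.
print(parse([("A", "1"), ("A", "2"), ("B", "1"), ("B", "2")]))
```

Under this model the ABAB ordering yields only B:(2) while AABB yields B:(1 2), matching the parsedquery output shown above.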
Re: strange behavior
Hi David, I see. It's fixed now by adding the parentheses. Thank you so much!

q=audit_author.name:(Burley,%20S.K.)%20AND%20entity.type:polymer

-- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: strange behavior
Hi Shawn, I see. I added the parentheses and it works now. Thank you very much for your help!

q=audit_author.name:(Burley,%20S.K.)%20AND%20entity.type:polymer&rows=1
Re: strange behavior
On 6/6/2019 12:46 PM, Wendy2 wrote:
> Why didn't "AND" work anymore? I use Solr 7.3.1 and the edismax parser. Could someone explain to me why the following query doesn't work any more? What could be the cause? Thanks!
> q=audit_author.name:Burley,%20S.K.%20AND%20entity.type:polymer
> It worked previously but now returns a much lower number of documents. I had to use "fq" to make it work correctly:
> q=audit_author.name:Burley,%20S.K.&fq=entity.type:polymer&rows=1

That should work no problem with edismax. It would not, however, work properly with dismax, and it would be easy to mix up the two query parsers. The way you have written your query is somewhat ambiguous, because of the space after the comma. That ambiguity exists in both of the queries mentioned, even the one with the fq.

Thanks,
Shawn
Re: strange behavior
audit_author.name:Burley,%20S.K. translates to

audit_author.name:Burley, DEFAULT_OPERATOR DEFAULT_FIELD:S.K.

That is, the field prefix applies only to the first whitespace-separated term; "S.K." is searched against the default field.

On Thu, Jun 6, 2019 at 2:46 PM Wendy2 wrote:
> Hi,
> Why didn't "AND" work anymore? I use Solr 7.3.1 and the edismax parser. [...]
strange behavior
Hi,

Why didn't "AND" work anymore? I use Solr 7.3.1 and the edismax parser. Could someone explain to me why the following query doesn't work any more? What could be the cause? Thanks!

q=audit_author.name:Burley,%20S.K.%20AND%20entity.type:polymer

It worked previously but now returns a much lower number of documents. I had to use "fq" to make it work correctly:

q=audit_author.name:Burley,%20S.K.&fq=entity.type:polymer&rows=1
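The fix that resolved this thread was to group the multi-term value with parentheses so the field prefix covers every term. A quick sketch of building the corrected request; the rows value is illustrative, and the parsing comment reflects the explanation given in the replies:

```python
from urllib.parse import urlencode

# Ungrouped, the field prefix applies only to "Burley,":
#   audit_author.name:Burley, S.K. AND entity.type:polymer
# parses roughly as: audit_author.name:Burley, <op> <default field>:S.K. ...
# Parentheses scope every term of the multi-term value to the field.
params = {
    "q": "audit_author.name:(Burley, S.K.) AND entity.type:polymer",
    "rows": 1,
}
query_string = urlencode(params)
print(query_string)
```

urlencode handles the percent-encoding of the colon, parentheses, comma, and spaces, producing the same shape as the hand-encoded query in the follow-up messages.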
Re: Strange Behavior When Extracting Features
If anyone else is following this thread, I replied on the Jira.

On Mon, Oct 16, 2017 at 4:07 AM, alessandro.benedetti wrote:
> This is interesting, the EFI parameter resolution should work using the quotes independently of the query parser. [...]
Re: Strange Behavior When Extracting Features
This is interesting; the EFI parameter resolution should work with quotes independently of the query parser. At that point, both query parsers receive the multi-term text, so both should work the same. When I saw the mail I tried to reproduce this through the LTR module tests and didn't succeed. It would be quite useful if you could contribute a test that fails with the field query parser. Have you tried the same query in a request handler?

---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
Re: Strange Behavior When Extracting Features
I believe I've discovered a workaround. If you use:

{
  "store": "redhat_efi_feature_store",
  "name": "case_description_issue_tfidf",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params": { "q": "{!dismax qf=text_tfidf}${text}" }
}

instead of:

{
  "store": "redhat_efi_feature_store",
  "name": "case_description_issue_tfidf",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params": { "q": "{!field f=issue_tfidf}${case_description}" }
}

you can then use single quotes to pass multi-term arguments as Alessandro suggested. I've added this information to the Jira.

On Fri, Sep 22, 2017 at 8:30 AM, alessandro.benedetti wrote:
> I think this has nothing to do with the LTR plugin. The problem here should be just the way you use the local params. [...]
Re: Strange Behavior When Extracting Features
I think this has nothing to do with the LTR plugin. The problem here should be just the way you use the local params; to properly pass multi-term local params in Solr you need to use single quotes:

efi.case_description='added couple of fiber channel'

This should work. Without the quotes, only the first term is passed as a local param and then into the efi map for LTR. I will update the Jira issue as well.

Cheers

---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
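The quoting advice above can be sketched as a small helper that assembles the LTR rerank local-params string. The helper name and structure are my own; only the single-quoting rule for efi values comes from the thread:

```python
def ltr_rq(model, rerank_docs, efi):
    """Build an {!ltr ...} rerank local-params string, single-quoting each
    efi value so multi-term text survives local-param parsing."""
    parts = [f"model={model}", f"reRankDocs={rerank_docs}"]
    for key, value in efi.items():
        # Without the quotes, only the first term of the value would be kept.
        parts.append(f"efi.{key}='{value}'")
    return "{!ltr " + " ".join(parts) + "}"

rq = ltr_rq("redhat_efi_model", 1,
            {"case_description": "added couple of fiber channel"})
print(rq)
```

The resulting string would go into the rq request parameter.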
Strange Behavior When Extracting Features
Hi all,

I'm getting some extremely strange behavior when trying to extract features for a learning-to-rank model. The following query incorrectly says all features have zero values:

http://gss-test-fusion.usersys.redhat.com:8983/solr/access/query?q=added couple of fiber channel&rq={!ltr model=redhat_efi_model reRankDocs=1 efi.case_summary=the efi.case_description=added couple of fiber channel efi.case_issue=the efi.case_environment=the}&fl=id,score,[features]&rows=10

But this query, which simply moves the word "added" from the front of the provided text to the back, properly fills in the feature values:

http://gss-test-fusion.usersys.redhat.com:8983/solr/access/query?q=couple of fiber channel added&rq={!ltr model=redhat_efi_model reRankDocs=1 efi.case_summary=the efi.case_description=couple of fiber channel added efi.case_issue=the efi.case_environment=the}&fl=id,score,[features]&rows=10

The explain output for the failing query can be found here:
https://gist.github.com/manisnesan/18a8f1804f29b1b62ebfae1211f38cc4

and the explain output for the properly functioning query can be found here:
https://gist.github.com/manisnesan/47685a561605e2229434b38aed11cc65

Have any of you run into this issue? It seems like it could be a bug.

Thanks,
Michael A. Alcorn
Re: Strange behavior of solr
Is there any error message in the log when Solr stops indexing the file at line 2046?

Regards,
Edwin

On 2 September 2015 at 17:17, Long Yan wrote:
> Hey, I have created a core with bin\solr create -c mycore [...]
Re: Strange behavior of solr
See example/films/README.txt

The "name" field is guessed incorrectly (because the first film has name=".45"), so indexing errors out once it hits a name value that is no longer numeric. The README provides a command to define the name field *before* indexing. If you've already indexed with the name field guessed incorrectly and created, you'll need to delete and recreate the collection, then define the name field, then reindex.

We used to have a fake film at the top to allow field guessing to "work", but I felt that was too fake and that the example should be true to what happens with real-world data and the pitfalls of allowing field type guessing to guess incorrectly.

--
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com

> On Sep 2, 2015, at 5:17 AM, Long Yan wrote:
> Hey, I have created a core with bin\solr create -c mycore [...]
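Defining the field up front amounts to a Schema API call before the first document is posted. A sketch of the payload; the exact field options here are assumptions, so check example/films/README.txt for the authoritative command:

```json
{
  "add-field": {
    "name": "name",
    "type": "text_general",
    "multiValued": false,
    "stored": true
  }
}
```

POSTed to the collection's /schema endpoint (e.g. http://localhost:8983/solr/films/schema), this pins the name field's type so field guessing never sees the numeric-looking ".45" value.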
Strange behavior of solr
Hey,

I have created a core with

bin\solr create -c mycore

I want to index the CSV sample files from solr-5.2.1.

If I index film.csv under solr-5.2.1\example\films\, Solr can only index this file up to the line

"2046,Wong Kar-wai,Romance Film|Fantasy|Science Fiction|Drama,,/en/2046_2004,2004-05-20"

But if I first index books.csv under solr-5.2.1\example\exampledocs and then index film.csv, Solr can index all lines in film.csv.

Why?

Regards,
Long Yan
Re: Strange Behavior
It sounds as if you are trying to treat the hyphen as a digit so that negative numbers are discrete terms. But that conflicts with the use of the hyphen as a word separator. Sorry, but WDF does not support both; pick one or the other, you can't have both.

But first, please explain your intended use case clearly; there may be some better way to achieve it.

Use the analysis page of the Solr Admin UI to see the detailed query-time and index-time analysis of your terms. You'll be surprised.

-- Jack Krupansky

-----Original Message-----
From: EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions)
Sent: Thursday, August 21, 2014 2:31 PM
To: solr-user@lucene.apache.org
Subject: Strange Behavior

> Hi, I have a field type text_general where, for the query-time WordDelimiterFilter, I am using a types file, wddftype.txt, mapping the hyphen to DIGIT. [...]
Re: Strange Behavior
On 8/23/2014 9:01 AM, Jack Krupansky wrote:
> It sounds as if you are trying to treat the hyphen as a digit so that negative numbers are discrete terms. But that conflicts with the use of the hyphen as a word separator. Sorry, but WDF does not support both. [...]

You can force WDF to treat the hyphen as a digit if you want to, but you are right that you cannot have both. To change WDF, create a text file, put the following in it, and reference it with the types parameter on WordDelimiterFilterFactory:

- => DIGIT

I use this functionality to build a special analysis chain for mimetypes. For that fieldType, I treat hyphen and underscore as ALPHANUM. Search for wdfftypes on this page for more info:

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Naturally you have to reindex after making this change. For anyone who doesn't know what that entails: http://wiki.apache.org/solr/HowToReindex

Thanks,
Shawn
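Shawn's recipe involves two pieces: the types file and the filter declaration that references it. A sketch, reusing the attribute values from the original post (treat this as one possible configuration, not the only valid one):

wdfftypes.txt:

```
- => DIGIT
```

And the query-analyzer filter referencing it:

```xml
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="0"
        splitOnCaseChange="0" splitOnNumerics="0"
        stemEnglishPossessive="0" catenateWords="1"
        catenateNumbers="1" catenateAll="1"
        preserveOriginal="1" types="wdfftypes.txt"/>
```

As Shawn notes, a full reindex is required after changing the analysis chain.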
Strange Behavior
Hi,

I have a field type, text_general, where for the query-time WordDelimiterFilter I am using a types file, wddftype.txt, which maps the hyphen to DIGIT.

When I do a query I am not getting the right results. E.g. Name:Wi-Fi gets results, but Name:"Wi-Fi Devices Make" does not get any results; if I change it to Name:"Wi-Fi Devices Make"~3 it works. Can someone explain what is happening in this situation? FYI, I have types=wdfftypes.txt only in the query analyzer.

My fieldType:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1" types="wdfftypes.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
Re: Strange Behavior with Solr in Tomcat.
Thanks, Meraj, that was exactly the issue. Setting <useColdSearcher>true</useColdSearcher> worked like a charm and the server starts up as usual. Thanks again!

On Fri, Jun 6, 2014 at 2:42 PM, Meraj A. Khan mera...@gmail.com wrote:
> This looks distinctly related to https://issues.apache.org/jira/browse/SOLR-4408; try coldSearcher = true as suggested in the JIRA and let us know. [...]
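The fix, as a solrconfig.xml fragment; the <query> section is the standard location for this element, so treat the placement as a sketch:

```xml
<query>
  <!-- Serve requests from a cold (unwarmed) searcher instead of blocking
       startup until the first searcher has finished warming. -->
  <useColdSearcher>true</useColdSearcher>
</query>
```

With the default of false, a request arriving before the first searcher is registered waits, which is the startup hang described in SOLR-4408.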
Re: Strange Behavior with Solr in Tomcat.
Interesting, thanks for reporting back. I've re-opened SOLR-4408.

On Sat, Jun 7, 2014 at 10:50 PM, S.L simpleliving...@gmail.com wrote:
> Thanks, Meraj, that was exactly the issue. Setting <useColdSearcher>true</useColdSearcher> worked like a charm and the server starts up as usual. Thanks again! [...]

--
Regards,
Shalin Shekhar Mangar.
Re: Strange Behavior with Solr in Tomcat.
Anyone, folks?

On Wed, Jun 4, 2014 at 10:25 AM, S.L simpleliving...@gmail.com wrote:
> Hi Folks, I recently started using the spellchecker in my solrconfig.xml. [...]
RE: Strange Behavior with Solr in Tomcat.
I would try a thread dump and check the output to see what's going on. You could also strace the process if you're running on Unix, or change the log level in Solr to get more information logged.

-----Original Message-----
From: S.L [mailto:simpleliving...@gmail.com]
Sent: June-06-14 2:33 PM
To: solr-user@lucene.apache.org
Subject: Re: Strange Behavior with Solr in Tomcat.

> Anyone folks? [...]
Re: Strange Behavior with Solr in Tomcat.
This looks distinctly related to https://issues.apache.org/jira/browse/SOLR-4408. Try setting useColdSearcher to true as suggested in the JIRA and let us know.
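For later readers: the setting referenced in the JIRA lives in solrconfig.xml. A minimal sketch of the relevant fragment (placement inside the <query> section is assumed from the standard example configs):

```xml
<query>
  <!-- Allow requests to be served by a searcher that has not finished
       warming, instead of blocking startup until warming completes
       (the hang described in SOLR-4408). -->
  <useColdSearcher>true</useColdSearcher>
</query>
```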
Strange Behavior with Solr in Tomcat.
Hi Folks, I recently started using the spellchecker in my solrconfig.xml. I am able to build up an index in Solr, but if I ever shut down Tomcat I am not able to restart it. The server never prints the startup time in seconds in the logs, nor does it print any error messages in the catalina.out file. The only way for me to get around this is by deleting the data directory of the index and then starting the server; obviously this makes me lose my index. Just wondering if anyone has faced a similar issue and was able to solve it. Thanks.
Re: Strange Behavior with Solr in Tomcat.
I suggest copying the index before you kill the Tomcat process; then, if the index still has to be deleted, you at least have a backup to restore from. Next time, always make a backup.
Re: Strange Behavior with Solr in Tomcat.
Hi, This is not a case of accidental deletion. The only way I can restart Tomcat is by deleting the data directory for the index that was created earlier; this started happening after I began using spellcheckers in my solrconfig.xml. As long as Tomcat is running, it's fine. Any help from anyone who has faced a similar issue would be appreciated. Thanks.
Re: Strange behavior of edismax and mm=0 with long queries (bug?)
Actually I found out why... I had "and" as a lowercase word in my queries, and the checkbox does not seem to work in the admin UI. Adding lowercaseOperators=false made the queries work.
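For anyone who hits the same symptom, the fix Nils describes can be applied per request. A sketch of the request shape (host, collection, and query values are placeholders):

```text
http://localhost:8983/solr/<collection>/select
    ?defType=edismax
    &q=<collated comments>
    &mm=1
    &lowercaseOperators=false
    &debugQuery=true
```

With lowercaseOperators=false, a bare lowercase "and" in the query text is treated as an ordinary term rather than as the AND operator, so it no longer forces the surrounding terms to be required.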
Re: Strange behavior of edismax and mm=0 with long queries (bug?)
Set the q.op parameter to OR and set mm=10% or something like that. The idea is not to restrict excessively which documents match, but to weight the matched results by how many word pairs and triples match. In addition, use the pf parameter to give extra weight when the full query term phrase matches exactly. -- Jack Krupansky
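Jack's suggestion, sketched as edismax request parameters (the field name "text" and the exact mm value are illustrative assumptions, not prescriptions):

```text
defType=edismax
q.op=OR
mm=10%        require only a small fraction of the query terms to match
pf=text       extra weight when the whole query matches as a phrase
pf2=text      extra weight for matching word pairs
pf3=text      extra weight for matching word triples
```

The point is that mm and q.op control which documents match at all, while pf/pf2/pf3 only affect ranking among the documents that already matched.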
Strange behavior of edismax and mm=0 with long queries (bug?)
Hey, I am currently using solr to recognize songs and people from a list of user comments. My index stores the titles of the songs. At the moment my application builds word ngrams and fires a search with that query, which works well but is quite inefficient. So my thought was to simply use the collated comments as query. So it is a case where the query is much longer. I need to use mm=0 or mm=1. My plan was to use edismax as the pf2 and pf3 parameters should work well for my usecase. However when using longer queries, I get a strange behavior which can be seen in debugQuery. Here is an example: Collated Comments (used as query) I love Henry so much. It is hard to tear your eyes away from Maria, but watch just his feet. You'll be amazed. sometimes pure skill can will a comp, sometimes pure joy can win... put them both together and there is no competition This video clip makes me smile. Pure joy! so good! Who's the person that gave this a thumbs down?!? This is one of the best routines I've ever seen. Period. And it's a competitionl! How is that possible? They're so good it boggles my mind. It's gorgeous. Flawless victory. Great number! Does anybody know the name of the piece? I believe it's called Sunny side of the street Maria is like, the best 'follow' I've ever seen. She's so amazing. Thanks so much Johnathan! Song name in Index Louis Armstrong - Sunny Side of The Street parsedquery_toString: +(((text:I) (text:love) (text:Henry) (text:so) (text:much.) (text:It) (text:is) (text:hard) (text:to) (text:tear) (text:your) (text:eyes) (text:away) (text:from) (text:Maria,) (text:but) (text:watch) (text:just) (text:his) (text:feet.) (text:You'll) (text:be) (text:amazed.) (text:sometimes) (text:pure) (text:skill) (text:can) (text:will) (text:a) (text:comp,) (text:sometimes) (text:pure) (text:joy) (text:can) (text:win...) 
(text:put) (text:them) (text:both) +(text:together) +(text:there) (text:is) (text:no) (text:competition) (text:This) (text:video) (text:clip) (text:makes) (text:me) (text:smile.) (text:Pure) (text:joy!) (text:so) (text:good!) (text:Who's) (text:the) (text:person) (text:that) (text:gave) (text:this) (text:a) (text:thumbs) (text:down?!?) (text:This) (text:is) (text:one) (text:of) (text:the) (text:best) (text:routines) (text:I've) (text:ever) (text:seen.) +(text:Period.) +(text:it's) (text:a) (text:competitionl!) (text:How) (text:is) (text:that) (text:possible?) (text:They're) (text:so) (text:good) (text:it) (text:boggles) (text:my) (text:mind.) (text:It's) (text:gorgeous.) (text:Flawless) (text:victory.) (text:Great) (text:number!) (text:Does) (text:anybody) (text:know) (text:the) (text:name) (text:of) (text:the) (text:piece?) (text:I) (text:believe) (text:it's) (text:called) (text:Sunny) (text:side) (text:of) (text:the) (text:street) (text:Maria) (text:is) (text:like,) (text:the) (text:best) (text:'follow') (text:I've) (text:ever) (text:seen.) (text:She's) (text:so) (text:amazing.) (text:Thanks) (text:so) (text:much) (text:Johnathan!))~1)/str This query generates 0 results. The reason is it expects terms together, there, Period., it's to be part of the document (see parsedquery above, all other terms are optional, those terms are must). Is there any reason for this behavior? If I use shorter queries it works flawlessly and returns the document. I've appended the whole query. Best, Nils ?xml version=1.0 encoding=UTF-8? response lst name=responseHeader int name=status0/int int name=QTime11/int /lst result name=response numFound=0 start=0 /result lst name=debug str name=rawquerystringI love Henry so much. It is hard to tear your eyes away from Maria, but watch just his feet. You'll be amazed. sometimes pure skill can will a comp, sometimes pure joy can win... put them both together and there is no competition This video clip makes me smile. Pure joy! so good! 
Who's the person that gave this a thumbs down?!? This is one of the best routines I've ever seen. Period. And it's a competitionl! How is that possible? They're so good it boggles my mind. It's gorgeous. Flawless victory. Great number! Does anybody know the name of the piece? I believe it's called Sunny side of the street Maria is like, the best 'follow' I've ever seen. She's so amazing. Thanks so much Johnathan! /str str name=querystringI love Henry so much. It is hard to tear your eyes away from Maria, but watch just his feet. You'll be amazed. sometimes pure skill can will a comp, sometimes pure joy can win... put them both together and there is no competition This video clip makes me smile. Pure joy! so good! Who's the person that gave this a thumbs down?!? This is one of the best routines I've ever seen. Period. And it's a competitionl! How is that possible? They're so good it boggles my mind. It's gorgeous. Flawless victory. Great number! Does anybody know the name of the piece? I believe it's called Sunny side of the street Maria is like, the best 'follow' I've ever seen. She's so amazing. Thanks so much Johnathan
Strange behavior while deleting
hi friends, I have observed a strange behavior. I have two indexes with the same ids and the same number of docs, and I am using a JSON file to delete records from both. After deleting the ids, the resulting indexes show different doc counts, and I am not sure why; I used curl with the same JSON file to delete from both indexes. Please advise asap, thanks -- Thanks and kind Regards, Abhishek
Re: Strange behavior while deleting
Do the two cores have identical schema and solrconfig files? Are the delete and merge config settings the same/identical? Are these two cores running on the same Solr server, or two separate Solr servers? If the latter, are they both running the same release of Solr? How big is the discrepancy - just a few, dozens, 10%, 50%? -- Jack Krupansky
Re: Strange behavior while deleting
Hi, These settings are commented out in the schema. These are two different Solr servers with almost identical schemas, the exception being one stemmed field. Both are running the same Solr version. Please help. Thanks, Abhishek
Re: Strange behavior while deleting
So, how big is the discrepancy? If you do a *:* query for rows=100, is the 100th result the same for both? Do a bunch of random queries and see if you can find a document key that is missing from one core but present in the other, and check whether it should have been deleted. Are you deleting by id or by query? Do you do an explicit commit on your update request? If not, it could just take a few minutes before the commit actually occurs. Are the two Solr servers on the same machine or different machines? If the latter, is one of the machines significantly faster than the other? -- Jack Krupansky
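One variable worth eliminating in a case like this is commit timing: if the commit happens implicitly (e.g. via autoCommit), the two servers may simply be at different commit points when you count. A sketch of the delete request with an explicit commit (URL, core name, and ids are placeholders):

```text
delete.json:
{ "delete": ["id1", "id2", "id3"] }

curl -H 'Content-Type: application/json' \
     'http://localhost:8983/solr/<core>/update?commit=true' \
     --data-binary @delete.json
```

After running this against both servers, a *:* count on each should reflect the same deletions.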
Strange behavior of gap fragmenter on highlighting
I'm seeing an odd behavior of the gap fragmenter on Solr 3.6. This is my current configuration for it:

<fragmenter name="gap" default="true" class="solr.highlight.GapFragmenter">
  <lst name="defaults">
    <int name="hl.fragsize">150</int>
  </lst>
</fragmenter>

This is the basic configuration; I just tweaked the fragsize parameter to get shorter fragments. The thing is that for one particular PDF document in my results I get a really long snippet, way over 150 characters. It gets a little more odd: if I change the value from 150 to 100, the snippet for the same document is normal, ~100 characters. The type of the field being highlighted is this:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" languange="Spanish"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" types="characters.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Any ideas about what's happening? Or how could I debug what is really going on? Greetings!
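One way to isolate a case like this is to override the fragmenter settings per request instead of in solrconfig.xml, e.g. (core and field names are placeholders):

```text
/solr/select?q=...&hl=true&hl.fl=<field>
    &hl.fragmenter=gap&hl.fragsize=150
```

Note that hl.fragsize is a goal, not a hard limit: the gap fragmenter breaks at token boundaries, so a snippet can run past the configured size, and a far-oversized snippet often points at an unusually long unbroken token (not uncommon in text extracted from PDFs).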
Re: Strange behavior on text field with number-text content
Hmmm, there are two things you _must_ get familiar with when diagnosing these <G>... 1) admin/analysis. That'll show you exactly what the analysis chain does, and it's not always obvious. 2) Add debug=query to your input and look at the parsed query results. For instance, this: name:4nSolution Inc. parses as name:4nSolution defaultfield:inc. That doesn't explain why name:4nSolution finds nothing, except... your index chain has splitOnCaseChange=1 and your query side has splitOnCaseChange=0, which doesn't seem right. Best, Erick
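If the mismatch Erick points at is the cause, one candidate fix (an assumption to verify with admin/analysis, not a confirmed solution) is to use the same splitting options in both analyzers so index-time and query-time tokens line up:

```xml
<!-- identical WordDelimiterFilterFactory settings at index and query time -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="0"
        splitOnCaseChange="1" preserveOriginal="1"/>
```

With splitOnCaseChange enabled on both sides, "4nSolution" splits into the same parts ("4", "n", "Solution") at index and query time, so the term query has matching tokens to hit.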
Strange behavior on text field with number-text content
Hello, I've got the following problem. I have a text type in my schema and a field "name" of that type. The field contains data; there is, for example, a record that has "300letters" as name. Now the field type definition:

<fieldType name="text" class="solr.TextField"></fieldType>

And, of course, the field definition:

<field name="name" type="text" indexed="true" stored="true"/>

Yes, that's all - there are no tokenizers. And now time for my question: why do the queries name:300 and name:letters return that record, but name:300letters does not (0 results)? Best regards, Michał Matulka
Re: Strange behavior on text field with number-text content
What does the analysis screen say in the Web AdminUI when you try to do that? Also, what are the tokens stored in the field (also in the Web AdminUI)? I think it is very strange to have a TextField without a tokenizer chain. Maybe you get a standard one assigned by default, but I don't know what the standard chain would be. Regards, Alex.
Re: Strange behavior on text field with number-text content
Hmmm, with 4.x I get much different behavior than you're describing; what version of Solr are you using? Besides Alex's comments, try adding debug=query to the URL and see what comes out from the query parser. A quick glance at the code shows that DefaultAnalyzer is used, which doesn't do any analysis; here's the javadoc:

/**
 * Default analyzer for types that only produce 1 verbatim token...
 * A maximum size of chars to be read must be specified
 */

so it's much like the "string" type. Which means I'm totally perplexed by your statement that 300 and letters return a hit. Have you perhaps changed the field definition and not re-indexed? The behavior you're seeing really looks like somehow WordDelimiterFilterFactory is getting into your analysis chain with settings that don't mash the parts back together, i.e. you can set up WDDF to split on letter/number transitions, index each part, and NOT index the original, but I have no explanation for how that could happen with the field definition you indicated. FWIW, Erick
Re: Strange behavior on text field with number-text content
Thanks for your responses, I must admit that after hours of trying I made some mistakes. So the most problematic phrase will now be: "4nSolution Inc." which cannot be found using the query name:4nSolution, or even name:4nSolution Inc., but can be found using the following queries: name:nSolution, name:4, name:inc. Sorry for the mess; it turned out I didn't reindex after modifying the schema, so I thought the problem also applied to "300letters". The cause of all of this is the WordDelimiter filter, defined as follows:

  <fieldType name="text" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- in this example, we will only use synonyms at query time
      <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
      -->
      <!-- Case insensitive stop word removal. Add enablePositionIncrements=true
           in both the index and query analyzers to leave a 'gap' for more
           accurate phrase queries. -->
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="1" splitOnCaseChange="0" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
  </fieldType>

and I still don't know why it behaves like that; after all, the "preserveOriginal" attribute is set to 1...

On 28.05.2013 14:21, Erick Erickson wrote: Hmmm, with 4.x I get much different behavior than you're describing; what version of Solr are you using? Besides Alex's comments, try adding debug=query to the URL and see what comes out of the query parser. A quick glance at the code shows that DefaultAnalyzer is used, which doesn't do any analysis. Here's the javadoc:

  /**
   * Default analyzer for types that only produce 1 verbatim token...
   * A maximum size of chars to be read must be specified
   */

so it's much like the "string" type. Which means I'm totally perplexed by your statement that 300 and letters return a hit. Have you perhaps changed the field definition and not re-indexed? The behavior you're seeing really looks like WordDelimiterFilterFactory somehow getting into your analysis chain with settings that don't mash the parts back together; i.e., you can set up WDDF to split on letter/number transitions, index each part, and NOT index the original. But I have no explanation for how that could happen with the field definition you indicated. FWIW, Erick

On Tue, May 28, 2013 at 7:47 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: What does the analysis screen say in the Web Admin UI when you try to do that? Also, what are the tokens stored in the field (also in the Web Admin UI)? I think it is very strange to have a TextField without a tokenizer chain. Maybe you get a standard one assigned by default, but I don't know what the standard chain would be. Regards, Alex.

On 28 May 2013 04:44, "Michał Matulka" michal.matu...@gowork.pl wrote: Hello, I've got the following problem. I have a "text" type in my schema and a field "name" of that type. The field contains data; there is, for example, a record that has "300letters" as its name. Now the field type definition:

  <fieldType name="text" class="solr.TextField"/>

And, of course, the field definition:

  <field name="name" type="text" indexed="true" stored="true"/>

Yes, that's all: there are no tokenizers. And now, time for my question: why do the queries name:300 and name:letters return that record, but name:300letters does not (0 results)? Best regards, Michał Matulka
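The index/query mismatch discussed in this thread can be illustrated with a small sketch. This is not Solr code; it is a simplified, hypothetical emulation of how WordDelimiterFilterFactory splits a token on letter/digit transitions, ignoring token positions, case-change splits, and stemming:

```python
import re

def word_delimiter(token, catenate_all=False, preserve_original=True):
    """Rough emulation of WordDelimiterFilterFactory: split a token on
    letter/digit transitions, optionally keep the original form and a
    catenated form, then lowercase. Simplified: real Solr also tracks
    token positions, which is where phrase-query mismatches come from."""
    parts = re.findall(r"[A-Za-z]+|[0-9]+", token)
    out = list(parts)
    if preserve_original:
        out.append(token)
    if catenate_all and len(parts) > 1:
        out.append("".join(parts))
    return sorted(set(t.lower() for t in out))

# Index-time style settings (no catenateAll):
print(word_delimiter("4nSolution"))   # ['4', '4nsolution', 'nsolution']
print(word_delimiter("300letters"))   # ['300', '300letters', 'letters']
```

Even though "4nsolution" appears in both index-time and query-time output here, the real filter emits the parts at distinct positions, so the query side can build a phrase query whose positions don't line up with what was indexed; that is the usual explanation for preserveOriginal=1 not rescuing such queries.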
Re: Distributed query: strange behavior.
Erick, Thank you for the explanation. My problem was that allowing docs with the same unique ids to be present in multiple shards in a normal situation makes it impossible to estimate the number of shards needed for an index with a really large number of docs. Thanks, Val
Re: Distributed query: strange behavior.
Hi, Erick! That's it! I'm using a custom implementation of SolrServer with distributed behavior that routes queries and updates using an in-house round-robin method. The thing is that I'm doing this myself because I noticed that duplicated documents appear when using the LBHttpSolrServer implementation. Last week I modified my implementation to avoid that with these changes:
- I have normalized the key field for all documents. Every indexed document must now include an *_id_* field that stores the selected key value. The value is set with a *copyField*.
- When I index a new document, an *HttpSolrServer* from the shard list is selected using a round-robin strategy. Then a field called *_shard_* is set on the *SolrInputDocument*. That field value records which shard the document belongs to.
- If a document to be indexed/updated includes the *_shard_* field, the shard it belongs to (*HttpSolrServer*) is selected automatically.
- If a document to be indexed/updated does not include the *_shard_* field, the key value from *_id_* is read from the *SolrInputDocument*. With that key, a distributed query by key is executed to retrieve the *_shard_* field, and with it we can choose the correct shard (*HttpSolrServer*).
It's not good practice and the performance isn't the best, but it's safe. Best Regards, - Luis Cappa
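For what it's worth, a round-robin scheme like the one described in this thread is exactly what lets the same uniqueKey land on two shards. A deterministic hash of the key sends every update for a given id to the same shard, so a reindex becomes an overwrite. This sketch is only an illustration of the idea, not Solr's actual routing code:

```python
import hashlib

def shard_for(doc_id, num_shards):
    """Route a document by a stable hash of its uniqueKey so that every
    add/update of the same id always targets the same shard.
    (Illustrative only: SolrCloud's compositeId router hashes the id
    with MurmurHash3 over a hash-range ring, not MD5 modulo N.)"""
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return h % num_shards

# Unlike round-robin, the same id always maps to the same shard:
assert shard_for("doc-42", 3) == shard_for("doc-42", 3)
```

With routing like this, re-sending a document with an existing key replaces the old copy instead of creating a duplicate on another shard, which is the behavior Erick describes for Solr-managed routing.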
Re: Distributed query: strange behavior.
Hello, guys! Well, I've done some tests and I think there is some kind of bug related to distributed search. I'm currently setting a key field that is impossible to duplicate, and I have experienced the same wrong behavior with the numFound field while changing the rows parameter. Has anyone experienced the same? Best regards, - Luis Cappa
Re: Distributed query: strange behavior.
Valery: I share your puzzlement. _If_ you are letting Solr do the document routing, and not doing any custom routing, then the same unique key should go to the same shard and replace the previous doc with that key. But if you're using custom routing, or if you've been experimenting with different configurations and didn't start over, in general if your configuration is in an interesting state this could happen. So in the normal case, if you have a document with the same key indexed in multiple shards, that would indicate a bug. But there are many ways, especially when experimenting, that you could make this happen which are _not_ a bug. I'm guessing that Luis may be trying the custom routing option, maybe? Best, Erick
Re: Distributed query: strange behavior.
Uhm... that sounds reasonable. My data model may allow duplicate keys, but it's quite unlikely. My key is a hash formed from a URL during a crawling process, and it's possible to re-crawl an existing URL. I think I need to find a new way to compose a unique key to avoid this kind of bad behavior. However, it would be very useful if Solr could alert about duplicate keys or something; maybe an extra parameter included as a field in the response alongside numFound, docs, facets, etc. would be nice. Thank you very much! Best regards, - Luis Cappa
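One hedged way to get the behavior Luis is after (re-crawling a URL overwrites the old document rather than duplicating it) is to derive the key deterministically from a normalized form of the URL. This is a sketch of the idea, not a recommendation from the thread, and the normalization rules are assumptions:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def crawl_key(url):
    """Build a uniqueKey from a normalized URL so that re-crawling the
    same page produces the identical key, and thus an overwrite,
    provided routing by key is deterministic. Normalization here
    (lowercased scheme/host, dropped fragment) is a minimal example."""
    parts = urlsplit(url.strip())
    normalized = urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                             parts.path or "/", parts.query, ""))
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# Cosmetic differences in the URL yield the same key:
assert crawl_key("http://Example.com/a") == crawl_key("http://example.com/a")
```

Note that a deterministic key only prevents duplicates if the indexing side also routes by that key; with round-robin shard selection, two crawls of the same URL can still land on different shards.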
Re: Distributed query: strange behavior.
Shawn, How is it possible for more than one document with the same unique key to appear in the index, even in different shards? Isn't it a bug by definition? What am I missing here? Thanks, Val
Re: Distributed query: strange behavior.
The uniqueKey is enforced within the same shard/index only. -- Regards, Shalin Shekhar Mangar.
Distributed query: strange behavior.
Hello, guys! I'm running Solr 4.3.0 and I've noticed a strange behavior during distributed query execution. Currently I have three Solr servers as shards, and when I do the following query...

http://localhost:11080/twitter/data/select?q=*:*&rows=10&shards=localhost:11080/twitter/data,localhost:12080/twitter/data,localhost:13080/twitter/data&wt=json

*Numfound* = 47131

I've queried each Solr shard server one by one and the total number of documents is correct. However, when I change the rows parameter from 10 to 100, the total numFound changes:

http://localhost:11080/twitter/data/select?q=*:*&rows=100&shards=localhost:11080/twitter/data,localhost:12080/twitter/data,localhost:13080/twitter/data&wt=json

*Numfound* = 47124

And if I set rows=50, the numFound count changes again:

http://localhost:11080/twitter/data/select?q=*:*&rows=50&shards=localhost:11080/twitter/data,localhost:12080/twitter/data,localhost:13080/twitter/data&wt=json

*Numfound* = 47129

What's happening here? Anybody know? Is it a distributed search bug or something? Thank you very much in advance! Best regards, -- - Luis Cappa
Re: Distributed query: strange behavior.
On 5/23/2013 1:51 AM, Luis Cappa Banda wrote: I've query each Solr shard server one by one and the total number of documents is correct. However, when I change rows parameter from 10 to 100 the total numFound of documents change: I've seen this problem on the list before and the cause has been determined each time to be caused by documents with the same uniqueKey value appearing in more than one shard. What I think happens here: With rows=10, you get the top ten docs from each of the three shards, and each shard sends its numFound for that query to the core that's coordinating the search. The coordinator adds up numFound, looks through those thirty docs, and arranges them according to the requested sort order, returning only the top 10. In this case, there happen to be no duplicates. With rows=100, you get a total of 300 docs. This time, duplicates are found and removed by the coordinator. I think that the coordinator adjusts the total numFound by the number of duplicate documents it removed, in an attempt to be more accurate. I don't know if adjusting numFound when duplicates are found in a sharded query is the right thing to do, I'll leave that for smarter people. Perhaps Solr should return a message with the results saying that duplicates were found, and if a config option is not enabled, the server should throw an exception and return a 4xx HTTP error code. One idea for a config parameter name would be allowShardDuplicates, but something better can probably be found. Thanks, Shawn
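Shawn's description of the coordinator's merge step can be sketched in a few lines. This is a toy model, not Solr's actual code: each shard result is a (numFound, docs) pair, where docs are (id, score) tuples:

```python
def merge_shard_results(shard_results, rows):
    """Toy model of the distributed-search coordinator: sum per-shard
    numFound, merge the returned docs by descending score, drop
    duplicate ids, and (as Shawn suspects Solr does) subtract the
    duplicates it saw from numFound."""
    num_found = sum(nf for nf, _ in shard_results)
    all_docs = [d for _, docs in shard_results for d in docs]
    merged, seen, dupes = [], set(), 0
    for doc_id, score in sorted(all_docs, key=lambda d: d[1], reverse=True):
        if doc_id in seen:
            dupes += 1   # same uniqueKey returned by two shards
            continue
        seen.add(doc_id)
        merged.append((doc_id, score))
    return num_found - dupes, merged[:rows]
```

The rows-dependence follows directly: with a small rows value, a duplicated document deep in one shard's ranking is never fetched, so numFound stays high; with a larger rows value the duplicate is fetched, detected, and subtracted, which matches Luis's shifting numFound.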
Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.
I'm having a similar problem. Did you by any chance try the suggestion here: https://issues.apache.org/jira/browse/SOLR-4080?focusedCommentId=13498055&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13498055 ?

Rakudten wrote: More info: - I'm trying to update the document by re-indexing the whole document again. I first retrieve the document by querying by its id, then delete it by its id, and re-index it including the new changes. - At the same time there are other index writing operations. *RESULT*: in most cases the document wasn't updated. Bad news... it smells like a critical bug. Regards, - Luis Cappa.

2012/11/22 Luis Cappa Banda: For more details, my indexing app is: 1. Multithreaded. 2. NRT indexing. 3. A web app with a REST API. It receives asynchronous requests that produce the atomic updates / document reindexations I mentioned before. I'm pretty sure the wrong behavior is related to CloudSolrServer and the fact that maybe you are trying to modify the index while an index update is in progress. Regards, - Luis Cappa.

2012/11/22 Luis Cappa Banda: Hello! I'm using a simple test configuration with nShards=1 and no replicas. CloudSolrServer is supposed to forward those index/update operations properly, isn't it? I tested with a complete document reindexation, not atomic updates, using the official LBHttpSolrServer (not my custom BinaryLBHttpSolrServer), and it doesn't work. I think this is not just a bug related to atomic updates via CloudSolrServer, but a general bug when an index changes with frequent reindexations/updates. Regards, - Luis Cappa.

2012/11/22 Sami Siren: It might even depend on the cluster layout! Let's say you have 2 shards (no replicas): if the doc belongs to the node you send it to, so that it does not get forwarded to another node, then the update should work; in the case where the doc gets forwarded to another node, the problem occurs. With replicas it could appear even stranger: the leader might have the doc right and the replica not. I only briefly looked at the bits that deal with this, so perhaps there's something more involved.

On Thu, Nov 22, 2012 at 8:29 PM, Luis Cappa Banda wrote: Hi, Sami! But isn't it strange that some documents were updated (atomic updates) correctly and others not? Couldn't it be a more serious problem, like some kind of index writer lock, or whatever? Regards, - Luis Cappa.

2012/11/22 Sami Siren: I think the problem is that even though you were able to work around the bug in the client, Solr still uses the XML format internally, so the atomic update (with a multivalued field) fails later down the stack. The bug you filed needs to be fixed to get the problem solved.

On Thu, Nov 22, 2012 at 8:19 PM, Luis Cappa Banda wrote: Hello everyone. I've started to seriously worry about SolrCloud due to a strange behavior that I have detected. The situation is the following: *1.* SolrCloud with one shard and two Solr instances. *2.* Indexing via SolrJ with CloudServer and a custom BinaryLBHttpSolrServer that uses BinaryRequestWriter to execute atomic updates correctly. Check SOLR-4080: https://issues.apache.org/jira/browse/SOLR-4080?focusedCommentId=13498055&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13498055 *3.* An asynchronous process partially updates some document fields. After that operation I automatically execute a commit, so the index must be reloaded. What I have checked is that, both using atomic updates and complete document reindexations, *random documents are not updated*, *even though I saw while debugging that the add() and commit() operations executed correctly and without errors*. Has anyone experienced a similar behavior? Is it possible that if an index update operation didn't finish and CloudSolrServer receives a new one, this second update operation doesn't complete? Thank you in advance. Regards, -- - Luis Cappa
Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.
Yes! I opened that issue. :-P Next week I'll test with the latest trunk artifacts and check whether the problem still happens. Regards, - Luis Cappa. On 25/11/2012 13:35, joe.cohe...@gmail.com wrote: I'm having a similar problem. Did you by any chance try the suggestion here: https://issues.apache.org/jira/browse/SOLR-4080?focusedCommentId=13498055&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13498055 ?
-- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-Very-strange-behavior-when-doing-atomic-updates
SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.
Hello everyone. I've started to seriously worry about SolrCloud due to some strange behavior I have detected. The situation is the following: *1.* SolrCloud with one shard and two Solr instances. *2.* Indexing via SolrJ with CloudSolrServer and a custom BinaryLBHttpSolrServer that uses BinaryRequestWriter so that atomic updates execute correctly. See SOLR-4080: https://issues.apache.org/jira/browse/SOLR-4080?focusedCommentId=13498055&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13498055 *3.* An asynchronous process partially updates some document fields. After that operation I automatically execute a commit, so the index must be reloaded. What I have observed is that with both atomic updates and complete document reindexing, *random documents are not updated*, *even though I saw while debugging that the add() and commit() operations executed correctly and without errors*. Has anyone experienced similar behavior? Is it possible that if one index update operation hasn't finished and CloudSolrServer receives a new one, the second update operation doesn't complete? Thank you in advance. Regards, -- - Luis Cappa
Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.
I think the problem is that even though you were able to work around the bug in the client, Solr still uses the XML format internally, so the atomic update (with a multivalued field) fails further down the stack. The bug you filed needs to be fixed to get the problem solved.
Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.
Hi, Sami! But isn't it strange that some documents were updated (atomic updates) correctly and others not? Couldn't it be a more serious problem, like some kind of index writer lock, or whatever? Regards, - Luis Cappa.
Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.
It might even depend on the cluster layout! Let's say you have 2 shards (no replicas): if the doc belongs to the node you send it to, so that it does not get forwarded to another node, then the update should work, and in the case where the doc gets forwarded to another node the problem occurs. With replicas it could appear even stranger: the leader might have the doc right and the replica not. I only briefly looked at the bits that deal with this, so perhaps there's something more involved.
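Sami's local-vs-forwarded distinction can be sketched as follows. This is an illustrative toy model only: the hash scheme (md5 modulo shard count) and all names are made up; Solr's real compositeId routing uses MurmurHash3 over hash ranges. The point is just that whether a document takes the forwarding path depends on which node you happen to send it to.

```python
import hashlib

# Toy hash-based document routing: a node that receives a document it does
# not own must forward it to the owning shard, and Sami's hypothesis is
# that the bug only surfaces on that forwarding path.

NUM_SHARDS = 2

def shard_for(doc_id: str) -> int:
    """Map a document id to a shard (toy hash routing, not Solr's)."""
    return int(hashlib.md5(doc_id.encode()).hexdigest(), 16) % NUM_SHARDS

def route(doc_id: str, received_by_shard: int) -> str:
    """'local' if the receiving shard owns the doc, else 'forwarded'."""
    return "local" if shard_for(doc_id) == received_by_shard else "forwarded"

owner = shard_for("doc1")
print(route("doc1", owner))      # local
print(route("doc1", 1 - owner))  # forwarded
```

With replicas added, the same split explains Sami's "leader right, replica wrong" scenario: the local write succeeds while a forwarded or replicated write is mishandled.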
Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.
Hello! I'm using a simple test configuration with nShards=1 without any replica. CloudSolrServer is supposed to forward those index/update operations properly, isn't it? I tested with a complete document reindexation, not atomic updates, using the official LBHttpSolrServer, not my custom BinaryLBHttpSolrServer, and it doesn't work. I think this is not just a bug related to atomic updates via CloudSolrServer, but a general bug when an index changes frequently through reindexations/updates. Regards, - Luis Cappa.
Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.
For more details, my indexing app is: 1. Multithreaded. 2. NRT indexing. 3. A web app with a REST API. It receives asynchronous requests that produce the atomic updates / document reindexations I mentioned before. I'm pretty sure the wrong behavior is related to CloudSolrServer and the fact that you may be trying to modify the index while another index update is in progress. Regards, - Luis Cappa.
Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.
More info: - I'm trying to update the document by re-indexing the whole document again. I first retrieve the document by querying on its id, then delete it by its id, and re-index it including the new changes. - At the same time there are other index writing operations. *RESULT*: in most cases the document wasn't updated. Bad news... it smells like a critical bug. Regards, - Luis Cappa.
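The retrieve/delete/re-index workflow described above is a classic read-modify-write sequence, and concurrent writers make it lossy even when every individual add() and commit() succeeds. A minimal deterministic sketch (a plain dict standing in for the index, a hand-interleaved second writer instead of real threads; all names illustrative):

```python
# Writer A reads a snapshot, writer B updates the live document, then
# writer A deletes and re-adds from its stale snapshot: B's write is lost
# without any error being reported anywhere.

index = {"doc1": {"id": "doc1", "views": 1}}

# Writer A: fetch the document, intending to delete and re-add it.
snapshot = dict(index["doc1"])

# Writer B: a concurrent update lands between A's read and A's re-add.
index["doc1"]["views"] = 99

# Writer A: delete by id, then re-index from the (now stale) snapshot.
del index["doc1"]
snapshot["title"] = "updated"
index["doc1"] = snapshot

print(index["doc1"]["views"])  # 1 -- writer B's update was silently lost
```

In later Solr releases the standard guard against exactly this race is optimistic concurrency via the _version_ field, which rejects stale updates instead of silently losing them.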
Field names w/ leading digits cause strange behavior
When specifying a field name that starts with a digit (or digits) in the fl parameter, Solr returns both the field name and the field value as those digits. For example, using nightly build apache-solr-4.0-2012-04-24_08-27-47 I run: java -jar start.jar and java -jar post.jar solr.xml monitor.xml. If I then add a field that starts with a digit to the field list ( localhost:8983/solr/select?q=*:*&fl=24 ) the results look like: ... <doc><long name="24">24</long></doc> ... If I try fl=24_7 it looks like everything after the underscore is truncated: ... <doc><long name="24">24</long></doc> ... And if I try fl=3test it looks like everything after the last digit is truncated: ... <doc><long name="3">3</long></doc> ... If I have an actual value for that field (say I've indexed 24_7 to be true) I get back that value as well as the behavior above: ... <doc><bool name="24_7">true</bool><long name="24">24</long></doc> ... Is it OK to have fields that start with digits? If so, is there a different way to specify them using the fl parameter? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Field-names-w-leading-digits-cause-strange-behavior-tp3936354p3936354.html Sent from the Solr - User mailing list archive at Nabble.com.
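One plausible reading of the symptom (an assumption on my part, not confirmed in this thread) is that the Solr 4 fl parser, which also accepts function queries and numeric constants, greedily consumes a leading run of digits as a constant pseudo-field whose name and value are both those digits. A toy version of that greedy scan reproduces the observed truncation:

```python
# Toy greedy scanner, not Solr's actual fl parsing code: a leading digit
# run becomes a (name, value) constant; everything after it is dropped.

def parse_fl_entry(entry: str):
    """Return (name, value) for a leading-digit entry, mimicking the
    observed response, or None for an ordinary field name."""
    i = 0
    while i < len(entry) and entry[i].isdigit():
        i += 1
    if i == 0:
        return None  # no leading digits: treated as a normal field lookup
    return (entry[:i], int(entry[:i]))

print(parse_fl_entry("24_7"))   # ('24', 24) -- matches <long name="24">24</long>
print(parse_fl_entry("3test"))  # ('3', 3)
print(parse_fl_entry("name"))   # None
```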
Re: Field names w/ leading digits cause strange behavior
Hmmm, this does NOT happen on 3.6, and it DOES happen on trunk. Sure sounds like a JIRA to me; would you mind raising one? I can't imagine this is desired behavior, it's just weird. Thanks for pointing this out! Erick
Re: Field names w/ leading digits cause strange behavior
Thank you for verifying the issue. I've created a ticket at https://issues.apache.org/jira/browse/SOLR-3407 -- View this message in context: http://lucene.472066.n3.nabble.com/Field-names-w-leading-digits-cause-strange-behavior-tp3936354p3936599.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Strange behavior with search on empty string and NOT
: Would it be a good idea to have Solr throw a syntax error if an empty string : query occurs? Erick's explanation wasn't very precise... Solr doesn't have any special handling of empty strings, but what you are searching for *might* be a totally valid query depending on how the field type is configured (i.e., a str field, or keywordtokenizer, etc.). In your case, you seem to be searching for "" in a field whose analyzer produces no tokens for "", so it falls out of the query. -Hoss
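Hoss's point, that the clause disappears because the analyzer emits no tokens for it, can be modeled roughly like this. It is a simplified sketch (a stand-in whitespace analyzer and a flat clause list), not the actual Lucene query parser:

```python
# Each clause's text goes through the field's analyzer; a clause whose
# analysis yields no tokens is silently dropped from the BooleanQuery
# rather than raising a syntax error.

def analyze(text: str):
    """Stand-in analyzer: whitespace tokenizer + lowercase filter."""
    return [t.lower() for t in text.split()]

def parse_clauses(clauses):
    """clauses: (field, text, negated) triples. Clauses whose analyzed
    text is empty fall out, mirroring the parsed-query debug output."""
    out = []
    for field, text, negated in clauses:
        tokens = analyze(text)
        if not tokens:
            continue  # "" analyzes to nothing, so the clause vanishes
        out.append(("-" if negated else "") + field + ":" + " ".join(tokens))
    return out

# name:"" AND NOT name:FOOBAR -> only the prohibited clause survives,
# leaving a purely negative query (hence "everything except FOOBAR").
print(parse_clauses([("name", "", False), ("name", "FOOBAR", True)]))
# ['-name:foobar']
```

That purely negative parsed query is exactly why the second query matches the whole index instead of 0 documents.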
Re: Strange behavior with search on empty string and NOT
Would it be a good idea to have Solr throw a syntax error if an empty-string query occurs? -- View this message in context: http://lucene.472066.n3.nabble.com/Strange-behavior-with-search-on-empty-string-and-NOT-tp3818023p3823572.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Strange behavior with search on empty string and NOT
Because Lucene query syntax is not a strict Boolean logic system. There's a good explanation here: http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/ Adding debugQuery=on to your search is your friend. You'll see that your query (at least on 3.5, going at /solr/select) returns this as the parsed query: <str name="parsedquery">-name:foobar</str> Solr really doesn't have semantics for empty strings (or NULL, for that matter), so the clause just gets dropped. Best, Erick
Strange behavior with search on empty string and NOT
I am curious why Solr results are inconsistent for the queries below, an empty-string search on a TextField. q=name:"" returns 0 results. q=name:"" AND NOT name:FOOBAR returns all results in the Solr index. Shouldn't it also return 0 results? Here is the debugQuery output:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="debugQuery">on</str>
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name:"" AND NOT name:BLAH232282</str>
      <str name="rows">0</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="3790790" start="0"/>
  <lst name="debug">
    <str name="rawquerystring">name:"" AND NOT name:BLAH232282</str>
    <str name="querystring">name:"" AND NOT name:BLAH232282</str>
    <str name="parsedquery">-PhraseQuery(name:"blah 232282")</str>
    <str name="parsedquery_toString">-name:"blah 232282"</str>
    <lst name="explain"/>
    <str name="QParser">LuceneQParser</str>
    <lst name="timing">
      <double name="time">1.0</double>
      <lst name="prepare">
        <double name="time">1.0</double>
        <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">1.0</double></lst>
        <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.StatsComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.DebugComponent"><double name="time">0.0</double></lst>
      </lst>
      <lst name="process">
        <double name="time">0.0</double>
        <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.StatsComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.DebugComponent"><double name="time">0.0</double></lst>
      </lst>
    </lst>
  </lst>
</response>

-- View this message in context: http://lucene.472066.n3.nabble.com/Strange-behavior-with-search-on-empty-string-and-NOT-tp3818023p3818023.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: strange behavior of scores and term proximity use
You might try with a less fraught search phrase; "to be or not to be" is a classic query that may be all stop words. Otherwise, I'm clueless.

On Wed, Nov 23, 2011 at 3:15 PM, Ariel Zerbib ariel.zer...@gmail.com wrote:
I tested with the version 4.0-2011-11-04_09-29-42.

Ariel
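Erick's stop-word point can be sketched quickly (plain Python, not Solr analysis; the stop set below is an assumption mirroring Lucene's classic English defaults):

```python
# Sketch: with a typical English stop-word list, every token of
# "to be or not to be" is removed at analysis time, so the phrase can end
# up contributing nothing to the query.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "but", "by",
              "for", "if", "in", "into", "is", "it", "no", "not", "of",
              "on", "or", "such", "that", "the", "their", "then", "there",
              "these", "they", "this", "to", "was", "will", "with"}

def analyze(text):
    # whitespace tokenize, lowercase, drop stop words
    return [t for t in text.lower().split() if t not in STOP_WORDS]

print(analyze("to be or not to be"))                      # every token is a stop word
print(analyze("og54ct8n to be or not to be 5w8ojsx2"))    # only the markers survive
```

This is why wrapping the phrase in rare marker tokens (og54ct8n ... 5w8ojsx2), as in the query below, changes what actually gets matched.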
Re: strange behavior of scores and term proximity use
I tested with the version 4.0-2011-11-04_09-29-42.

Ariel

2011/11/17 Erick Erickson erickerick...@gmail.com:
Hmmm, I'm not seeing similar behavior on a trunk from today; when did you get your copy?

Erick
Re: strange behavior of scores and term proximity use
Hmmm, I'm not seeing similar behavior on a trunk from today; when did you get your copy?

Erick

On Wed, Nov 16, 2011 at 2:06 PM, Ariel Zerbib ariel.zer...@gmail.com wrote:
Hi, for this term proximity query: ab_main_title_l0:"to be or not to be"~1000
strange behavior of scores and term proximity use
Hi,

For this term proximity query: ab_main_title_l0:"to be or not to be"~1000

http://localhost:/solr/select?q=ab_main_title_l0%3A%22og54ct8n+to+be+or+not+to+be+5w8ojsx2%22~1000&sort=score+desc&start=0&rows=3&fl=ab_main_title_l0%2Cscore%2Cid&debugQuery=true

The first three results are the following:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">5</int>
  </lst>
  <result name="response" numFound="318" start="0" maxScore="3.0814114">
    <doc>
      <long name="id">2315190010001021</long>
      <arr name="ab_main_title_l0"><str>og54ct8n To be or not to be a Jew. 5w8ojsx2</str></arr>
      <float name="score">3.0814114</float>
    </doc>
    <doc>
      <long name="id">2313006480001021</long>
      <arr name="ab_main_title_l0"><str>og54ct8n To be or not to be 5w8ojsx2</str></arr>
      <float name="score">3.0814114</float>
    </doc>
    <doc>
      <long name="id">2356410250001021</long>
      <arr name="ab_main_title_l0"><str>og54ct8n Rumspringa : to be or not to be Amish / 5w8ojsx2</str></arr>
      <float name="score">3.0814114</float>
    </doc>
  </result>
  <lst name="debug">
    <str name="rawquerystring">ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000</str>
    <str name="querystring">ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000</str>
    <str name="parsedquery">PhraseQuery(ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000)</str>
    <str name="parsedquery_toString">ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000</str>
    <lst name="explain">
      <str name="2315190010001021">
5.337161 = (MATCH) weight(ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000 in 378403) [DefaultSimilarity], result of:
  5.337161 = fieldWeight in 378403, product of:
    0.57735026 = tf(freq=0.3334), with freq of:
      0.3334 = phraseFreq=0.3334
    29.581549 = idf(), sum of:
      1.0012436 = idf(docFreq=3297332, maxDocs=3301436)
      3.0405464 = idf(docFreq=429046, maxDocs=3301436)
      5.3583193 = idf(docFreq=42257, maxDocs=3301436)
      4.3826413 = idf(docFreq=112108, maxDocs=3301436)
      6.3982043 = idf(docFreq=14937, maxDocs=3301436)
      3.0405464 = idf(docFreq=429046, maxDocs=3301436)
      5.3583193 = idf(docFreq=42257, maxDocs=3301436)
      1.0017256 = idf(docFreq=3295743, maxDocs=3301436)
    0.3125 = fieldNorm(doc=378403)
      </str>
      <str name="2313006480001021">
9.244234 = (MATCH) weight(ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000 in 482807) [DefaultSimilarity], result of:
  9.244234 = fieldWeight in 482807, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = phraseFreq=1.0
    29.581549 = idf(), sum of:
      1.0012436 = idf(docFreq=3297332, maxDocs=3301436)
      3.0405464 = idf(docFreq=429046, maxDocs=3301436)
      5.3583193 = idf(docFreq=42257, maxDocs=3301436)
      4.3826413 = idf(docFreq=112108, maxDocs=3301436)
      6.3982043 = idf(docFreq=14937, maxDocs=3301436)
      3.0405464 = idf(docFreq=429046, maxDocs=3301436)
      5.3583193 = idf(docFreq=42257, maxDocs=3301436)
      1.0017256 = idf(docFreq=3295743, maxDocs=3301436)
    0.3125 = fieldNorm(doc=482807)
      </str>
      <str name="2356410250001021">
5.337161 = (MATCH) weight(ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000 in 1317563) [DefaultSimilarity], result of:
  5.337161 = fieldWeight in 1317563, product of:
    0.57735026 = tf(freq=0.3334), with freq of:
      0.3334 = phraseFreq=0.3334
    29.581549 = idf(), sum of:
      1.0012436 = idf(docFreq=3297332, maxDocs=3301436)
      3.0405464 = idf(docFreq=429046, maxDocs=3301436)
      5.3583193 = idf(docFreq=42257, maxDocs=3301436)
      4.3826413 = idf(docFreq=112108, maxDocs=3301436)
      6.3982043 = idf(docFreq=14937, maxDocs=3301436)
      3.0405464 = idf(docFreq=429046, maxDocs=3301436)
      5.3583193 = idf(docFreq=42257, maxDocs=3301436)
      1.0017256 = idf(docFreq=3295743, maxDocs=3301436)
    0.3125 = fieldNorm(doc=1317563)
      </str>
    </lst>
  </lst>
</response>

The version used is a 4.0 October snapshot.

I have two questions about the result:
- Why are the debug explain scores and the scores in the result different? The debug scores seem to be well ordered, but the result scores seem to be wrong.
- What is the expected behavior of this kind of term proximity query?

Thanks,
Ariel
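As a sanity check on the numbers above (a rough recomputation, not the actual Lucene code): the explain output computes fieldWeight = sqrt(phraseFreq) * idf * fieldNorm, where the sloppy phraseFreq of 0.3334 is 1/3 rounded for display. Reproducing the 9.244234 and 5.337161 figures suggests the per-document explain arithmetic is internally consistent, so the mismatch with the returned scores lies in some normalization applied to one but not the other:

```python
import math

# Rough recomputation (not Lucene source) of DefaultSimilarity's fieldWeight
# as shown in the explain output: sqrt(phraseFreq) * idf * fieldNorm.
def field_weight(phrase_freq, idf_sum, field_norm):
    return math.sqrt(phrase_freq) * idf_sum * field_norm

IDF_SUM = 29.581549   # sum of the eight per-term idf values in the explain
FIELD_NORM = 0.3125   # fieldNorm reported for all three documents

exact_match = field_weight(1.0, IDF_SUM, FIELD_NORM)    # explain shows 9.244234
sloppy_match = field_weight(1/3, IDF_SUM, FIELD_NORM)   # explain shows 5.337161 (freq displayed as 0.3334)
print(round(exact_match, 6), round(sloppy_match, 6))
```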
Re: Strange behavior
Have you stopped Solr before manually copying the data? This way you can be sure that the index is the same and you didn't have any new docs on the fly.

2011/6/14 Denis Kuzmenok forward...@ukr.net:
What should I provide? The OS is the same, the environment is the same, Solr is completely copied, searches work except that one, and that is strange.
Strange behavior
Hi. I've debugged search on a test machine. After copying the entire Solr directory to the production server, I've noticed that one query (SDR S70EE K) matches on the test server but not on production. How can that be?
Re: Strange behavior
I think you will need to provide more information than this; no-one on this list is omniscient, AFAIK.

François

On Jun 14, 2011, at 10:44 AM, Denis Kuzmenok wrote:
Hi. I've debugged search on a test machine. After copying the entire Solr directory to the production server, I've noticed that one query (SDR S70EE K) matches on the test server but not on production. How can that be?
Re: Strange behavior
What should I provide? The OS is the same, the environment is the same, Solr is completely copied, searches work except that one, and that is strange.

I think you will need to provide more information than this; no-one on this list is omniscient, AFAIK.

François
Re: Strange behavior
Well, you could provide the results with debugQuery=on. You could provide the schema.xml and solrconfig.xml files for both. You could provide a listing of your index files. You could provide some evidence that you've tried chasing down your problem using tools like Luke or the Solr admin interface. Something, please... You might also review: http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

2011/6/14 Denis Kuzmenok forward...@ukr.net:
What should I provide? The OS is the same, the environment is the same, Solr is completely copied, searches work except that one, and that is strange.
strange behavior of echoParams
Dear list,

After setting echoParams to none, wildcard search isn't working. Only if I set echoParams to explicit is wildcard search possible. http://wiki.apache.org/solr/CoreQueryParameters states that echoParams is for debugging purposes. We use Solr 3.1.0.

Snippet from solrconfig.xml:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">none</str>
    <!-- <str name="echoParams">explicit</str> -->
    <str name="wt">xml</str>
    <int name="rows">10</int>
  </lst>
</requestHandler>

Any explanation for this behavior?

Regards,
Bernd
Re: strange behavior of echoParams
What does the parsed query look like with debugQuery=true for both scenarios? Any difference? It doesn't make any sense that echoParams would have an effect, unless somehow your search client is relying on the parameters returned to do something with them?!

Erik

On Apr 13, 2011, at 09:57, Bernd Fehling wrote:
Dear list, after setting echoParams to none, wildcard search isn't working. Only if I set echoParams to explicit is wildcard search possible. We use Solr 3.1.0.
Re: strange behavior of echoParams
Hi Erik,

Never mind. I can't reproduce this strange behavior. Obviously, stopping and starting Solr solved it.

Thanks,
Bernd

On 13.04.2011 16:00, Erik Hatcher wrote:
What does the parsed query look like with debugQuery=true for both scenarios? Any difference?
RE: Strange behavior for certain words
Hi,

Thanks for your response. Attached are the schema.xml and sample docs that were indexed. The query and response are below. The attachment Prodsku4270257.xml has a field paymenttype whose value is 'prepaid'.

query: q=prepaid&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=json&debugQuery=on&explainOther=&hl=on

But you are populating your text field from deviceType, features, description, and color. paymentType is not copied into text, so this behavior is normal. Either add this copyField declaration:

<copyField source="paymentType" dest="text"/>

or query the field directly: q=paymentType:prepaid
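The copyField point can be illustrated with a toy model (plain Python, not Solr; the example field values are made up): only fields copied into the catch-all text field are reachable by a bare q=prepaid search against the default field.

```python
# Toy model (not Solr) of copyField: a bare q=term query searches the
# default "text" field, which only contains values copied into it.
COPY_FIELD_SOURCES = {"deviceType", "features", "description", "color"}

def build_text_field(doc, sources):
    # concatenate the copied source fields into the catch-all text field
    return " ".join(str(doc[f]).lower() for f in sources if f in doc)

doc = {"paymentType": "prepaid", "description": "sample phone listing"}

without_copy = build_text_field(doc, COPY_FIELD_SOURCES)
with_copy = build_text_field(doc, COPY_FIELD_SOURCES | {"paymentType"})

print("prepaid" in without_copy)  # False: paymentType never reaches text
print("prepaid" in with_copy)     # True once copyField adds it
```

This matches the debug output below: the query parses to text:prepaid, and text simply never received the word.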
Strange behavior for certain words
Hi,

We are trying to use Solr for searching our catalog online, and during QA we came across an interesting case where Solr is not returning results that it should. Specifically, we have indexed things like Title and Description, and two of the words in the Title happen to be 'Prepaid' and 'Postpaid'. However, when we search on those words, Solr does not return any results. But if we search on some other words in the same title in which the word Prepaid occurs, then the correct results are returned. In fact, Solr even returns the result count for the Prepaid and Postpaid facets. We know that there are no synonyms associated with those words, and they are also not in any other list such as stopwords.txt. Any idea as to why this should be happening?

Thanks in advance,
Rama
Re: Strange behavior for certain words
Hmmm, there's not much information to go on here. You might review this page: http://wiki.apache.org/solr/UsingMailingLists and post with more information. At minimum: the field definitions, the query output (with debugQuery=on), perhaps what comes out of the analysis admin page for both indexing and querying the problem text, and whatever else you can think of that would help analyze the problem.

Best
Erick

On Wed, May 12, 2010 at 8:26 PM, RamaKrishna Atmakur ramkrishn...@hotmail.com wrote:
Hi, we are trying to use Solr for searching our catalog online, and during QA we came across an interesting case where Solr is not returning results that it should.
RE: Strange behavior for certain words
Hi,

Thanks for your response. Attached are the schema.xml and sample docs that were indexed. The query and response are below. The attachment Prodsku4270257.xml has a field paymenttype whose value is 'prepaid'.

query: q=prepaid&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=json&debugQuery=on&explainOther=&hl=on

Result:

{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "wt": "json",
      "debugQuery": "on",
      "start": "0",
      "rows": "10",
      "explainOther": "",
      "indent": "on",
      "fl": "*,score",
      "hl": "on",
      "qt": "standard",
      "version": "2.2",
      "q": "prepaid",
      "hl.fl": ""
    }
  },
  "response": {"numFound": 0, "start": 0, "maxScore": 0.0, "docs": []},
  "highlighting": {},
  "debug": {
    "rawquerystring": "prepaid",
    "querystring": "prepaid",
    "parsedquery": "text:prepaid",
    "parsedquery_toString": "text:prepaid",
    "explain": {},
    "QParser": "OldLuceneQParser",
    "timing": {
      "time": 0.0,
      "prepare": {
        "time": 0.0,
        "org.apache.solr.handler.component.QueryComponent": {"time": 0.0},
        "org.apache.solr.handler.component.FacetComponent": {"time": 0.0},
        "org.apache.solr.handler.component.MoreLikeThisComponent": {"time": 0.0},
        "org.apache.solr.handler.component.HighlightComponent": {"time": 0.0},
        "org.apache.solr.handler.component.DebugComponent": {"time": 0.0}
      },
      "process": {
        "time": 0.0,
        "org.apache.solr.handler.component.QueryComponent": {"time": 0.0},
        "org.apache.solr.handler.component.FacetComponent": {"time": 0.0},
        "org.apache.solr.handler.component.MoreLikeThisComponent": {"time": 0.0},
        "org.apache.solr.handler.component.HighlightComponent": {"time": 0.0},
        "org.apache.solr.handler.component.DebugComponent": {"time": 0.0}
      }
    }
  }
}

Thanks and Regards,
Rama K Atmakur

Date: Wed, 12 May 2010 20:46:11 -0400
Subject: Re: Strange behavior for certain words
From: erickerick...@gmail.com
To: solr-user@lucene.apache.org

Hmmm, there's not much information to go on here. You might review this page: http://wiki.apache.org/solr/UsingMailingLists and post with more information.

[Attachment: schema.xml]

<?xml version="1.0" encoding="UTF-8" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements. See the NOTICE file distributed with this
 work for additional information regarding copyright ownership. The ASF
 licenses this file to You under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance with the
 License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 License for the specific language governing permissions and limitations
 under the License.
-->

<!-- This is the Solr schema file. This file should be named "schema.xml"
 and should be in the conf directory under the solr home
 (i.e. ./solr/conf/schema.xml by default) or located where the classloader
 for the Solr webapp can find it.

 This example schema is the recommended starting point for users. It should
 be kept correct and concise, usable out-of-the-box.

 For more information on how to customize this file, please see
 http://wiki.apache.org/solr/SchemaXml
-->

<schema name="attcatalog" version="1.1">
  <!-- attribute "name" is the name of this schema and is only used for
   display purposes. Applications should change this to reflect the nature
   of the search collection. version="1.1" is Solr's version number for the
   schema syntax and semantics. It should not normally be changed by
   applications
RE: Strange behavior for certain words
Hi Rama,

What field types are Title and Description? You may go to the Solr admin console, try Analysis, select the field type that you used for Title and Description, enter the words Prepaid and Postpaid in the indexing analyzer, and see how it stores the information.

regards,
Naga Ranjan

-----Original Message-----
From: RamaKrishna Atmakur [mailto:ramkrishn...@hotmail.com]
Sent: Thursday, May 13, 2010 5:57 AM
To: solr-user@lucene.apache.org
Subject: Strange behavior for certain words
Re: Strange Behavior When Using CSVRequestHandler
Erick - thanks very much, all of this makes sense. But the one thing I still find puzzling is the fact that re-adding the file a second, third, fourth etc. time causes numDocs to increase, and ALWAYS by the same amount (141,645). Any ideas as to what could cause that?

Dan

Erick Erickson wrote:
I think the root of your problem is that unique fields should NOT be multivalued. See http://wiki.apache.org/solr/FieldOptionsByUseCase?highlight=(unique)|(key)
In this case, since you're tokenizing, your query field is implicitly multi-valued; I don't know what the behavior will be. But there's another problem: all the filters in your analyzer definition will mess up the correspondence between the Unix uniq and numDocs even if you got by the above. I.e.:
StopFilter would make the lines "a problem" and "the problem" identical.
WordDelimiter would do all kinds of interesting things.
LowerCaseFilter would make "Myproblem" and "myproblem" identical.
RemoveDuplicatesFilter would make "interesting interesting" and "interesting" identical.
You could define a second field, make *that* one unique and NOT analyze it in any way... You could hash your sentences and define the hash as your unique key. You could

HTH
Erick

On Wed, Jan 6, 2010 at 1:06 PM, danben dan...@gmail.com wrote:
The problem: Not all of the documents that I expect to be indexed are showing up in the index.
The background: I start off with an empty index based on a schema with a single field named 'query', marked as unique and using the following analyzer:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

My input is a UTF-8 encoded file with one sentence per line. Its total size is about 60MB. I would like each line of the file to correspond to a single document in the Solr index. If I print the number of unique lines in the file (using cat | sort | uniq | wc -l), I get a little over 2M. Printing the total number of lines in the file gives me around 2.7M.

I use the following to start indexing:

curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%09&stream.file=/home/gkropitz/querystage2map/file1&stream.contentType=text/plain;charset=utf-8&fieldnames=query&escape=\'

When this command completes, I see numDocs is approximately 470k (which is what I find strange) and maxDocs is approximately 890k (which is fine, since I know I have around 700k duplicates). Even more confusing is that if I run this exact command a second time without performing any other operations, numDocs goes up to around 610k, and a third time brings it up to about 750k. Can anyone tell me what might cause Solr not to index everything in my input file the first time, and why it would be able to index new documents the second and third times?

I also have this line in solrconfig.xml, if it matters:

<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048"/>

Thanks,
Dan
Re: Strange Behavior When Using CSVRequestHandler
It puzzles me too. I don't know the internals of that code well enough to speculate, but once you're into undefined behavior, I have great faith in *many* inexplicable things happening.

Erick
Strange Behavior When Using CSVRequestHandler
The problem: Not all of the documents that I expect to be indexed are showing up in the index.

The background: I start off with an empty index based on a schema with a single field named 'query', marked as unique and using the following analyzer:

  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

My input is a utf-8 encoded file with one sentence per line. Its total size is about 60MB. I would like each line of the file to correspond to a single document in the solr index. If I print the number of unique lines in the file (using cat | sort | uniq | wc -l), I get a little over 2M. Printing the total number of lines in the file gives me around 2.7M. I use the following to start indexing:

  curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%09&stream.file=/home/gkropitz/querystage2map/file1&stream.contentType=text/plain;charset=utf-8&fieldnames=query&escape=\'

When this command completes, I see numDocs is approximately 470k (which is what I find strange) and maxDocs is approximately 890k (which is fine since I know I have around 700k duplicates). Even more confusing is that if I run this exact command a second time without performing any other operations, numDocs goes up to around 610k, and a third time brings it up to about 750k. Can anyone tell me what might cause Solr not to index everything in my input file the first time, and why it would be able to index new documents the second and third times?

I also have this line in solrconfig.xml, if it matters:

  <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />

Thanks,
Dan

-- View this message in context: http://old.nabble.com/Strange-Behavior-When-Using-CSVRequestHandler-tp27026926p27026926.html Sent from the Solr - User mailing list archive at Nabble.com.
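Dan's count of total versus unique lines can be sanity-checked with a short script equivalent to his shell pipeline. This is an illustrative sketch with made-up sample data, not the 60MB file from the post:

```python
# Count total vs. unique lines in a one-sentence-per-line file, mirroring
# `cat file | sort | uniq | wc -l`. If the unique key matched the raw line
# exactly, numDocs after indexing should equal the unique count.
def line_counts(lines):
    """Return (total, unique) counts for an iterable of lines."""
    stripped = [ln.rstrip("\n") for ln in lines]
    return len(stripped), len(set(stripped))

# Inline sample standing in for the real input file.
sample = ["the cat sat\n", "a dog ran\n", "the cat sat\n"]
total, unique = line_counts(sample)
print(total, unique)  # 3 2
```

As the replies below note, because the 'query' field is analyzed, Solr's notion of "duplicate" will not match this raw-line count.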
Re: Strange Behavior When Using CSVRequestHandler
I think the root of your problem is that unique fields should NOT be multivalued. See http://wiki.apache.org/solr/FieldOptionsByUseCase?highlight=(unique)|(key)

In this case, since you're tokenizing, your query field is implicitly multi-valued; I don't know what the behavior will be. But there's another problem: all the filters in your analyzer definition will mess up the correspondence between the Unix uniq and numDocs even if you got by the above. I.e.:

StopFilter would make the lines "a problem" and "the problem" identical.
WordDelimiter would do all kinds of interesting things.
LowerCaseFilter would make "Myproblem" and "myproblem" identical.
RemoveDuplicatesFilter would make "interesting interesting" and "interesting" identical.

You could define a second field, make *that* one unique, and NOT analyze it in any way... You could hash your sentences and define the hash as your unique key. You could...

HTH
Erick
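Erick's hashing suggestion can be sketched as follows: derive the unique key by hashing each raw sentence and emit a two-column CSV, with the hash going to an unanalyzed uniqueKey field alongside the analyzed 'query' field. The 'id' field name here is an assumption for illustration, not from the thread:

```python
# Build a CSV where each row carries an MD5 hash of the raw sentence as a
# stable unique key. Identical lines produce identical keys, so Solr's
# overwrite-by-uniqueKey then matches `sort | uniq` exactly.
import csv
import hashlib
import io

def to_csv(sentences):
    buf = io.StringIO()
    w = csv.writer(buf)
    w.writerow(["id", "query"])  # 'id' would be the schema's uniqueKey
    for s in sentences:
        w.writerow([hashlib.md5(s.encode("utf-8")).hexdigest(), s])
    return buf.getvalue()

print(to_csv(["one sentence", "another sentence"]))
```

The resulting file could then be posted through the same /update/csv handler with fieldnames=id,query.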
RE: SOLR uniqueKey - extremely strange behavior! Documents disappeared...
UPDATE: Crazy stuff with the SLES10 SP2 default installation/partitioning: LVM (Logical Volume Manager) shows 400Gb available, but... I lost 90% of the index without even noticing it!

Aug 16, 2009 8:04:32 PM org.apache.solr.common.SolrException log
SEVERE: java.io.IOException: No space left on device
at java.io.RandomAccessFile.writeBytes(RandomAccessFile.java)

- then somehow no exceptions for a few hours, no corrupted index after several commits, then again not enough space, etc.; finally a corrupted index (still, SATA).

Thanks
SOLR uniqueKey - extremely strange behavior! Documents disappeared...
After running an application which heavily uses an MD5 HEX-representation as uniqueKey for SOLR v.1.4-dev-trunk:

1. After 30 hours: 101,000,000 documents added
2. Commit: numDocs = 783,714, maxDoc = 3,975,393
3. Upload new docs to SOLR during 1 hour(!!!), then commit, then optimize: numDocs=1,281,851, maxDocs=1,281,851

It looks _extremely_ strange that within an hour I have such a huge increase with the same 'average' document set... I am suspecting something goes wrong with the Lucene buffer flush / index merge OR SOLR unique ID handling... According to my own estimates, I should have about 10,000,000 new documents now... I had 0.5 million within an hour, and 0.8 million within a day; same 'random' documents. This morning the index size was about 4Gb, then suddenly dropped below 0.5 Gb. Why? I haven't issued any commit... I am using ramBufferMB=8192

-- View this message in context: http://www.nabble.com/SOLR-%3CuniqueKey%3E---extremely-strange-behavior%21-Documents-disappeared...-tp25017728p25017728.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOLR uniqueKey - extremely strange behavior! Documents disappeared...
I'd say you have a lot of documents that have the same id. When you add a doc with the same id, first the old one is deleted, then the new one is added (atomically, though). The deleted docs are not removed from the index immediately, though - the doc id is just marked as deleted. Over time, as segments are merged due to hitting triggers while adding new documents, deletes are removed (which deletes depends on which segments have been merged). So if you add a ton of documents over time, many with the same ids, you would likely see this type of maxDoc / numDoc churn. maxDoc will include deleted docs while numDoc will not.

--
- Mark
http://www.lucidimagination.com
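Mark's maxDoc/numDoc distinction can be modeled in a few lines. This is a toy sketch of the bookkeeping, not Lucene's actual implementation, and it assumes no segment merge has purged the deletions yet:

```python
# Toy model: adding a doc whose uniqueKey already exists marks the old copy
# deleted rather than removing it. numDocs counts live docs; maxDoc also
# counts deletions that merges have not yet reclaimed.
def index_stats(ids):
    live = set()   # uniqueKeys currently live
    deleted = 0    # old copies marked deleted, awaiting merge
    for i in ids:
        if i in live:
            deleted += 1
        live.add(i)
    return {"numDocs": len(live), "maxDoc": len(live) + deleted}

stats = index_stats(["a", "b", "a", "a", "c"])
print(stats)  # {'numDocs': 3, 'maxDoc': 5}
```

After an optimize (full merge), maxDoc would drop back to numDocs, matching the numDocs=maxDocs figures reported after optimize in the original post.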
Re: SOLR uniqueKey - extremely strange behavior! Documents disappeared...
But how to explain that within an hour (after commit) I have had about 500,000 new documents, and within 30 hours (after commit) only 1,300,000? Same _random_enough_ documents... BTW, the SOLR Console was showing only a few hundred deletesById although I don't use any deleteById explicitly; only update with allowOverwrite and uniqueId.

-- View this message in context: http://www.nabble.com/SOLR-%3CuniqueKey%3E---extremely-strange-behavior%21-Documents-disappeared...-tp25017728p25017826.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOLR uniqueKey - extremely strange behavior! Documents disappeared...
One more hour, and I have +0.5 million more (after commit/optimize). Something strange is happening with the SOLR buffer flush (if we have a single segment???)... an explicit commit prevents it...

30 hours, with index flush, commit: 783,714
+ 1 hour, commit, optimize: 1,281,851
+ 1 hour, commit, optimize: 1,786,552

Same random docs retrieved from the web...

-- View this message in context: http://www.nabble.com/SOLR-%3CuniqueKey%3E---extremely-strange-behavior%21-Documents-disappeared...-tp25017728p25017967.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOLR uniqueKey - extremely strange behavior! Documents disappeared...
UPDATE: After a few more minutes (after the previous commit): docsPending: about 7,000,000. After commit: numDocs: 2,297,231. Increase = 2,297,231 - 1,281,851 = 1,000,000 (average). So I have 7 docs with the same ID on average. Having 100,000,000 and then dropping below 1,000,000 is strange; it is a bug somewhere... need to investigate ramBufferSize and MergePolicy, including the SOLR uniqueId implementation...

-- View this message in context: http://www.nabble.com/SOLR-%3CuniqueKey%3E---extremely-strange-behavior%21-Documents-disappeared...-tp25017728p25018221.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOLR uniqueKey - extremely strange behavior! Documents disappeared...
Sorry for the typo in the previous message:

Increase = 2,297,231 - 1,786,552 = 500,000 (average)
RATE (non-unique-id : unique-id) = 7,000,000 : 500,000 = 14:1

but 125:1 (initial 30 hours) was very strange...

-- View this message in context: http://www.nabble.com/SOLR-%3CuniqueKey%3E---extremely-strange-behavior%21-Documents-disappeared...-tp25017728p25018263.html Sent from the Solr - User mailing list archive at Nabble.com.
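The corrected arithmetic above checks out:

```python
# Verify the figures in the correction: per-batch increase in numDocs and
# the resulting non-unique:unique ratio.
pending = 7_000_000
increase = 2_297_231 - 1_786_552   # new unique docs in the last batch
print(increase)                    # 510679, i.e. ~500,000 as stated
print(round(pending / increase))   # 14, the 14:1 rate
```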
Re: Strange behavior
On Feb 12, 2008 9:50 AM, Traut [EMAIL PROTECTED] wrote: Thank you, it works. Stemming filter works only with lowercased words?

I've never tried it in the order you have it. You could try the analysis admin page and report back what happens...

-Yonik
Strange behavior
Hi all. Please take a look at this strange behavior (connected with stemming, I suppose):

type:

  <fieldtype name="customTextField" class="solr.TextField" indexed="true" stored="false">
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldtype>

field:

  <field name="name" type="customTextField" indexed="true" stored="false"/>

I'm adding a document:

  <add><doc><field name="id">99</field><field name="name">Apple</field></doc></add>
  <commit/>

Querying name:apple - 0 results. Searching name:Apple - 1 result. But name:appl* - 1 result.

Adding the next document:

  <add><doc><field name="id">8</field><field name="name">Somenamele</field></doc></add>
  <commit/>

Searching for name:somenamele - 1 result, for name:Somenamele - 1 result.

What is the problem with Apple? Maybe StandardTokenizer understands it as a trademark :) ?

Thank you in advance

--
Best regards, Traut
Re: Strange behavior
Thank you, it works. Stemming filter works only with lowercased words?

--
Best regards, Traut
Re: Strange behavior
Try putting the stemmer after the lowercase filter.

-Yonik
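Yonik's fix can be illustrated with a toy pipeline. The "stemmer" below is a made-up one-rule stand-in for Porter, and it assumes (as the reported behavior suggests) that the real stemmer effectively leaves mixed-case tokens unstemmed:

```python
# Toy illustration of why filter order matters: a stemmer that only
# rewrites all-lowercase tokens, here just stripping a trailing "e".
def toy_stem(tok):
    return tok[:-1] if tok.islower() and tok.endswith("le") else tok

def analyze_stem_first(tok):   # Traut's original order: stem, then lowercase
    return toy_stem(tok).lower()

def analyze_lower_first(tok):  # Yonik's suggested order: lowercase, then stem
    return toy_stem(tok.lower())

# Index "Apple", query "apple": the original order indexes "apple" but
# searches for "appl" -> no match. The fixed order yields "appl" both ways.
print(analyze_stem_first("Apple"), analyze_stem_first("apple"))    # apple appl
print(analyze_lower_first("Apple"), analyze_lower_first("apple"))  # appl appl
```

This also explains why name:appl* matched: wildcard terms bypass the analysis chain and hit the indexed "apple" directly.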
Re: Strange behavior MoreLikeThis Feature
Now when I run the following query:

  http://localhost:8080/solr/mlt?q=id:neardup06&mlt.fl=features&mlt.mindf=1&mlt.mintf=1&mlt.displayTerms=details&wt=json&indent=on

try adding: &debugQuery=on to your query string and you can see why each document matches... My guess is that features uses a text field with stemming and a stemmed word matches

ryan
Re: Strange behavior MoreLikeThis Feature
Thanks Ryan. I now know the reason why. Before I explain the reason, let me correct the mistake I made in my earlier mail. I was not using the first document mentioned in the xml. Instead it was this one:

  <doc>
    <field name="id">IW-02</field>
    <field name="name">iPod &amp; iPod Mini USB 2.0 Cable</field>
    <field name="manu">Belkin</field>
    <field name="cat">electronics</field>
    <field name="cat">connector</field>
    <field name="features">car power adapter for iPod, white</field>
    <field name="weight">2</field>
    <field name="price">11.50</field>
    <field name="popularity">1</field>
    <field name="inStock">false</field>
  </doc>

The reason I was getting a strange result was the character "i". Here is what I learnt from the debug info:

  "debug": {
    "rawquerystring": "id:neardup06",
    "querystring": "id:neardup06",
    "parsedquery": "features:og features:en features:til features:er features:af features:der features:ts features:se features:i features:p features:pet features:brag features:efter features:zombier features:k features:tilbag features:ala features:sviner features:folk features:klassisk features:resid features:horder features:lidt features:man features:denn",
    "parsedquery_toString": "features:og features:en features:til features:er features:af features:der features:ts features:se features:i features:p features:pet features:brag features:efter features:zombier features:k features:tilbag features:ala features:sviner features:folk features:klassisk features:resid features:horder features:lidt features:man features:denn",
    "explain": {
      "id=IW-02,internal_docid=8": "\n0.0050230525 = (MATCH) product of:\n 0.12557632 = (MATCH) sum of:\n 0.12557632 = (MATCH) weight(features:i in 8), product of:\n 0.17474915 = queryWeight(features:i), product of:\n 1.9162908 = idf(docFreq=3)\n 0.09119135 = queryNorm\n 0.71860904 = (MATCH) fieldWeight(features:i in 8), product of:\n 1.0 = tf(termFreq(features:i)=1)\n 1.9162908 = idf(docFreq=3)\n 0.375 = fieldNorm(field=features, doc=8)\n 0.04 = coord(1/25)\n"
    }
  }

The field features uses the default fieldtype - text - in the schema.xml. The problem was solved by adding the character "i" to the stopwords.txt file: the "i"s in document 2 were matched with the "i" in iPod of document 1. I still have to figure out why a single character - i - matched the i in a word - iPod.

Regards,
Rishabh
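One possible explanation for the open question above (an assumption on my part, not confirmed in the thread): the default text fieldtype applies WordDelimiterFilter with splitOnCaseChange=1, which splits "iPod" at the lowercase-to-uppercase boundary into "i" and "Pod", so the standalone Danish "i" terms match it. A simplified sketch of that one split rule (the real filter handles many more cases):

```python
# Simplified model of WordDelimiterFilter's splitOnCaseChange=1: break a
# token wherever a lowercase run is followed by an uppercase letter.
import re

def split_on_case_change(token):
    return re.findall(r"[a-z0-9]+|[A-Z][a-z0-9]*", token)

print(split_on_case_change("iPod"))  # ['i', 'Pod']
```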