Questions for SynonymGraphFilter and WordDelimiterGraphFilter
Hello, We are upgrading to Solr 7.6.0 and noticed that SynonymFilter and WordDelimiterFilter have been deprecated. The Solr documentation recommends using SynonymGraphFilter and WordDelimiterGraphFilter instead. In our current schema we have a text field type whose index-time analysis chain has both SynonymFilter and WordDelimiterFilter configured. The Solr documentation states that the graph filters "produce correct token graphs, but cannot consume an input token graph correctly", and that when you use these graph filters during indexing, you must follow them with a FlattenGraphFilter. I am confused as to how to replace our filters with the new SynonymGraphFilter and WordDelimiterGraphFilter. A few questions:

1. Regarding the FlattenGraphFilter, is it to be used only once, or multiple times, after each graph filter? Can we configure it like this?

2. Is it possible to have two graph filters, i.e. both SynonymGraphFilter and WordDelimiterGraphFilter, in the same analysis chain? If not, what's the best option to replace our current config?

3. With the StopFilterFactory in between SynonymGraphFilter and WordDelimiterGraphFilter, I get a few index errors:

Exception writing document id XX to the index; possible analysis error
Caused by: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1

But if I move the StopFilter before the SynonymGraphFilter, the errors are gone. I guess the StopFilter messes up the SynonymGraphFilter output? Not sure if it's a Solr defect or there is a guideline that StopFilter should not be put after graph filters.

Thanks in advance for your input.

Thanks,
Wei
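For question 1, a minimal sketch of what a replacement chain could look like, using one commonly used arrangement consistent with the quoted guidance: a single FlattenGraphFilter placed once at the end of the index-time chain, after all graph filters. The tokenizer choice, filter options, and file names below are illustrative assumptions, not the poster's actual schema:

```xml
<!-- A minimal sketch only: tokenizer, filter options, and file names are
     illustrative assumptions, not the poster's actual config. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- per the poster's own observation, StopFilter placed BEFORE the graph filters -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- one FlattenGraphFilter at the end of the index chain, after all graph filters -->
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- query parsers consume token graphs directly, so no FlattenGraphFilter here -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```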
Re: Solr relevancy score different on replicated nodes
Ashish: Deleting and re-adding a replica is not a solution. Even if you did, the replicas would then be identical only until you started indexing again, at which point the stats could skew a bit.

When you index to NRT replicas, the wall-clock times that cause the commits to trigger will be different due to network delays. What happens essentially is that the doc gets indexed to the leader at time X but hits the replica Y milliseconds later. So on the leader, the autocommit interval expires at time X+Z (Z being your autocommit interval) but at X+Y+Z on the follower. However, some additional docs may have already been indexed on the leader but not yet on the follower when the autocommit trigger happens, so the newly-closed segment on the leader can have docs that the newly-closed segment on the follower does not have.

The point is that the termfreq does _not_ change when a document is deleted in some segment (and remember that an update is really a delete followed by an add). The data associated with deleted docs is not purged until segments are merged. Further, the decision about which segments to merge is influenced by how many documents are deleted in each. All of which means that the tf/idf statistics are different (slightly) and you either have to use distributed IDF or just live with it.

You're saying that the document count of live documents is different, and that's more concerning. Is this true for brief intervals, or is it true when there is _no_ indexing going on _and_ your autocommit interval is allowed to expire? In that case it's a different problem. However, if the condition is transitory and goes away if you stop indexing, then it's the same issue I outlined above; autocommit is happening at different wall-clock times.

Best,
Erick

On Fri, Jan 4, 2019 at 11:12 AM Ashish Bisht wrote:
>
> Hi Erick,
>
> I have updated that I am not facing this problem in a new collection.
>
> As per 3) I can try deleting a replica and adding it again, but the
> confusion is which one out of two should I delete. (wondering which replica
> is giving correct score for query)
>
> Both replicas give same number of docs while doing all query. Its strange
> that in query explain docCount and docFreq is differing.
>
> Regards
> Ashish
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Warnings in Zookeeper Server Logs
Hi (yes again): We have a simple architecture: 2 SOLR Cloud servers (on servers #1 and #2), and 3 zookeeper instances (on servers #1, #2, and #3). Things appear to work fine, and I have confirmed that our basic configuration is correct. But we are seeing TONS of the following warnings in all of our zookeeper server logs:

2019-01-04 14:48:04,266 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /XXX.YY.ZZZ.46:51516
2019-01-04 14:48:04,266 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@368] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client has closed socket
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
    at java.lang.Thread.run(Thread.java:748)
2019-01-04 14:48:04,266 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1044] - Closed socket connection for client /XXX.YY.ZZZ.46:51516 (no session established for client)

These messages seem to correspond to similar messages we are seeing in the application client-side logs. (I don’t see any messages that would indicate “Too many connections”.) Reading the log content, it seems to be saying that a connection is accepted, but then there is an "end of stream" exception. But our users are not experiencing any problems--they are searching SOLR like crazy. Any suggestions? Thanks! Joe

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Time consuming for insert record
On 12/25/2018 11:23 PM, jay harkhani wrote: We are using add method of CloudSolrClient for insert data into Solr Cloud Index. In specific scenario we need to insert record of around 3 MB document into Solr which takes 5-6 seconds. Is this a single document that's 3 MB in size, or many documents totaling 3 MB? If it's a single document, there's probably little you can do to make it faster other than shrinking the document. If it's many documents, then the way to increase speed would be to use multiple threads or multiple processes to index documents in parallel. If the "3 MB" you have mentioned means 3 million documents, then 5-6 seconds is *VERY* fast already, and it would be extremely difficult to improve on that. Thanks, Shawn
Re: SOLR v7 Security Issues Caused Denial of Use - Sonatype Application Composition Report
On 1/3/2019 11:15 AM, Bob Hathaway wrote: We want to use SOLR v7 but Sonatype scans past v6.5 show dozens of critical and severe security issues and dozens of licensing issues. None of the images that you attached to your message are visible to us. Attachments are regularly stripped by Apache mailing lists and cannot be relied on. Some of the security issues you've mentioned could be problems. But if you follow recommendations and make sure that Solr is not directly accessible to unauthorized parties, it will not be possible for those parties to exploit security issues without first finding and exploiting a vulnerability on an authorized system. Vulnerabilities in SolrJ, if any exist, are slightly different, but unless unauthorized parties have the ability to *directly* send input to SolrJ code without intermediate code sanitizing the input, they will not be able to exploit those vulnerabilities. JSON support in SolrJ is provided by noggit, not jackson, and JSON/XML are not used by recent versions of SolrJ unless they are very specifically requested by the programmer. Are there any vulnerabilities you've found that affect SolrJ itself, separately from the rest of Solr? As we become aware of issues with either project code or third-party software, we get them fixed. Sometimes it is not completely straightforward to upgrade to newer versions of third-party software, but staying current is a priority. Licensing issues are of major concern to the entire Apache Foundation. As a project, we are unaware of any licensing problems at this time. All of the third-party software that is included with Solr should be available under a license that is compatible with the Apache license. I didn't examine the list you sent super closely, but what I did look at didn't look like a problem. https://www.apache.org/legal/resolved.html#category-b The mere presence of GPL in the available licenses for third party software is not an indication of a problem. If that were the ONLY license available, then it would be a problem. Thanks, Shawn
Re: SOLR v7 Security Issues Caused Denial of Use - Sonatype Application Composition Report
Jackson-databind is actually not such an old version. The problem with jackson-databind is that for deserialization it has just a blacklist of objects not to deserialize, and it is impossible to keep that blacklist up to date. For version 3.0 they are changing to a whitelist approach, it seems, which should resolve those errors. Until then, all versions of databind based on a blacklist approach are vulnerable. BTW, this applies to all applications using that library. Spring Security has put additional items on top of that blacklist, so even if Nexus IQ shows a security issue with databind, you may have introduced additional means (e.g. you or someone else has worked on the blacklist) to be less vulnerable - Nexus IQ can't know that. BTW, this is also what they explain when you open the detail of the security assessment.

Then, it depends on how you deploy software such as Solr in your enterprise environment and the risks related to that. E.g. one could have introduced means as above. Most users don't have direct access to Solr itself but go through a custom application, so there is no “direct” attack possible.

Finally, the absence of findings in the report does not mean an application is secure.

> On 04.01.2019 at 19:27, Gus Heck wrote:
>
> Hi Bob,
>
> Wrt licensing, keep in mind that multi-licensed software allows you to
> choose which license you are using the software under. Also there's some
> good detail on the Apache policy here:
>
> https://www.apache.org/legal/resolved.html#what-can-we-not-include-in-an-asf-project-category-x
>
> One has to be careful with license scanners; often they have very
> conservative settings. I had to spend untold hours getting jfrog's license
> plugin to select the correct license and hunting down missing licenses when
> I finally sorted out licensing for JesterJ. (though MANY fewer hours than
> if I had done this by hand!)
>
>> On Fri, Jan 4, 2019, 11:17 AM Bob Hathaway wrote:
>>
>> The most important feature of any software running today is that it can be
>> run at all. Security vulnerabilities can preclude software from running in
>> enterprise environments. Today software must be free of critical and severe
>> security vulnerabilities or it can't be run at all under Information
>> Security policies. Enterprises today run security scan software to check
>> for security and licensing vulnerabilities, because today most organizations
>> are using open source software, where this has become most relevant.
>> Forrester has a good summary on the need for software composition analysis
>> tools, which virtually all enterprises run today before allowing software to
>> run in production environments:
>>
>> https://www.blackducksoftware.com/sites/default/files/images/Downloads/Reports/USA/ForresterWave-Rpt.pdf
>>
>> Solr version 6.5 passes security scans showing no critical security
>> issues. Solr version 7 fails security scans with over a dozen critical and
>> severe security vulnerabilities from version 7.1 on. Then we ran
>> scans against the latest Solr version 7.6, which failed as well. Most of
>> the issues are due to using old libraries, including the Jackson JSON
>> framework, dom4j, and Xerces, and should be easy to bring up to date. Only
>> the latest version of SimpleXML has severe security vulnerabilities. Derby
>> leads the most severe security violations at Level 9.1 by using an out of
>> date version.
>>
>> What good is software or any features if enterprises can't run them?
>> Today software cybersecurity is a top priority and risk for enterprises.
>> Solr version 6.5 is very old, exposing the zookeeper backend from the SolrJ
>> client, which is a differentiating capability.
>>
>> Is security and remediation a priority for SolrJ? I believe this should be
>> a top feature to allow SolrJ to continue providing search features to
>> enterprises, along with a security roadmap and plan to keep Solr secure and usable
>> by continually adapting and improving in the ever changing security
>> landscape and ecosystem. The Derby vulnerability CVE-2015-1832 was a
>> passing medium Level 6.2 issue in CVSS 2.0 last year but is the most
>> critical issue with Solr 7.6 at Level 9.1 in this year's CVSS 3.0. These
>> changes need to be tracked, and updates and fixes incorporated into new Solr
>> versions.
>> https://nvd.nist.gov/vuln/detail/CVE-2015-1832
>>
>>> On Thu, Jan 3, 2019 at 12:19 PM Bob Hathaway wrote:
>>>
>>> Critical and Severe security vulnerabilities against Solr v7.1. Many of
>>> these appear to be from old open source framework versions.
>>>
>>> *9* CVE-2017-7525 com.fasterxml.jackson.core : jackson-databind : 2.5.4 Open
>>> CVE-2016-1000031 commons-fileupload : commons-fileupload : 1.3.2 Open
>>> CVE-2015-1832 org.apache.derby : derby : 10.9.1.0 Open
>>> CVE-2017-7525 org.codehaus.jackson : jackson-mapper-asl : 1.9.13 Open
>>> CVE-2017-7657 org.eclipse.jetty : jetty-http : 9.3.20.v20170531 Open
Re: Solr relevancy score different on replicated nodes
Hi Erick,

I have updated that I am not facing this problem in a new collection.

As per 3) I can try deleting a replica and adding it again, but the confusion is which one of the two should I delete (wondering which replica is giving the correct score for the query).

Both replicas give the same number of docs when doing a match-all query. It's strange that in the query explain, docCount and docFreq differ.

Regards
Ashish

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: SOLR v7 Security Issues Caused Denial of Use - Sonatype Application Composition Report
Hi Bob,

Wrt licensing, keep in mind that multi-licensed software allows you to choose which license you are using the software under. Also there's some good detail on the Apache policy here:

https://www.apache.org/legal/resolved.html#what-can-we-not-include-in-an-asf-project-category-x

One has to be careful with license scanners; often they have very conservative settings. I had to spend untold hours getting jfrog's license plugin to select the correct license and hunting down missing licenses when I finally sorted out licensing for JesterJ. (though MANY fewer hours than if I had done this by hand!)

On Fri, Jan 4, 2019, 11:17 AM Bob Hathaway wrote:

> The most important feature of any software running today is that it can be
> run at all. Security vulnerabilities can preclude software from running in
> enterprise environments. Today software must be free of critical and severe
> security vulnerabilities or it can't be run at all under Information
> Security policies. Enterprises today run security scan software to check
> for security and licensing vulnerabilities, because today most organizations
> are using open source software, where this has become most relevant.
> Forrester has a good summary on the need for software composition analysis
> tools, which virtually all enterprises run today before allowing software to
> run in production environments:
>
> https://www.blackducksoftware.com/sites/default/files/images/Downloads/Reports/USA/ForresterWave-Rpt.pdf
>
> Solr version 6.5 passes security scans showing no critical security
> issues. Solr version 7 fails security scans with over a dozen critical and
> severe security vulnerabilities from version 7.1 on. Then we ran
> scans against the latest Solr version 7.6, which failed as well. Most of
> the issues are due to using old libraries, including the Jackson JSON
> framework, dom4j, and Xerces, and should be easy to bring up to date. Only
> the latest version of SimpleXML has severe security vulnerabilities. Derby
> leads the most severe security violations at Level 9.1 by using an out of
> date version.
>
> What good is software or any features if enterprises can't run them?
> Today software cybersecurity is a top priority and risk for enterprises.
> Solr version 6.5 is very old, exposing the zookeeper backend from the SolrJ
> client, which is a differentiating capability.
>
> Is security and remediation a priority for SolrJ? I believe this should be
> a top feature to allow SolrJ to continue providing search features to
> enterprises, along with a security roadmap and plan to keep Solr secure and usable
> by continually adapting and improving in the ever changing security
> landscape and ecosystem. The Derby vulnerability CVE-2015-1832 was a
> passing medium Level 6.2 issue in CVSS 2.0 last year but is the most
> critical issue with Solr 7.6 at Level 9.1 in this year's CVSS 3.0. These
> changes need to be tracked, and updates and fixes incorporated into new Solr
> versions.
> https://nvd.nist.gov/vuln/detail/CVE-2015-1832
>
> On Thu, Jan 3, 2019 at 12:19 PM Bob Hathaway wrote:
>
> > Critical and Severe security vulnerabilities against Solr v7.1. Many of
> > these appear to be from old open source framework versions.
> > *9* CVE-2017-7525 com.fasterxml.jackson.core : jackson-databind : 2.5.4 Open
> > CVE-2016-1000031 commons-fileupload : commons-fileupload : 1.3.2 Open
> > CVE-2015-1832 org.apache.derby : derby : 10.9.1.0 Open
> > CVE-2017-7525 org.codehaus.jackson : jackson-mapper-asl : 1.9.13 Open
> > CVE-2017-7657 org.eclipse.jetty : jetty-http : 9.3.20.v20170531 Open
> > CVE-2017-7658 org.eclipse.jetty : jetty-http : 9.3.20.v20170531 Open
> > CVE-2017-1000190 org.simpleframework : simple-xml : 2.7.1 Open
> >
> > *7* sonatype-2016-0397 com.fasterxml.jackson.core : jackson-core : 2.5.4 Open
> > sonatype-2017-0355 com.fasterxml.jackson.core : jackson-core : 2.5.4 Open
> > CVE-2014-0114 commons-beanutils : commons-beanutils : 1.8.3 Open
> > CVE-2018-1000632 dom4j : dom4j : 1.6.1 Open
> > CVE-2018-8009 org.apache.hadoop : hadoop-common : 2.7.4 Open
> > CVE-2017-12626 org.apache.poi : poi : 3.17-beta1 Open
> > CVE-2017-12626 org.apache.poi : poi-scratchpad : 3.17-beta1 Open
> > CVE-2018-1308 org.apache.solr : solr-dataimporthandler : 7.1.0 Open
> > CVE-2016-4434 org.apache.tika : tika-core : 1.16 Open
> > CVE-2018-11761 org.apache.tika : tika-core : 1.16 Open
> > CVE-2016-1000338 org.bouncycastle : bcprov-jdk15 : 1.45 Open
> > CVE-2016-1000343 org.bouncycastle : bcprov-jdk15 : 1.45 Open
> > CVE-2018-1000180 org.bouncycastle : bcprov-jdk15 : 1.45 Open
> > CVE-2017-7656 org.eclipse.jetty : jetty-http : 9.3.20.v20170531 Open
> > CVE-2012-0881 xerces : xercesImpl : 2.9.1 Open
> > CVE-2013-4002 xerces : xercesImpl : 2.9.1 Open
> >
> > On Thu, Jan 3, 2019 at 12:15 PM Bob Hathaway wrote:
> >
> >>
RE: [solr-solrcloud] How does DIH work when there are multiple nodes?
DIH is also not designed to multi-thread very well. One way I've handled this is to have a DIH XML that breaks up a database query across multiple processes by taking the modulo of a row ID (a reconstructed sketch of this approach appears after the quoted reply below). This allows me to do sub-queries within the entity, but it is often better to just write a small program to get this data from the database; ETL processors such as Pentaho DI (Kettle) and Talend DI do this quite well. If you can express what you want in a database view, even a complicated one, then your best way to get it into Solr IMO is to use logstash with the jdbc input plugin. It can do some transformation, but you'll need your database view to process the data.

> -----Original Message-----
> From: Shawn Heisey
> Sent: Friday, January 4, 2019 12:25 PM
> To: solr-user@lucene.apache.org
> Subject: Re: [solr-solrcloud] How does DIH work when there are multiple
> nodes?
>
> On 1/4/2019 1:04 AM, 유정인 wrote:
> > The reader was looking for a way to do 'DIH' automatically.
> >
> > The reason was for HA configuration.
>
> If you send a DIH request to the collection (as opposed to a specific
> core), that request will be load balanced across the cloud. You won't
> know which replica/core actually handles it. This means that an import
> command may be handled by a different host than a status command. In
> that situation, the status command will not know about the import,
> because it will be running on a different Solr core.
>
> When doing DIH on SolrCloud, you should send your requests directly to a
> specific core on a specific node. It's the only way to be sure what's
> happening. High availability would have to be handled in your application.
>
> Thanks,
> Shawn
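The XML itself did not survive the list, so the following is only a reconstructed sketch of the modulo approach described above: each of N parallel import requests passes a different bucket value, so every row lands in exactly one import. The driver, table, and column names are illustrative assumptions:

```xml
<!-- Hypothetical data-config.xml: start N imports in parallel, each with
     ?buckets=N&bucket=k (k = 0..N-1); MOD() assigns every row to one bucket -->
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://dbhost/db" user="solr" password="..."/>
  <document>
    <entity name="item"
            query="SELECT id, name FROM items
                   WHERE MOD(id, ${dataimporter.request.buckets}) = ${dataimporter.request.bucket}">
      <!-- sub-queries within the entity, as mentioned above -->
      <entity name="tag" query="SELECT tag FROM item_tags WHERE item_id = ${item.id}"/>
    </entity>
  </document>
</dataConfig>
```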
Re: [solr-solrcloud] How does DIH work when there are multiple nodes?
On 1/4/2019 1:04 AM, 유정인 wrote: The reader was looking for a way to do 'DIH' automatically. The reason was for HA configuration. If you send a DIH request to the collection (as opposed to a specific core), that request will be load balanced across the cloud. You won't know which replica/core actually handles it. This means that an import command may be handled by a different host than a status command. In that situation, the status command will not know about the import, because it will be running on a different Solr core. When doing DIH on SolrCloud, you should send your requests directly to a specific core on a specific node. It's the only way to be sure what's happening. High availability would have to be handled in your application. Thanks, Shawn
Re: Regarding Shards - Composite / Implicit , Replica Type - NRT / TLOG
On 1/3/2019 11:26 PM, Doss wrote: We are planning to setup a SOLR cloud with 6 nodes for 3 million records (expected to grow to 5 million in a year), with 150 fields and over all index would come around 120GB. We plan to use NRT with 5 sec soft commit and 1 min hard commit. Five seconds is likely far too short an interval. That's something you'll have to experiment with. Expected query volume would be 5000 select hits per second and 7000 inserts / updates per second. 5000 queries per second is an extremely high query rate. I would guess that six nodes is far too few to handle that much of a query load. It might also be plenty ... it's nearly impossible to gauge that with the information you've shared so far. Usually the only way to find out for sure is to actually BUILD the system and try it. 7000 documents inserted per second is also ambitious. It's achievable, but is almost certainly going to require parallel threads/processes indexing at the same time. That's going to reduce the query volume you can handle. If you expect 3 million documents to reach 120GB of index size, then each of those documents must be fairly large. Large documents will index more slowly, and can also reduce query capacity. Memory will be your biggest challenge. If a Solr instance must handle 120GB of index and achieve a high query volume, then you'll want that Solr instance to have about 128GB of memory, so the entire index will fit into the operating system disk cache. Our records can be classified under 15 categories, but they will not have even number of records, few categories will have more number of records. Queries will also come in the same pattern, that is., categories with high number of records will get high volume of select / updates. For this situation we are confused in choosing what type of sharding would help us in better performance in both select and updates? Composite / implicit - Composite with 15 shards or implicit based on 15 categories. 15 shards is probably far too many for only a few million documents, especially with the extremely high query volume and low host count you have projected. With a high query volume, you want the absolute minimum number of shards possible ... one if you can. Handling several million documents in a single shard is usually doable. Our select queries will have minimum 15 filters in fq, with extensive function queries used in sort. When a query has multiple filters, they will generally all be run in parallel, not sequentially. This can affect the query volume you can handle, it's very difficult to know whether the effect will be helpful or harmful. For our kind of situation which replica Type can we choose? All NRT or NRT with TLOG ? If you will only have two replicas, they should both be either NRT or TLOG. With more than two replicas, my suggestion would be to make two of them TLOG and the rest PULL. One of the TLOG replicas will be elected leader, and all other replicas will copy the index from the leader, rather than do the independent indexing that NRT replicas do. Thanks, Shawn
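For reference, the commit cadence discussed above lives inside the <updateHandler> section of solrconfig.xml; a sketch using the poster's stated intervals (values in milliseconds):

```xml
<autoCommit>
  <maxTime>60000</maxTime>        <!-- 1 min hard commit -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>5000</maxTime>         <!-- 5 sec soft commit; per the advice above, likely too short -->
</autoSoftCommit>
```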
Re: Regarding Shards - Composite / Implicit , Replica Type - NRT / TLOG
It's usually best to use compositeId routing. That distributes the load evenly. Otherwise, _you_ have to be responsible for making sure that the docs are reasonably evenly distributed, which can be a pain. Implicit routing is usually best in situations where you index to a particular shard for a while then move on to another shard; think news stories where you want to keep them for 30 days and then dispose of them. Implicit routing lets you add/remove shards on a daily basis. Doesn't sound particularly suitable for your situation.

But I do have to ask why you're sharding at all. 5M docs is a fairly small index by modern standards. There's some inevitable overhead with sharding that you could avoid. Mostly I'm asking if you've stress-tested with that query and update rate. The 7,000 updates/second do worry me a bit with a single-shard solution, but if you get adequate response times under that load, then there's no need to shard. Use all the hardware to support querying. Sharding will improve indexing throughput without doubt; Solr scales roughly linearly with the number of shards. Do use CloudSolrClient for your updates, as it routes docs to the correct leader, avoiding one extra hop.

Given your soft commit setting of 5 seconds, I infer that the allowable time for updates to become searchable is quite small, indicating that NRT replicas are the way to go. I'll also say that this commit rate is pretty aggressive given your volume; is it really necessary to be that short? Your caches are going to be pretty useless since they won't stick around for very long. Look carefully at the autowarming time: in order to make any good use of your filterCache, you'll have to autowarm it some, and if you do, you need to ensure that the autowarm interval is less than your autocommit time.

Best,
Erick

On Thu, Jan 3, 2019 at 10:34 PM Doss wrote:
>
> Hi,
>
> We are planning to setup a SOLR cloud with 6 nodes for 3 million records
> (expected to grow to 5 million in a year), with 150 fields and over all
> index would come around 120GB.
>
> We plan to use NRT with 5 sec soft commit and 1 min hard commit.
>
> Expected query volume would be 5000 select hits per second and 7000 inserts
> / updates per second.
>
> Our records can be classified under 15 categories, but they will not have
> even number of records, few categories will have more number of records.
>
> Queries will also come in the same pattern, that is., categories with high
> number of records will get high volume of select / updates.
>
> For this situation we are confused in choosing what type of sharding would
> help us in better performance in both select and updates?
>
> Composite / implicit - Composite with 15 shards or implicit based on 15
> categories.
>
> Our select queries will have minimum 15 filters in fq, with extensive
> function queries used in sort.
>
> Updates will have 6 integer fields, 5 string fields and 4 string/integer
> fields with multi valued.
>
> If we choose implicit to boost select performance, our updates will be
> heavy on few shards (major category shards), will this be a problem?
>
> For our kind of situation which replica Type can we choose? All NRT or NRT
> with TLOG ?
>
> Thanks in advance!
>
> Best,
> Doss.
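For context, the filterCache autowarming mentioned above is configured in the <query> section of solrconfig.xml; a sketch for a 7.x install (the sizes here are placeholders, not recommendations):

```xml
<!-- repopulates up to 32 entries from the old cache whenever a new searcher opens -->
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="32"/>
```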
Re: How to debug empty ParsedQuery from Edismax Query Parser
I'd like to follow up on this post here because it has become relevant to me now. I have set up a debugging environment and took a deep dive into the SOLR 7.6.0 source code, with Eclipse as my IDE of choice for this task. I have isolated the exact line where things fall apart for the two sample queries I have been testing with, which are "q=a3f*" and "q=aa3f*". As you can see, the only visible difference between the two search terms is that the second search term has two characters in succession before switching to a numerical portion.

First things first: the Extended Dismax Query Parser hands over portions of the parsing to the Standard Query Parser early in the parsing process. Following down the rabbit hole, I ended up in the SolrQueryParserBase.getPrefixQuery() method. On line 1173 of this method, we have the following statement:

termStr = analyzeIfMultitermTermText(field, termStr, schema.getFieldType(field));

This statement, when executing with the "a3f" search term, returns "a3f" as a result. However, when using "aa3f", it throws a SolrException with exactly the same multi-term error as shown below, only like this:

> analyzer returned too many terms for multiTerm term: aa3f

At this point, I would like to reiterate the purpose of our search: we are a part number house. We deal with millions of part numbers in our system and on our web site. A customer of ours typically searches our site with a given part number (or SKU if you will). Some part numbers are intelligent, and so customers might reduce the part number string to a portion at the beginning. Either way, it is *not* a typical "word" based search. Yet the system (Drupal) does treat those two query fields like standard "Text" search fields. Those who know Drupal Commerce will recognize the Title field of a node and possibly also the Product Variation (or SKU) field.

With that in mind, multi-term was introduced with SOLR 5, and I think this error (or limitation) has probably been in SOLR since version 5. Can anyone closer to the matter, or anyone who has struggled with this same issue, chime in on the subject?

Kind regards,
Kay

> On Dec 28, 2018, at 9:57 AM, Kay Wrobel wrote:
>
> Here are my log entries:
>
> SOLR 7.x (non-working)
> 2018-12-28 15:36:32.786 INFO (qtp1769193365-20) [ x:collection1]
> o.a.s.c.S.Request [collection1] webapp=/solr path=/select
> params={q=ac6023*=tm_field_product^21.0=tm_title_field^8.0=all=10=xml=true}
> hits=0 status=0 QTime=2
>
> SOLR 4.x (working)
> INFO - 2018-12-28 15:43:41.938; org.apache.solr.core.SolrCore; [collection1]
> webapp=/solr path=/select
> params={q=ac6023*=tm_field_product^21.0=tm_title_field^8.0=all=10=xml=true}
> hits=32 status=0 QTime=8
>
> EchoParams=all did not show anything different in the resulting XML from SOLR 7.x.
>
> I found out something curious yesterday.
> When I try to force the Standard query parser on SOLR 7.x using the same
> query, but adding "defType=lucene" at the beginning, SOLR 7 raises a
> SolrException with this message: "analyzer returned too many terms for
> multiTerm term: ac6023" (full response: https://pastebin.com/ijdBj4GF)
>
> Log entry for that request:
> 2018-12-28 15:50:58.804 ERROR (qtp1769193365-15) [ x:collection1]
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: analyzer
> returned too many terms for multiTerm term: ac6023
>     at org.apache.solr.schema.TextField.analyzeMultiTerm(TextField.java:180)
>     at org.apache.solr.parser.SolrQueryParserBase.analyzeIfMultitermTermText(SolrQueryParserBase.java:992)
>     at org.apache.solr.parser.SolrQueryParserBase.getPrefixQuery(SolrQueryParserBase.java:1173)
>     at org.apache.solr.parser.SolrQueryParserBase.handleBareTokenQuery(SolrQueryParserBase.java:781)
>     at org.apache.solr.parser.QueryParser.Term(QueryParser.java:421)
>     at org.apache.solr.parser.QueryParser.Clause(QueryParser.java:278)
>     at org.apache.solr.parser.QueryParser.Query(QueryParser.java:162)
>     at org.apache.solr.parser.QueryParser.TopLevelQuery(QueryParser.java:131)
>     at org.apache.solr.parser.SolrQueryParserBase.parse(SolrQueryParserBase.java:254)
>     at org.apache.solr.search.LuceneQParser.parse(LuceneQParser.java:49)
>     at org.apache.solr.search.QParser.getQuery(QParser.java:173)
>     at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:160)
>     at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:279)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:2541)
>     at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:709)
>     at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:515)
>     at
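A note for readers hitting the same "too many terms for multiTerm term" error: Solr allows an explicit multiterm analyzer on a TextField, which controls how wildcard/prefix terms are analyzed separately from the index and query analyzers. A hedged sketch (the field type name and tokenizer choices are illustrative, not the poster's Drupal-generated schema) that keeps prefix input such as aa3f* as a single token:

```xml
<!-- Illustrative field type: the explicit multiterm analyzer keeps wildcard
     input as one token, so prefix queries like aa3f* are not split apart -->
<fieldType name="text_partno" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```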
Re: Solr relevancy score different on replicated nodes
Replicated segments might have different deleted documents by design. Precise numbers can be achieved via exact stats. see https://lucene.apache.org/solr/guide/6_6/distributed-requests.html#DistributedRequests-ConfiguringstatsCache_DistributedIDF_ On Fri, Jan 4, 2019 at 2:40 PM AshB wrote: > Version Solr 7.4.0 zookeeper 3.4.11 Achitecture Two boxes > Machine-1,Machine-2 > holding single instances of solr > > We are having a collection which was single shard and single replica i.e > s=1 > and rf=1 > > Few days back we tried to add replica to it.But the score for same query is > coming different from different replicas. > > > http://Machine-1:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)=10=score=edismax=search_field+content=json > > "response":{"numFound":5836,"start":0,"maxScore":*4.418847*,"docs":[ > > whereas on another machine(replica) > > > http://Machine-2:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)=10=score=edismax=search_field+content=json > > "response":{"numFound":5836,"start":0,"maxScore":*4.4952264*,"docs":[ > > The maxScore is different. > > Relevancy gets affected due to sharding but replication was not expected as > same documents get copied to other node. score explaination gives issue > with > docCount and docFreq uneven. > > idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) > from: > 1.050635000 docCount :*10020.0* docFreq :*3504.000* > > idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) > from: > 1.068795100 > > docCount :*10291.0* docFreq :*3534.000* > > Is this expected?What could be wrong here?Please suggest > > > > -- > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html > -- Sincerely yours Mikhail Khludnev
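The exact-stats option referenced above is a single element in solrconfig.xml (the class name is the one documented in the linked guide; it adds an extra stats-gathering phase to distributed queries, so expect some added latency):

```xml
<!-- computes exact global document frequencies across replicas/shards -->
<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
```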
Re: So Many Zookeeper Warnings--There Must Be a Problem
How brave are you? ;) I'll defer to Scott on the internals of ZK and why it might be necessary to delete the ZK data dirs, but what happens if you just correct your configuration and drive on?

If that doesn't work, here's something to try. Shut down your Solr instances, then:
- bin/solr zk cp -r zk:/ some_local_dir
- fix your ZK, perhaps blowing the data directories away, and bring the ZK servers back up.
- bin/solr zk cp -r some_local_dir zk:/
Start your Solr instances.

NOTE: if you've configured your Solr info with a "chroot", the ZK path will be slightly different.

NOTE: I'm going from memory on the exact form of those commands. bin/solr -help should show you the info.

WARNING: This worked at some point in the past, but is _not_ "officially" supported; it was just a happy consequence of code to copy data from ZK and back that replaced the zkCli functionality, creating one less thing for Solr users to have to keep track of.

What that does is copy the cluster state relevant to Solr out of ZK and then back into ZK. DO NOT change your Solr data in any way when doing this. What this is trying to do is copy all the topology information in ZK. Assuming the Solr nodes haven't changed, have the same IP addresses etc., it _might_ work for you.

Best,
Erick

On Fri, Jan 4, 2019 at 4:25 AM Joe Lerner wrote:
>
> wrt, "You'll probably have to delete the contents of the zk data directory
> and rebuild your collections."
>
> Rebuild my *SOLR* collections? That's easy enough for us.
>
> If this is how we're incorrectly configured now:
>
> server #1 = myid#1
> server #2 = myid#2
> server #3 = myid#2
>
> My plan would be to do the following, while users are still online (it's a
> big [bad] deal if we need to take search offline):
>
> 1. Take zk #3 down.
> 2. Fix zk #3 by deleting the contents of the zk data directory and assign it myid#3
> 3. Bring zk#3 back up
> 4. Do a full re-build of all collections
>
> Thanks!
>
> Joe
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: So Many Zookeeper Warnings--There Must Be a Problem
On 1/4/2019 5:24 AM, Joe Lerner wrote: server #1 = myid#1 server #2 = myid#2 server #3 = myid#2 My plan would be to do the following, while users are still online (it's a big [bad] deal if we need to take search offline): 1. Take zk #3 down. 2. Fix zk #3 by deleting the contents of the zk data directory and assign it myid#3 3. Bring zk#3 back up 4. Do a full re-build of all collections There should be no need to rebuild anything in Solr once zookeeper is repaired in this fashion. The third zookeeper will replicate data from whichever of the other two has won the leader election. A three-node zookeeper ensemble is 100% functional with two nodes running. You would only need to rebuild the Solr side if all data on the zookeeper side were lost. I would not expect this action to lose any data in zookeeper. The info you tried to share about your log messages in the original post for this thread did not come through. I do not see it either on the mailing list or in the Nabble mirror. It does look like you started another thread which does have the info. I will address those messages in that thread. Thanks, Shawn
Re: SOLR v7 Security Issues Caused Denial of Use - Sonatype Application Composition Report
The most important feature of any software running today is that it can be run at all. Security vulnerabilities can preclude software from running in enterprise environments. Today software must be free of critical and severe security vulnerabilities or it can't be run at all under Information Security policies. Enterprises today run security scan software to check for security and licensing vulnerabilities, because today most organizations are using open source software, where this has become most relevant. Forrester has a good summary on the need for software composition analysis tools, which virtually all enterprises run today before allowing software to run in production environments:

https://www.blackducksoftware.com/sites/default/files/images/Downloads/Reports/USA/ForresterWave-Rpt.pdf

Solr version 6.5 passes security scans showing no critical security issues. Solr version 7 fails security scans with over a dozen critical and severe security vulnerabilities from version 7.1 on. Then we ran scans against the latest Solr version 7.6, which failed as well. Most of the issues are due to using old libraries, including the Jackson JSON framework, dom4j, and Xerces, and should be easy to bring up to date. Only the latest version of SimpleXML has severe security vulnerabilities. Derby leads the most severe security violations at Level 9.1 by using an out of date version.

What good is software or any features if enterprises can't run them? Today software cybersecurity is a top priority and risk for enterprises. Solr version 6.5 is very old, exposing the zookeeper backend from the SolrJ client, which is a differentiating capability.

Is security and remediation a priority for SolrJ? I believe this should be a top feature to allow SolrJ to continue providing search features to enterprises, along with a security roadmap and plan to keep Solr secure and usable by continually adapting and improving in the ever changing security landscape and ecosystem. The Derby vulnerability CVE-2015-1832 was a passing medium Level 6.2 issue in CVSS 2.0 last year but is the most critical issue with Solr 7.6 at Level 9.1 in this year's CVSS 3.0. These changes need to be tracked, and updates and fixes incorporated into new Solr versions.
https://nvd.nist.gov/vuln/detail/CVE-2015-1832

On Thu, Jan 3, 2019 at 12:19 PM Bob Hathaway wrote:

> Critical and Severe security vulnerabilities against Solr v7.1. Many of
> these appear to be from old open source framework versions.
> *9* CVE-2017-7525 com.fasterxml.jackson.core : jackson-databind : 2.5.4 Open
> CVE-2016-1000031 commons-fileupload : commons-fileupload : 1.3.2 Open
> CVE-2015-1832 org.apache.derby : derby : 10.9.1.0 Open
> CVE-2017-7525 org.codehaus.jackson : jackson-mapper-asl : 1.9.13 Open
> CVE-2017-7657 org.eclipse.jetty : jetty-http : 9.3.20.v20170531 Open
> CVE-2017-7658 org.eclipse.jetty : jetty-http : 9.3.20.v20170531 Open
> CVE-2017-1000190 org.simpleframework : simple-xml : 2.7.1 Open
>
> *7* sonatype-2016-0397 com.fasterxml.jackson.core : jackson-core : 2.5.4 Open
> sonatype-2017-0355 com.fasterxml.jackson.core : jackson-core : 2.5.4 Open
> CVE-2014-0114 commons-beanutils : commons-beanutils : 1.8.3 Open
> CVE-2018-1000632 dom4j : dom4j : 1.6.1 Open
> CVE-2018-8009 org.apache.hadoop : hadoop-common : 2.7.4 Open
> CVE-2017-12626 org.apache.poi : poi : 3.17-beta1 Open
> CVE-2017-12626 org.apache.poi : poi-scratchpad : 3.17-beta1 Open
> CVE-2018-1308 org.apache.solr : solr-dataimporthandler : 7.1.0 Open
> CVE-2016-4434 org.apache.tika : tika-core : 1.16 Open
> CVE-2018-11761 org.apache.tika : tika-core : 1.16 Open
> CVE-2016-1000338 org.bouncycastle : bcprov-jdk15 : 1.45 Open
> CVE-2016-1000343 org.bouncycastle : bcprov-jdk15 : 1.45 Open
> CVE-2018-1000180 org.bouncycastle : bcprov-jdk15 : 1.45 Open
> CVE-2017-7656 org.eclipse.jetty : jetty-http : 9.3.20.v20170531 Open
> CVE-2012-0881 xerces : xercesImpl : 2.9.1 Open
> CVE-2013-4002 xerces : xercesImpl : 2.9.1 Open
>
> On Thu, Jan 3, 2019 at 12:15 PM Bob Hathaway wrote:
>
>> We want to use SOLR v7 but Sonatype scans past v6.5 show dozens of
>> critical and severe security issues and dozens of licensing issues. The
>> critical security violations using Sonatype are inline and are indexed with
>> codes from the National Vulnerability Database.
>>
>> Are there recommended steps for running Solr 7 in secure enterprises,
>> specifically infosec remediation of Sonatype Application Composition
>> Reports?
>>
>> Are there plans to make Solr more secure in v7 or v8?
>>
>> I'm new to the Solr User forum and suggestions are welcome.
>>
>> Sonatype Application Composition Reports
>> Of Solr - 7.6.0, Build Scanned On Thu Jan 03 2019 at 14:49:49
>> Using Scanner 1.56.0-01
>>
>> [image: image.png]
>>
>> [image: image.png]
>>
>> [image: image.png]
>>
>> Security Issues
>> Threat Level Problem Code Component Status
Re: Solr relevancy score different on replicated nodes
See particularly point 3 here and to a lesser extent point 2. https://support.lucidworks.com/s/question/0D5803LRpijCAD/the-number-of-results-returned-is-not-constant-every-time-i-query-solr For point two (the internal Lucene doc IDs are different) you can easily correct it by adding sort=score desc, solrId asc to the query. That article was written before TLOG and PULL replicas came into the picture. Since those replica types all have the exact same index structure you shouldn't have this problem in that case. Best, Erick On Fri, Jan 4, 2019 at 3:40 AM AshB wrote: > > Version Solr 7.4.0 zookeeper 3.4.11 Achitecture Two boxes Machine-1,Machine-2 > holding single instances of solr > > We are having a collection which was single shard and single replica i.e s=1 > and rf=1 > > Few days back we tried to add replica to it.But the score for same query is > coming different from different replicas. > > http://Machine-1:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)=10=score=edismax=search_field+content=json > > "response":{"numFound":5836,"start":0,"maxScore":*4.418847*,"docs":[ > > whereas on another machine(replica) > > http://Machine-2:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)=10=score=edismax=search_field+content=json > > "response":{"numFound":5836,"start":0,"maxScore":*4.4952264*,"docs":[ > > The maxScore is different. > > Relevancy gets affected due to sharding but replication was not expected as > same documents get copied to other node. score explaination gives issue with > docCount and docFreq uneven. > > idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from: > 1.050635000 docCount :*10020.0* docFreq :*3504.000* > > idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from: > 1.068795100 > > docCount :*10291.0* docFreq :*3534.000* > > Is this expected?What could be wrong here?Please suggest > > > > -- > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Continuous Zookeeper Client Warnings
Hi, We have a simple architecture: 2 SOLR Cloud servers (on servers #1 and #2), and 3 zookeeper instances (on servers #1, #2, and #3). Things appear to work fine but: We are getting *TONS* of continuous log warnings from our client applications. From one server it shows this:

[MYAPP-WEB] 2019-01-03 14:17:46,519 WARN [org.apache.solr.common.cloud.ConnectionManager] -
[MYAPP-WEB] 2019-01-03 14:17:46,519 WARN [org.apache.solr.common.cloud.ConnectionManager] -
[MYAPP-WEB] 2019-01-03 14:17:47,385 INFO [org.apache.zookeeper.ClientCnxn] -
[MYAPP-WEB] 2019-01-03 14:17:47,386 INFO [org.apache.zookeeper.ClientCnxn] -
[MYAPP-WEB] 2019-01-03 14:17:47,386 INFO [org.apache.zookeeper.ClientCnxn] -
[MYAPP-WEB] 2019-01-03 14:17:47,386 INFO [org.apache.solr.common.cloud.ConnectionManager] -
[MYAPP-WEB] 2019-01-03 14:17:47,386 WARN [org.apache.zookeeper.ClientCnxn] -
java.lang.NoClassDefFoundError: org/apache/zookeeper/proto/WatcherEvent
    at org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:770)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:94)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1144)
[MYAPP-WEB] 2019-01-03 14:17:47,487 WARN [org.apache.solr.common.cloud.ConnectionManager] -
[MYAPP-WEB] 2019-01-03 14:17:47,487 WARN [org.apache.solr.common.cloud.ConnectionManager] -
[MYAPP-WEB] 2019-01-03 14:17:47,943 INFO [org.apache.zookeeper.ClientCnxn] -
[MYAPP-WEB] 2019-01-03 14:17:47,943 INFO [org.apache.zookeeper.ClientCnxn] -
[MYAPP-WEB] 2019-01-03 14:17:47,943 INFO [org.apache.zookeeper.ClientCnxn] -
[MYAPP-WEB] 2019-01-03 14:17:47,944 INFO [org.apache.solr.common.cloud.ConnectionManager] -
[MYAPP-WEB] 2019-01-03 14:17:47,944 WARN [org.apache.zookeeper.ClientCnxn] -
java.lang.NoClassDefFoundError: org/apache/zookeeper/proto/WatcherEvent
    at org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:770)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:94)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1144)
[MYAPP-WEB] 2019-01-03 14:17:48,044 WARN [org.apache.solr.common.cloud.ConnectionManager] -
[MYAPP-WEB] 2019-01-03 14:17:48,044 WARN [org.apache.solr.common.cloud.ConnectionManager] -
[MYAPP-WEB] 2019-01-03 14:17:48,687 INFO [org.apache.zookeeper.ClientCnxn] -
[MYAPP-WEB] 2019-01-03 14:17:48,687 INFO [org.apache.zookeeper.ClientCnxn] -
[MYAPP-WEB] 2019-01-03 14:17:48,688 INFO [org.apache.zookeeper.ClientCnxn] -
[MYAPP-WEB] 2019-01-03 14:17:48,689 INFO [org.apache.solr.common.cloud.ConnectionManager] -
[MYAPP-WEB] 2019-01-03 14:17:48,689 WARN [org.apache.zookeeper.ClientCnxn] -
java.lang.NoClassDefFoundError: org/apache/zookeeper/proto/WatcherEvent
    at org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:770)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:94)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1144)

And from another server we get this:

[MYAPP-WEB] 2019-01-03 14:19:47,273 WARN [org.apache.zookeeper.ClientCnxn] -
java.lang.NoClassDefFoundError: org/apache/zookeeper/Login
    at org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslClient(ZooKeeperSaslClient.java:216)
    at org.apache.zookeeper.client.ZooKeeperSaslClient.<init>(ZooKeeperSaslClient.java:119)
    at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1011)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1063)
[MYAPP-WEB] 2019-01-03 14:19:47,753 WARN [org.apache.zookeeper.ClientCnxn] -
java.lang.NoClassDefFoundError: org/apache/zookeeper/Login
    at org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslClient(ZooKeeperSaslClient.java:216)
    at org.apache.zookeeper.client.ZooKeeperSaslClient.<init>(ZooKeeperSaslClient.java:119)
    at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1011)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1063)
[MYAPP-WEB] 2019-01-03 14:19:48,197 INFO [gov.fbi.guardian.web.filter.SentinelRedirectFilter] -
[MYAPP-WEB] 2019-01-03 14:19:48,450 WARN [org.apache.zookeeper.ClientCnxn] -
java.lang.NoClassDefFoundError: org/apache/zookeeper/Login
    at org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslClient(ZooKeeperSaslClient.java:216)
    at org.apache.zookeeper.client.ZooKeeperSaslClient.<init>(ZooKeeperSaslClient.java:119)
    at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1011)
    at
Re: So Many Zookeeper Warnings--There Must Be a Problem
wrt, "You'll probably have to delete the contents of the zk data directory and rebuild your collections." Rebuild my *SOLR* collections? That's easy enough for us. If this is how we're incorrectly configured now: server #1 = myid#1 server #2 = myid#2 server #3 = myid#2 My plan would be to do the following, while users are still online (it's a big [bad] deal if we need to take search offline): 1. Take zk #3 down. 2. Fix zk #3 by deleting the contents of the zk data directory and assign it myid#3 3. Bring zk#3 back up 4. Do a full re-build of all collections Thanks! Joe -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Solr relevancy score different on replicated nodes
Version: Solr 7.4.0, zookeeper 3.4.11. Architecture: two boxes, Machine-1 and Machine-2, each holding a single instance of solr.

We have a collection which was single shard and single replica, i.e. s=1 and rf=1.

A few days back we tried to add a replica to it. But the score for the same query comes out different from the different replicas.

http://Machine-1:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)=10=score=edismax=search_field+content=json

"response":{"numFound":5836,"start":0,"maxScore":*4.418847*,"docs":[

whereas on another machine (replica):

http://Machine-2:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)=10=score=edismax=search_field+content=json

"response":{"numFound":5836,"start":0,"maxScore":*4.4952264*,"docs":[

The maxScore is different.

Relevancy gets affected due to sharding, but this was not expected for replication, as the same documents get copied to the other node. The score explanation shows the issue: docCount and docFreq are uneven.

idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
1.050635000 docCount: *10020.0* docFreq: *3504.000*

idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
1.068795100 docCount: *10291.0* docFreq: *3534.000*

Is this expected? What could be wrong here? Please suggest.

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
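For what it's worth, plugging the quoted per-replica statistics into the stated idf formula (Lucene's BM25 idf, natural log) reproduces both numbers exactly, so the diverging docCount/docFreq fully accounts for the maxScore difference:

\[
\ln\!\left(1 + \frac{10020 - 3504 + 0.5}{3504 + 0.5}\right) \approx 1.0506,
\qquad
\ln\!\left(1 + \frac{10291 - 3534 + 0.5}{3534 + 0.5}\right) \approx 1.0688
\]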
Inconsistent debugQuery score with multiplicative boost
Hi! When debugging a query using multiplicative boost based on the product() function, I noticed that the score computed in the explain section is correct while the score in the actual result is wrong.

As an example, here's a simple query that boosts a field name_text_de (containing German product names). The term “Netzteil” boosts to 200% and “Sony” boosts to 300%. A name that contains both terms would be boosted to 600%. If a term does not match, a default pseudo boost of 1 is used (multiplicative identity).

The params of the responseHeader in the query result are:

"q":"{!boost b=$ymb}(+{!lucene v=$yq})",
"ymb":"product(query({!v=\"name_text_de\\:Netzteil\\^=2.0\"},1),query({!v=\"name_text_de\\:Sony\\^=3.0\"},1))",
"yq":"*:*",

The parsed query of the ymb parameter translates to:

FunctionScoreQuery(FunctionScoreQuery(+*:*, scored by boost(product(query((ConstantScore(name_text_de:netzteil))^2.0,def=1.0),query((ConstantScore(name_text_de:sony))^3.0,def=1.0)

For a product that contains both terms, the score in the result and explain section correctly yields 6.0:

"name_text_de":"Original Sony Vaio Netzteil",
"score":6.0,

6.0 = product of:
  1.0 = boost
  6.0 = product of:
    1.0 = *:*
    6.0 = product(query((ConstantScore(name_text_de:netzteil))^2.0,def=1.0)=2.0,query((ConstantScore(name_text_de:sony))^3.0,def=1.0)=3.0)

However, for a product with only “Netzteil” in the name, the result score is wrongly 1.0 while the explain score is correctly 2.0:

"name_text_de":"GS-Netzteil 20W schwarz",
"score":1.0,

2.0 = product of:
  1.0 = boost
  2.0 = product of:
    1.0 = *:*
    2.0 = product(query((ConstantScore(name_text_de:netzteil))^2.0,def=1.0)=2.0,query((ConstantScore(name_text_de:sony))^3.0,def=1.0)=1.0)

(Note: the filter chain splits words on hyphen, so the “GS-“ in front of the “Netzteil” should not be an issue.) Here's the complete filter chain for the text_de field type:

Interestingly, if I simplify the query to only boost on “Netzteil”, the scores in both the result and explain section are correctly 2.0. I reproduced this with a local Solr 7.5.0 server (no sharding, no replicas) on Mac OS X 10.14.1. I found mention of a somewhat similar situation with BooleanQuery, which was considered a bug and fixed in 2016: https://issues.apache.org/jira/browse/LUCENE-7132

So my questions are:
1. Is there something wrong in my query that prevents the “Netzteil”-only product from getting a score of 2.0?
2. Shouldn't the score in the result and the explain section always be the same?

Best regards,
Thomas
RE: [solr-solrcloud] How does DIH work when there are multiple nodes?
Hi, The reader was looking for a way to do 'DIH' automatically. The reason was for HA configuration. Thank you for the answer. If you know how, please reply.

-----Original Message-----
From: Doss
Sent: Friday, January 04, 2019 3:59 PM
To: solr-user@lucene.apache.org
Subject: RE: [solr-solrcloud] How does DIH work when there are multiple nodes?

Hi, The data import process will not happen automatically; we have to trigger it manually through the admin interface or by calling the URL. See:
https://lucene.apache.org/solr/guide/7_5/uploading-structured-data-store-data-with-the-data-import-handler.html

Full Import: http://node1ip:8983/solr/yourindexname/dataimport?command=full-import
Delta Import: http://node1ip:8983/solr/yourindexname/dataimport?command=delta-import

If you want to do the delta import automatically, you can set up a cron job (Linux) which calls the URL periodically.

Best,
Doss.