Re: ApacheCon at Home 2020 starts tomorrow!
Thanks for sharing this Anshum. Day 1 had some really interesting sessions. Missed out on a couple that I would have liked to listen to. Are the recordings of these sessions available anywhere? -Rahul On Mon, Sep 28, 2020 at 7:08 PM Anshum Gupta wrote: > Hey everyone! > > ApacheCon at Home 2020 starts tomorrow. The event is 100% virtual, and free > to register. What’s even better is that this year we have reintroduced the > Lucene/Solr/Search track at ApacheCon. > > With 2 full days of sessions covering various Lucene, Solr, and Search topics, I > hope you are able to find some time to attend the sessions and learn > something new and interesting. > > There are also various other tracks that span the 3 days of the conference. > The conference starts in just a few hours for our community in Asia and > tomorrow morning for the Americas and Europe. Check out the complete > schedule in the link below. > > Here are a few resources you may find useful if you plan to attend > ApacheCon at Home. > > ApacheCon website - https://www.apachecon.com/acna2020/index.html > Registration - https://hopin.to/events/apachecon-home > Slack - http://s.apache.org/apachecon-slack > Search Track - https://www.apachecon.com/acah2020/tracks/search.html > > See you at ApacheCon. > > -- > Anshum Gupta >
Re: advice on whether to use stopwords for use case
I am not sure why you think stop words are your first choice. Maybe I misunderstand the question. I read it as saying that you need to completely exclude a set of documents that include specific keywords when called from a specific module. If I wanted to differentiate the searches from a specific module, I would give that module a different end-point (Request Query Handler) instead of /select. So, /nocigs or whatever. Then, in that end-point, you could do all sorts of extra things, such as setting appends or even invariants parameters, which would include a filter query to exclude any documents matching the specific keywords. I assume it is ok to return documents that match for other reasons. Ideally, you would mark the cigs documents during indexing with a binary or enumeration flag, and then during search you just need to check against that flag. In that case, you could copyField your text and run it against something like https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter combined with Shingles for multiwords. Or similar. And just make it index-only so that the result is basically a yes/no flag. A similar thing could be done with an UpdateRequestProcessor pipeline if you want to end up with a true boolean flag. The idea is the same: have an index-only flag that you force-check for any request from the specific module. Or even with something like the QueryElevationComponent. Same idea. Hope this helps. Regards, Alex. On Tue, 29 Sep 2020 at 22:28, Derek Poh wrote: > > Hi > > I have read in the mailing list that we should try to avoid using stop > words. > > I have a use case where I would like to know if there are other > alternative solutions besides using stop words. > > There is a business requirement to return zero results when the search is for > cigarette-related words and the search is coming from a particular > module on our site. It does not apply to all searches from our site. > There is a list of these cigarette related words. 
This list contains > single words, multiple words (Electronic cigar), and multiple words with > punctuation (e-cigarette case). > I am planning to copy to a different set of search fields, which will > include the stopword filter in the index and query stage, for this > module to use. > > For this use case, other than using stop words to handle it, is there > any alternative solution? > > Derek > > -- > CONFIDENTIALITY NOTICE > > This e-mail (including any attachments) may contain confidential and/or > privileged information. If you are not the intended recipient or have > received this e-mail in error, please inform the sender immediately and > delete this e-mail (including any attachments) from your computer, and you > must not use, disclose to anyone else or copy this e-mail (including any > attachments), whether in whole or in part. > > This e-mail and any reply to it may be monitored for security, legal, > regulatory compliance and/or other appropriate reasons.
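A minimal sketch of the index-time flagging idea from the reply above, in Python rather than a Solr analysis chain. The blocked-phrase list and the `is_cigs` flag name are hypothetical stand-ins for what the KeepWordFilter-plus-Shingles copyField (or an UpdateRequestProcessor) would produce:

```python
# Sketch of the index-time flagging idea: instead of stopwords, mark
# "cigs" documents with a boolean flag during indexing. The phrase list
# below is hypothetical; in Solr this would be a KeepWordFilter (plus a
# ShingleFilter for multi-word terms) on a copyField.

import re

BLOCKED_PHRASES = {"cigarette", "electronic cigar", "e-cigarette case"}

def normalize(text: str) -> str:
    # Lowercase and collapse punctuation/whitespace, roughly what a
    # standard tokenizer plus lowercase filter would do.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def is_cigs_doc(text: str) -> bool:
    """Return True if any blocked phrase occurs in the normalized text."""
    padded = " " + normalize(text) + " "
    return any(" " + normalize(p) + " " in padded for p in BLOCKED_PHRASES)
```

At index time this would be stored as a boolean field (say `is_cigs`), and the dedicated /nocigs handler would append `fq=-is_cigs:true` to every request, so only that module's searches exclude the documents.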
advice on whether to use stopwords for use case
Hi I have read in the mailing list that we should try to avoid using stop words. I have a use case where I would like to know if there are other alternative solutions besides using stop words. There is a business requirement to return zero results when the search is for cigarette-related words and the search is coming from a particular module on our site. It does not apply to all searches from our site. There is a list of these cigarette related words. This list contains single words, multiple words (Electronic cigar), and multiple words with punctuation (e-cigarette case). I am planning to copy to a different set of search fields, which will include the stopword filter in the index and query stage, for this module to use. For this use case, other than using stop words to handle it, is there any alternative solution? Derek
Re: Slow Solr 8 response for long query
What do the debug versions of the query show between the two versions? One thing that changed is the sow (split on whitespace) parameter, among many. It is unlikely to be the cause, but I am mentioning it just in case. https://lucene.apache.org/solr/guide/8_6/the-standard-query-parser.html#standard-query-parser-parameters Regards, Alex On Tue, 29 Sep 2020 at 20:47, Permakoff, Vadim wrote: > > Hi Solr Experts! > We are moving from Solr 6.5.1 to Solr 8.5.0 and having a problem with a long > query, which has a search text plus many OR and AND conditions (all in one > place, the query is about 20KB long). > For the same set of data (about 500K docs) and the same schema, the query in > Solr 6 returns results in less than 2 sec; Solr 8 takes more than 10 sec to > get 10 results. If I increase the number of rows to 300, in Solr 6 it takes > about 10 sec, in Solr 8 it takes more than 1 min. The results are small, just > IDs. It looks like the relevancy scoring plays a role, because if I move this > query to a filter query - both Solr versions work pretty fast. > The right way would be to change the query, but unfortunately it is > difficult to modify the application which creates these queries, so I want to > find some temporary workaround. > > What was changed from Solr 6 to Solr 8 in terms of scoring with many > conditions, which affects the search speed negatively? > Is there anything to configure in Solr 8 to get the same performance for such > a query like it was in Solr 6? > > Thank you, > Vadim > > > > This email is intended solely for the recipient. It may contain privileged, > proprietary or confidential information or material. If you are not the > intended recipient, please delete this email and any attachments and notify > the sender of the error.
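The debug comparison suggested above can be scripted. This is a sketch that builds /select URLs whose JSON responses can then be diffed on the parsedquery key; the host and collection names are placeholders, and the sow values shown reflect the pre-7.0 and post-7.0 defaults:

```python
# Build debug requests against both Solr versions so the parsed queries
# can be compared. Requires a running Solr to actually execute the URLs.

from urllib.parse import urlencode

def debug_query_url(base_url: str, query: str, sow: bool) -> str:
    """Build a /select URL that returns only query-parsing debug info."""
    params = {
        "q": query,
        "rows": 0,                # only the parsed query matters here
        "debug": "query",         # debug=query limits output to parsing info
        "sow": str(sow).lower(),  # split-on-whitespace default changed in 7.0
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"

url6 = debug_query_url("http://solr6:8983/solr/mycoll", "foo AND bar", sow=True)
url8 = debug_query_url("http://solr8:8983/solr/mycoll", "foo AND bar", sow=False)
# Diff the "parsedquery" values from the two JSON responses to spot
# structural differences (e.g. graph/span queries appearing with sow=false).
```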
Slow Solr 8 response for long query
Hi Solr Experts! We are moving from Solr 6.5.1 to Solr 8.5.0 and having a problem with a long query, which has a search text plus many OR and AND conditions (all in one place, the query is about 20KB long). For the same set of data (about 500K docs) and the same schema, the query in Solr 6 returns results in less than 2 sec; Solr 8 takes more than 10 sec to get 10 results. If I increase the number of rows to 300, in Solr 6 it takes about 10 sec, in Solr 8 it takes more than 1 min. The results are small, just IDs. It looks like the relevancy scoring plays a role, because if I move this query to a filter query - both Solr versions work pretty fast. The right way would be to change the query, but unfortunately it is difficult to modify the application which creates these queries, so I want to find some temporary workaround. What was changed from Solr 6 to Solr 8 in terms of scoring with many conditions, which affects the search speed negatively? Is there anything to configure in Solr 8 to get the same performance for such a query like it was in Solr 6? Thank you, Vadim
Re: How to Resolve : "The request took too long to iterate over doc values"?
Hey Erick, In cases for which we are getting this warning, I'm not able to extract the `exact solr query`. Instead, the logger is logging the `parsedquery` for such cases. Here is one example: 2020-09-29 13:09:41.279 WARN (qtp926837661-82461) [c:mycollection s:shard1_0 r:core_node5 x:mycollection_shard1_0_replica_n3] o.a.s.s.SolrIndexSearcher Query: [+FunctionScoreQuery(+*:*, scored by boost(product(if(max(const(0),sub(float(my_doc_value_field1),const(50))),const(0.01),if(max(const(0),sub(float(my_doc_value_field2),const(29))),const(0.2),const(1))),sqrt(product(sum(const(1),float(my_doc_value_field3),float(my_doc_value_field4)),sqrt(sum(const(1),float(my_doc_value_field5 #BitSetDocTopFilter]; The request took too long to iterate over doc values. Timeout: timeoutAt: 1635297585120522 (System.nanoTime(): 1635297690311384), DocValues=org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$8@7df12bf1 As per my understanding, the query in the above case is `q=*:*`. And then there is a boost function which uses a function query on my_doc_value_field* (field type doc_value_field, i.e. having indexed=false and docValues=true) to reorder the matched docs. If docValues work efficiently for _function queries_, then why are these warnings coming? Also, we do use frange queries on doc_value_field (having indexed=false and docValues=true). example: {!frange l=1.0}my_doc_value_field1 Erick Erickson wrote > Let’s see the query. My bet is that you are _searching_ against the field > and have indexed=false. > > Searching against a docValues=true indexed=false field results in the > equivalent of a “table scan” in the RDBMS world. You may use > the docValues efficiently for _function queries_ to mimic some > search behavior. > > Best, > Erick -- Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
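One thing worth trying for the frange case (an assumption on my part, not something stated in the thread): run the frange as a non-cached post-filter, so it only iterates doc values for documents that already matched the main query rather than every document in the index. In Solr, cache=false together with a cost of 100 or more turns an frange into a post-filter. A sketch of building such a filter-query string:

```python
# Build an frange filter query that Solr treats as a post-filter.
# The field name is taken from the example above.

def frange_postfilter(field: str, lower: float) -> str:
    # cache=false plus cost >= 100 makes frange a post-filter in Solr,
    # so it is applied only to docs matching the other query clauses.
    return f"{{!frange l={lower} cache=false cost=200}}{field}"

fq = frange_postfilter("my_doc_value_field1", 1.0)
```

This does not remove the docValues iteration entirely, but it can shrink the number of documents visited dramatically when the main query is selective.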
Re: Vulnerabilities in SOLR 8.6.2
Solr follows the ASF policy for reporting vulnerabilities, described on this page of our website: https://lucene.apache.org/solr/security.html. That page also lists known vulnerabilities that have been addressed, with their mitigation steps. Scanning tools are commonly full of false positives, so the community does not accept unfiltered scanner output such as a spreadsheet as a vulnerability report. We attempt to maintain a list of known false positives (also linked from the website) at: https://cwiki.apache.org/confluence/display/SOLR/SolrSecurity#SolrSecurity-SolrandVulnerabilityScanningTools. But in all honesty such a list is really hard to keep up with. Exact versions in your report may differ from what’s on the list, but usually the general conclusion that it’s not an exploitable issue remains. For example, our list notes a CVE for ‘dom4j-1.6.1.jar' is not an exploitable vulnerability because it is only used in tests. If a CVE comes out for ‘dom4j-1.7.3.jar’ (if such a version exists), the fact remains that the dependency is only used in tests and is still not exploitable in a production system. If you do find a real vulnerability you are concerned about, ASF policy is for you to privately report it to the community so it can be addressed before hackers have a chance to attempt to exploit user systems. How to do that is also described on the Security page of our website linked above. -Cassandra On Sep 28, 2020, 2:07 PM -0500, Narayanan, Lakshmi , wrote: > Hello Solr-User Support team > We have installed the SOLR 8.6.2 package into a docker container in our DEV > environment. Prior to using it, our security team scanned the docker image > using SysDig and found a lot of Critical/High/Medium vulnerabilities. 
The > full list is in the attached spreadsheet > > Scan Summary > 30 STOPS 190 WARNS 188 Vulnerabilities > > Please advise or point us to how/where to get a package that has been patched > for the Critical/High/Medium vulnerabilities in the attached spreadsheet > Your help will be gratefully received > > > Lakshmi Narayanan > Marsh & McLennan Companies > 121 River Street, Hoboken,NJ-07030 > 201-284-3345 > M: 845-300-3809 > Email: lakshmi.naraya...@mmc.com > > > > > > ** > This e-mail, including any attachments that accompany it, may contain > information that is confidential or privileged. This e-mail is > intended solely for the use of the individual(s) to whom it was intended to be > addressed. If you have received this e-mail and are not an intended recipient, > any disclosure, distribution, copying or other use or > retention of this email or information contained within it are prohibited. > If you have received this email in error, please immediately > reply to the sender via e-mail and also permanently > delete all copies of the original message together with any of its attachments > from your computer or device. > **
Re: How to Resolve : "The request took too long to iterate over doc values"?
Let’s see the query. My bet is that you are _searching_ against the field and have indexed=false. Searching against a docValues=true indexed=false field results in the equivalent of a “table scan” in the RDBMS world. You may use the docValues efficiently for _function queries_ to mimic some search behavior. Best, Erick > On Sep 29, 2020, at 6:59 AM, raj.yadav wrote: > > In our index, we have few fields defined as `ExternalFileField` field type. > We decided to use docValues for such fields. Here is the field type > definition > > OLD => (ExternalFileField) > defVal="0.0" class="solr.ExternalFileField"/> > > NEW => (docValues) > indexed="false" stored="false" docValues="true" > useDocValuesAsStored="false"/> > > After this modification we started getting the following `timeout warning` > messages: > > ```The request took too long to iterate over doc values. Timeout: timeoutAt: > 1626463774823735 (System.nanoTime(): 1626463774836490), > DocValues=org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$8@4efddff > ``` > > Our system configuration: > Each Solr Instance: 8 vcpus, 64 GiB memory > JAVA Memory: 30GB > Collection: 4 shards (each shard has approximately 12 million docs and index > size of 12 GB) and each Solr instance has one replica of the shard. > > GC_TUNE="-XX:NewRatio=3 \ > -XX:SurvivorRatio=4 \ > -XX:PermSize=64m \ > -XX:MaxPermSize=64m \ > -XX:TargetSurvivorRatio=80 \ > -XX:MaxTenuringThreshold=9 \ > -XX:+UseConcMarkSweepGC \ > -XX:+UseParNewGC \ > -XX:+CMSClassUnloadingEnabled \ > -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \ > -XX:+CMSScavengeBeforeRemark \ > -XX:PretenureSizeThreshold=64m \ > -XX:+UseCMSInitiatingOccupancyOnly \ > -XX:CMSInitiatingOccupancyFraction=50 \ > -XX:CMSMaxAbortablePrecleanTime=6000 \ > -XX:+CMSParallelRemarkEnabled \ > -XX:+ParallelRefProcEnabled" > > 1. What this warning message means? > 2. How to resolve it? > > > > > -- > Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
How to Resolve : "The request took too long to iterate over doc values"?
In our index, we have a few fields defined with the `ExternalFileField` field type. We decided to use docValues for such fields. Here is the field type definition OLD => (ExternalFileField) NEW => (docValues) After this modification we started getting the following `timeout warning` messages: ```The request took too long to iterate over doc values. Timeout: timeoutAt: 1626463774823735 (System.nanoTime(): 1626463774836490), DocValues=org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$8@4efddff ``` Our system configuration: Each Solr Instance: 8 vcpus, 64 GiB memory JAVA Memory: 30GB Collection: 4 shards (each shard has approximately 12 million docs and an index size of 12 GB) and each Solr instance has one replica of the shard. GC_TUNE="-XX:NewRatio=3 \ -XX:SurvivorRatio=4 \ -XX:PermSize=64m \ -XX:MaxPermSize=64m \ -XX:TargetSurvivorRatio=80 \ -XX:MaxTenuringThreshold=9 \ -XX:+UseConcMarkSweepGC \ -XX:+UseParNewGC \ -XX:+CMSClassUnloadingEnabled \ -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \ -XX:+CMSScavengeBeforeRemark \ -XX:PretenureSizeThreshold=64m \ -XX:+UseCMSInitiatingOccupancyOnly \ -XX:CMSInitiatingOccupancyFraction=50 \ -XX:CMSMaxAbortablePrecleanTime=6000 \ -XX:+CMSParallelRemarkEnabled \ -XX:+ParallelRefProcEnabled" 1. What does this warning message mean? 2. How can it be resolved?
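For context on the migration itself: ExternalFileField values live in an external_<fieldname>.txt file of docid=value lines and are reloaded at commit time, whereas a docValues field must actually be (re)indexed with the values. A hypothetical sketch of turning such a file into Solr atomic-update documents; the field and id names are placeholders:

```python
# Convert ExternalFileField data ("docid=value" lines) into Solr
# atomic-update documents that populate a docValues field.

import json

def external_file_to_updates(lines, field_name, id_field="id"):
    """Turn ExternalFileField lines into an atomic-update JSON payload."""
    docs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        doc_id, _, value = line.partition("=")
        # {"set": ...} is Solr's atomic-update syntax for replacing a value
        docs.append({id_field: doc_id, field_name: {"set": float(value)}})
    return json.dumps(docs)

payload = external_file_to_updates(["doc1=0.5", "doc2=1.25"],
                                   "my_doc_value_field1")
# POST the payload to /update?commit=true to populate the docValues field.
```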
Solr Web UI
Hello all, Our Solr web UI (/solr/#/) doesn't show query results if the query takes longer than, say, 3-4 seconds. When I look at the browser console, I see the request is getting cancelled. I went through the javascript code but didn't see a part that cancels the request after a couple of seconds. Do you see this behavior too? Is it intentional? I usually use Postman for querying, so this is not a problem most of the time, but I just wanted to see the streaming expression explanation diagrams. Have a nice day~~ -- uyilmaz
Re: SOLR Cursor Pagination Issue
Hi Erick, "You still haven’t given an example of the results you’re seeing that are unexpected". I will give an example of the data I received. Before starting the data update I have: solrCloud: Expected series criteria: 386062 Collected series: 386062 Number of requests: 40 Collected unique series: 386062. Similar results for the nodes in the solr cloud. During the process of updating the series I have: solrCloud: Expected series criteria: 386062 Collected series: 445550 Number of requests: 124 Collected unique series: 386062. First node: Expected series criteria: 386062 Collected series: 1442775 Number of requests: 146 Collected unique series: 386062. Second node: Expected series criteria: 386062 Collected series: 242823 Number of requests: 26 Collected unique series: 242823. After the completion of the data update, I get the data as before the update. Best, Vlad Mon, 28 Sep 2020 10:51:01 -0400, Erick Erickson wrote: I said nothing about docId changing. _any_ sort criteria changing is an issue. You’re sorting by score. Well, as you index documents, the new docs change the values used to calculate scores, so the scores for _all_ documents will change, thus changing the sort order and potentially causing unexpected results when using cursorMark. That said, I don’t think you’re getting any different scores at all if you’re really searching for “(* AND *)", try returning score in the fl list, are they different? You still haven’t given an example of the results you’re seeing that are unexpected. And my assumption is that you are seeing odd results when you call this query again with a cursorMark returned by a previous call. Or are you saying that you don’t think facet.query is returning the correct count? Be aware that Solr doesn’t support true Boolean logic, see: https://lucidworks.com/post/why-not-and-or-and-not/ There’s special handling for the form "fq=NOT something” to change it to "fq=*:* NOT something” that’s not present in something like "q=NOT something”. 
How that plays in facet.query I’m not sure, but try “facet.query=*:* NOT something” if the facet count is what the problem is. I have no idea what you’re trying to accomplish with (* AND *) unless those are just placeholders and you put real text in them. That’s rather odd. *:* is “select everything”... BTW, returning 10,000 docs is somewhat of an anti-pattern; if you really require that many documents, consider streaming. On Sep 28, 2020, at 10:21 AM, vmakov...@xbsoftware.by wrote: Hi, Erick I have a python script that sends requests with CursorMark. This script checks data against the following Expected series criteria: Collected series: Number of requests: Collected unique series: The request looks like this: select?indent=off&defType=edismax&wt=json&facet.query={!key=NUM_DOCS}NOT SERIES_ID:0&fq=NOT SERIES_ID:0&spellcheck=true&spellcheck.collate=true&spellcheck.extendedResults=true&facet.limit=-1&q=(* AND *)&qf=all_text_stemming all_text&fq=facet_db_code:( "CN" )&fq=-SERIES_CODE:( "TEST" )&fl=SERIES_ID&sort=score desc,docId asc&bq=SERIES_STATUS:T^5&bq=KEY_SERIES_FLAG:1^5&bq=accuracy_name:0&bq=SERIES_STATUS:C^-30&rows=1&cursorMark=* DocId does not change during data update. During the data updating process in solrCloud, the script returned an incorrect Number of requests and Collected series. Best, Vlad Mon, 28 Sep 2020 08:54:57 -0400, Erick Erickson wrote: Define “incorrect” please. Also, showing the exact query you use would be helpful. That said, indexing data at the same time you are using CursorMark is not guaranteed to find all documents. Consider a sort with date asc, id asc. doc53 has a date of 2001 and you’ve already been returned the doc. Next, you update doc53 to 2020. It now appears sometime later in the results due to the changed data. Or the other way, doc53 starts with 2020, and while your cursormark label is in 2010, you change doc53 to have a date of 2001. It will never be returned. 
Similarly for anything else you change that’s relevant to the sort criteria you’re using. CursorMark doesn’t remember _documents_, just, well, call it the fingerprint (i.e. sort criteria values) of the last document returned so far. Best, Erick On Sep 28, 2020, at 3:32 AM, vmakov...@xbsoftware.by wrote: Good afternoon, Could you please suggest us a solution: during the data updating process in solrCloud, requests with a cursor mark return incorrect data. I suppose that the results do not follow each other during the indexation process, because the data doesn't have enough time to be replicated between the nodes. Kind regards, Vladislav Makovski Vladislav Makovski Developer XB Software Ltd. | Minsk, Belarus Site: https://xbsoftware.com Skype: vlad__makovski Cell: +37529 6484100
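The moving-document effect described in this thread can be reproduced without a server. This sketch mimics the Python script's cursorMark loop with a stand-in fetch_page function: when a document shifts across the cursor mid-walk it is returned twice, inflating the raw count while the unique count stays correct, which matches the numbers reported above:

```python
# Deep-paging loop with client-side de-duplication. fetch_page is a
# stand-in for an HTTP call to /select with the cursorMark parameter.

def collect_all(fetch_page):
    """fetch_page(cursor) -> (docs, next_cursor); walk until cursor repeats."""
    seen, total, cursor = set(), 0, "*"
    while True:
        docs, next_cursor = fetch_page(cursor)
        total += len(docs)                         # raw count, can inflate
        seen.update(d["SERIES_ID"] for d in docs)  # unique count is stable
        if next_cursor == cursor:                  # Solr signals the end this way
            break
        cursor = next_cursor
    return total, len(seen)

# Fake two-page walk where doc "2" shifts across the cursor and is
# returned on both pages, as can happen when the index changes mid-walk:
pages = {"*":  ([{"SERIES_ID": "1"}, {"SERIES_ID": "2"}], "AA"),
         "AA": ([{"SERIES_ID": "2"}, {"SERIES_ID": "3"}], "BB"),
         "BB": ([], "BB")}
total, unique = collect_all(lambda c: pages[c])
# total == 4 but unique == 3, mirroring the reported mismatch.
```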
Re: Returning fields a specific order
Hi, If the data are in JSON format, you should use jq -S https://stackoverflow.com/a/38210345/5998915 Regards Dominique On Mon, Sep 28, 2020 at 18:30, gnandre wrote: > Hi, > > I have a use-case where I want to compare the stored field values of Solr > documents from two different Solr instances. I can use a diff tool to > compare them, but only if they return the fields in a specific order in the > response. I tried setting the fl param with all the fields specified in a > particular order. However, the results that are returned do not follow the > specific order given in the fl param. Is there any way to achieve this > behavior in Solr? >
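For completeness, the same normalization can be done in Python if jq is not available: json.dumps with sort_keys=True orders fields alphabetically, like jq -S, so a plain diff then compares field values rather than field order:

```python
# Normalize two Solr JSON documents so a textual diff ignores field order.

import json

def normalized(doc: dict) -> str:
    # sort_keys orders fields alphabetically, the same idea as jq -S
    return json.dumps(doc, sort_keys=True, indent=2)

doc_a = {"id": "1", "title": "foo", "price": 9.99}
doc_b = {"price": 9.99, "id": "1", "title": "foo"}
assert normalized(doc_a) == normalized(doc_b)  # same content, same output
```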