Re: ApacheCon at Home 2020 starts tomorrow!

2020-09-29 Thread Rahul Goswami
Thanks for sharing this Anshum. Day 1 had some really interesting sessions.
Missed out on a couple that I would have liked to listen to. Are the
recordings of these sessions available anywhere?

-Rahul

On Mon, Sep 28, 2020 at 7:08 PM Anshum Gupta  wrote:

> Hey everyone!
>
> ApacheCon at Home 2020 starts tomorrow. The event is 100% virtual, and free
> to register. What’s even better is that this year we have reintroduced the
> Lucene/Solr/Search track at ApacheCon.
>
> With 2 full days of sessions covering various Lucene, Solr, and Search topics, I
> hope you are able to find some time to attend the sessions and learn
> something new and interesting.
>
> There are also various other tracks that span the 3 days of the conference.
> The conference starts in just a few hours for our community in Asia and
> tomorrow morning for the Americas and Europe. Check out the complete
> schedule in the link below.
>
> Here are a few resources you may find useful if you plan to attend
> ApacheCon at Home.
>
> ApacheCon website - https://www.apachecon.com/acna2020/index.html
> Registration - https://hopin.to/events/apachecon-home
> Slack - http://s.apache.org/apachecon-slack
> Search Track - https://www.apachecon.com/acah2020/tracks/search.html
>
> See you at ApacheCon.
>
> --
> Anshum Gupta
>


Re: advice on whether to use stopwords for use case

2020-09-29 Thread Alexandre Rafalovitch
I am not sure why you think stop words are your first choice. Maybe I
misunderstand the question. I read it as saying that you need to
completely exclude a set of documents that include specific keywords
when called from a specific module.

If I wanted to differentiate the searches from a specific module, I
would give that module a different end-point (request handler)
instead of /select. So, /nocigs or whatever.

Then, in that end-point, you could do all sorts of extra things, such
as setting appends or even invariants parameters, which could include
a filter query to exclude any documents matching the specific
keywords. I assume it is ok to return documents that match for other
reasons.

Ideally, you would mark the cigs documents during indexing with a
binary or enumeration flag, and then during search you just need to
check against that flag. In that case, you could copyField your text
and run it against something like
https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
combined with shingles for multi-word terms. Or similar. And make it
index-only so that the result is basically a yes/no flag. A similar
thing could be done with an UpdateRequestProcessor pipeline if you
want to end up with a true boolean flag. The idea is the same: have
an index-only flag that you force a check against for any request
from the specific module.
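A sketch of that index-only flag field in the schema; the field names (product_text, cigs_text) and keyword file (cigwords.txt) are hypothetical:

```xml
<!-- Hypothetical analysis chain: shingle the tokens so multi-word
     phrases become single tokens ("electronic cigar"), then keep
     only the cigarette-related entries listed in cigwords.txt.
     A match on cigs_text then means "document mentions cigs". -->
<fieldType name="cigs_match" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            maxShingleSize="3" outputUnigrams="true"/>
    <filter class="solr.KeepWordFilterFactory" words="cigwords.txt"
            ignoreCase="true"/>
  </analyzer>
</fieldType>

<field name="cigs_text" type="cigs_match" indexed="true" stored="false"/>
<copyField source="product_text" dest="cigs_text"/>
```

The /nocigs-style handler could then use fq=-cigs_text:* (or a flag set by an UpdateRequestProcessor) instead of stopword lists.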

Or even with something like ElevationSearchComponent. Same idea.

Hope this helps.

Regards,
   Alex.

On Tue, 29 Sep 2020 at 22:28, Derek Poh  wrote:
>
> Hi
>
> I have read in the mailing list that we should try to avoid using stop
> words.
>
> I have a use case where I would like to know whether there are
> alternative solutions besides using stop words.
>
> There is a business requirement to return zero results when the search
> contains cigarette-related words and comes from a particular module on
> our site. It does not apply to all searches from our site.
> There is a list of these cigarette-related words. The list contains
> single words, multiple words (electronic cigar), and multiple words
> with punctuation (e-cigarette case).
> I am planning to copy to a different set of search fields, which will
> include the stopword filter at the index and query stages, for this
> module to use.
>
> For this use case, other than using stop words to handle it, is there
> any alternative solution?
>
> Derek
>
> --
> CONFIDENTIALITY NOTICE
>
> This e-mail (including any attachments) may contain confidential and/or 
> privileged information. If you are not the intended recipient or have 
> received this e-mail in error, please inform the sender immediately and 
> delete this e-mail (including any attachments) from your computer, and you 
> must not use, disclose to anyone else or copy this e-mail (including any 
> attachments), whether in whole or in part.
>
> This e-mail and any reply to it may be monitored for security, legal, 
> regulatory compliance and/or other appropriate reasons.


advice on whether to use stopwords for use case

2020-09-29 Thread Derek Poh

Hi

I have read in the mailing list that we should try to avoid using stop
words.

I have a use case where I would like to know whether there are
alternative solutions besides using stop words.

There is a business requirement to return zero results when the search
contains cigarette-related words and comes from a particular module on
our site. It does not apply to all searches from our site.
There is a list of these cigarette-related words. The list contains
single words, multiple words (electronic cigar), and multiple words
with punctuation (e-cigarette case).
I am planning to copy to a different set of search fields, which will
include the stopword filter at the index and query stages, for this
module to use.


For this use case, other than using stop words to handle it, is there 
any alternative solution?


Derek


Re: Slow Solr 8 response for long query

2020-09-29 Thread Alexandre Rafalovitch
What do the debug versions of the query show between the two versions?

One thing that changed is the sow (split-on-whitespace) parameter,
among many. It is unlikely to be the cause, but I am mentioning it
just in case.
https://lucene.apache.org/solr/guide/8_6/the-standard-query-parser.html#standard-query-parser-parameters
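For example, appending the debug parameter to the same request on both versions exposes the parsed query and per-component timings (the core name here is a placeholder, and the q value stays whatever the application sends):

```
/solr/mycore/select?q=...&debug=query     parsed/rewritten query structure
/solr/mycore/select?q=...&debug=timing    per-component prepare/process timings
```

Diffing the parsedquery sections between Solr 6 and Solr 8 usually shows whether the 20KB boolean expression is being rewritten differently.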

Regards,
   Alex

On Tue, 29 Sep 2020 at 20:47, Permakoff, Vadim
 wrote:
>
> Hi Solr Experts!
> We are moving from Solr 6.5.1 to Solr 8.5.0 and having a problem with a long
> query, which has search text plus many OR and AND conditions (all in one
> place; the query is about 20KB long).
> For the same set of data (about 500K docs) and the same schema, the query in
> Solr 6 returns results in less than 2 sec; Solr 8 takes more than 10 sec to
> get 10 results. If I increase the number of rows to 300, Solr 6 takes
> about 10 sec and Solr 8 takes more than 1 min. The results are small, just
> IDs. It looks like the relevancy scoring plays a role, because if I move this
> query to a filter query, both Solr versions work pretty fast.
> The right fix would be to change the query, but unfortunately it is
> difficult to modify the application that creates these queries, so I want to
> find a temporary workaround.
>
> What changed from Solr 6 to Solr 8 in terms of scoring with many
> conditions that affects search speed negatively?
> Is there anything to configure in Solr 8 to get the same performance for such
> queries as in Solr 6?
>
> Thank you,
> Vadim
>
> 
>
> This email is intended solely for the recipient. It may contain privileged, 
> proprietary or confidential information or material. If you are not the 
> intended recipient, please delete this email and any attachments and notify 
> the sender of the error.


Slow Solr 8 response for long query

2020-09-29 Thread Permakoff, Vadim
Hi Solr Experts!
We are moving from Solr 6.5.1 to Solr 8.5.0 and having a problem with a long
query, which has search text plus many OR and AND conditions (all in one
place; the query is about 20KB long).
For the same set of data (about 500K docs) and the same schema, the query in
Solr 6 returns results in less than 2 sec; Solr 8 takes more than 10 sec to get
10 results. If I increase the number of rows to 300, Solr 6 takes about
10 sec and Solr 8 takes more than 1 min. The results are small, just IDs. It
looks like the relevancy scoring plays a role, because if I move this query to
a filter query, both Solr versions work pretty fast.
The right fix would be to change the query, but unfortunately it is difficult
to modify the application that creates these queries, so I want to find a
temporary workaround.

What changed from Solr 6 to Solr 8 in terms of scoring with many
conditions that affects search speed negatively?
Is there anything to configure in Solr 8 to get the same performance for such
queries as in Solr 6?

Thank you,
Vadim





Re: How to Resolve : "The request took too long to iterate over doc values"?

2020-09-29 Thread raj.yadav
Hey Erick,

In cases for which we are getting this warning, I'm not able to extract the
exact Solr query. Instead, the logger logs the `parsedquery` for such cases.
Here is one example:


2020-09-29 13:09:41.279 WARN  (qtp926837661-82461) [c:mycollection
s:shard1_0 r:core_node5 x:mycollection_shard1_0_replica_n3]
o.a.s.s.SolrIndexSearcher Query: [+FunctionScoreQuery(+*:*, scored by
boost(product(if(max(const(0),sub(float(my_doc_value_field1),const(50))),const(0.01),if(max(const(0),sub(float(my_doc_value_field2),const(29))),const(0.2),const(1))),sqrt(product(sum(const(1),float(my_doc_value_field3),float(my_doc_value_field4)),sqrt(sum(const(1),float(my_doc_value_field5
#BitSetDocTopFilter]; The request took too long to iterate over doc values.
Timeout: timeoutAt: 1635297585120522 (System.nanoTime(): 1635297690311384),
DocValues=org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$8@7df12bf1



As per my understanding, the query in the above case is `q=*:*`. Then there
is a boost function which uses a function query on my_doc_value_field*
(field type doc_value_field, i.e. having indexed=false and docValues=true) to
reorder matched docs. If docValues work efficiently for _function queries_,
why are these warnings appearing?


Also, we do use frange queries on doc_value_field (having indexed=false and
docValues=true). Example:

{!frange l=1.0}my_doc_value_field1


Erick Erickson wrote
> Let’s see the query. My bet is that you are _searching_ against the field
> and have indexed=false.
> 
> Searching against a docValues=true indexed=false field results in the
> equivalent of a “table scan” in the RDBMS world. You may use
> the docValues efficiently for _function queries_ to mimic some
> search behavior.
> 
> Best,
> Erick





--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Vulnerabilities in SOLR 8.6.2

2020-09-29 Thread Cassandra Targett
Solr follows the ASF policy for reporting vulnerabilities, described in this 
page on our website: https://lucene.apache.org/solr/security.html. This page 
also lists known vulnerabilities that have been addressed, with their 
mitigation steps.

Scanning tools are commonly full of false positives, so the community does not
accept unfiltered scanner output, such as a spreadsheet, as a vulnerability
report.

We attempt to maintain a list of known false positives (also linked from the 
website) at: 
https://cwiki.apache.org/confluence/display/SOLR/SolrSecurity#SolrSecurity-SolrandVulnerabilityScanningTools.
But in all honesty, such a list is really hard to keep up with. Exact versions
in your report may differ from what’s on the list, but usually the general 
conclusion that it’s not an exploitable issue remains. For example, our list 
notes a CVE for ‘dom4j-1.6.1.jar' is not an exploitable vulnerability because 
it is only used in tests. If a CVE comes out for ‘dom4j-1.7.3.jar’ (if such a 
version exists), the fact remains that the dependency is only used in tests and 
is still not exploitable in a production system.

If you do find a real vulnerability you are concerned about, ASF policy is for 
you to privately report it to the community so it can be addressed before 
hackers have a chance to attempt to exploit user systems. How to do that is 
also described in the Security page in our website linked above.

-Cassandra
On Sep 28, 2020, 2:07 PM -0500, Narayanan, Lakshmi 
, wrote:
> Hello Solr-User Support team
> We have installed the SOLR 8.6.2 package into docker container in our DEV 
> environment. Prior to using it, our security team scanned the docker image 
> using SysDig and found a lot of Critical/High/Medium vulnerabilities. The 
> full list is in the attached spreadsheet
>
> Scan Summary
> 30 STOPS, 190 WARNS, 188 Vulnerabilities
>
> Please advise or point us to how/where to get a package that has been patched 
> for the Critical/High/Medium vulnerabilities in the attached spreadsheet
> Your help will be gratefully received
>
>
> Lakshmi Narayanan
> Marsh & McLennan Companies
> 121 River Street, Hoboken,NJ-07030
> 201-284-3345
> M: 845-300-3809
> Email: lakshmi.naraya...@mmc.com
>
>
>
>
>
> **
> This e-mail, including any attachments that accompany it, may contain
> information that is confidential or privileged. This e-mail is
> intended solely for the use of the individual(s) to whom it was intended to be
> addressed. If you have received this e-mail and are not an intended recipient,
> any disclosure, distribution, copying or other use or
> retention of this email or information contained within it are prohibited.
> If you have received this email in error, please immediately
> reply to the sender via e-mail and also permanently
> delete all copies of the original message together with any of its attachments
> from your computer or device.
> **


Re: How to Resolve : "The request took too long to iterate over doc values"?

2020-09-29 Thread Erick Erickson
Let’s see the query. My bet is that you are _searching_ against the field and 
have indexed=false.

Searching against a docValues=true indexed=false field results in the
equivalent of a “table scan” in the RDBMS world. You may use
the docValues efficiently for _function queries_ to mimic some
search behavior.
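A hedged schema sketch of that distinction (field names and types are illustrative, not taken from the reporter's schema):

```xml
<!-- Illustrative only: a field that appears in q/fq clauses (including
     frange) should keep indexed="true"; docValues="true" alone serves
     function queries, sorting and faceting efficiently, but searching
     it degenerates into a per-document scan. -->
<field name="searched_and_boosted" type="pfloat"
       indexed="true" stored="false" docValues="true"/>

<!-- Fine when used only inside boost()/if()/max() function queries;
     slow if it ever appears as a search or filter clause: -->
<field name="boost_only" type="pfloat"
       indexed="false" stored="false" docValues="true"/>
```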

Best,
Erick

> On Sep 29, 2020, at 6:59 AM, raj.yadav  wrote:
> 
> In our index, we have a few fields defined with the `ExternalFileField` field
> type. We decided to use docValues for such fields. Here is the field type
> definition:
> 
> OLD => (ExternalFileField)
> <fieldType ... defVal="0.0" class="solr.ExternalFileField"/>
> 
> NEW => (docValues)
> <fieldType ... indexed="false" stored="false" docValues="true"
> useDocValuesAsStored="false"/>
> 
> After this modification we started getting the following `timeout warning`
> messages:
> 
> ```The request took too long to iterate over doc values. Timeout: timeoutAt:
> 1626463774823735 (System.nanoTime(): 1626463774836490),
> DocValues=org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$8@4efddff
> ```
> 
> Our system configuration:
> Each Solr Instance: 8 vcpus, 64 GiB memory
> JAVA Memory: 30GB
> Collection: 4 shards (each shard has approximately 12 million docs and index
> size of 12 GB) and each Solr instance has one replica of the shard. 
> 
> GC_TUNE="-XX:NewRatio=3 \
> -XX:SurvivorRatio=4 \
> -XX:PermSize=64m \
> -XX:MaxPermSize=64m \
> -XX:TargetSurvivorRatio=80 \
> -XX:MaxTenuringThreshold=9 \
> -XX:+UseConcMarkSweepGC \
> -XX:+UseParNewGC \
> -XX:+CMSClassUnloadingEnabled \
> -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
> -XX:+CMSScavengeBeforeRemark \
> -XX:PretenureSizeThreshold=64m \
> -XX:+UseCMSInitiatingOccupancyOnly \
> -XX:CMSInitiatingOccupancyFraction=50 \
> -XX:CMSMaxAbortablePrecleanTime=6000 \
> -XX:+CMSParallelRemarkEnabled \
> -XX:+ParallelRefProcEnabled"
> 
> 1. What does this warning message mean?
> 2. How can we resolve it?
> 
> 
> 
> 



How to Resolve : "The request took too long to iterate over doc values"?

2020-09-29 Thread raj.yadav
In our index, we have a few fields defined with the `ExternalFileField` field
type. We decided to use docValues for such fields. Here is the field type
definition:

OLD => (ExternalFileField)
<fieldType ... defVal="0.0" class="solr.ExternalFileField"/>

NEW => (docValues)
<fieldType ... indexed="false" stored="false" docValues="true"
useDocValuesAsStored="false"/>

After this modification we started getting the following `timeout warning`
messages:

```The request took too long to iterate over doc values. Timeout: timeoutAt:
1626463774823735 (System.nanoTime(): 1626463774836490),
DocValues=org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$8@4efddff
```

Our system configuration:
Each Solr Instance: 8 vcpus, 64 GiB memory
JAVA Memory: 30GB
Collection: 4 shards (each shard has approximately 12 million docs and index
size of 12 GB) and each Solr instance has one replica of the shard. 

GC_TUNE="-XX:NewRatio=3 \
-XX:SurvivorRatio=4 \
-XX:PermSize=64m \
-XX:MaxPermSize=64m \
-XX:TargetSurvivorRatio=80 \
-XX:MaxTenuringThreshold=9 \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:+CMSClassUnloadingEnabled \
-XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
-XX:+CMSScavengeBeforeRemark \
-XX:PretenureSizeThreshold=64m \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSInitiatingOccupancyFraction=50 \
-XX:CMSMaxAbortablePrecleanTime=6000 \
-XX:+CMSParallelRemarkEnabled \
-XX:+ParallelRefProcEnabled"
 
1. What does this warning message mean?
2. How can we resolve it?






Solr Web UI

2020-09-29 Thread uyilmaz


Hello all,

Our Solr web UI (/solr/#/) doesn't show query results if the query takes longer
than, say, 3-4 seconds. When I look at the browser console, I see the request
getting cancelled. I went through the JavaScript code but didn't see a part
that cancels the request after a couple of seconds. Do you see this behavior
too? Is it intentional?

I usually use Postman for querying, so this is not a problem most of the time,
but I just wanted to see the streaming expression explanation diagrams.

Have a nice day~~

-- 
uyilmaz 


Re: SOLR Cursor Pagination Issue

2020-09-29 Thread vmakovsky
Hi Erick,

"You still haven't given an example of the results you're seeing that are
unexpected."

I will give an example of the data I received. Before starting the data update
I have:

solrCloud: Expected series criteria:386062
Collected series: 386062
Number of requests: 40
Collected unique series: 386062.
Similar results for nodes in solr cloud.
During the process of updating the series I have:
solrCloud: Expected series criteria:386062
Collected series: 445550
Number of requests: 124
Collected unique series: 386062.
First node:
Expected series criteria:386062
Collected series: 1442775
Number of requests: 146
Collected unique series: 386062.
Second node:
Expected series criteria:386062
Collected series: 242823
Number of requests: 26
Collected unique series: 242823.
After the completion of the data update, I get the same data as before the
update.

Best,
Vlad
 
Mon, 28 Sep 2020 10:51:01 -0400, Erick Erickson  
писал(а):


I said nothing about docId changing. _Any_ sort criteria changing is
an issue. You're sorting by score. Well, as you index documents, the
new docs change the values used to calculate scores for _all_
documents, thus changing the sort order and potentially causing
unexpected results when using cursorMark. That said, I don't think
you're getting any different scores at all if you're really
searching for "(* AND *)". Try returning score in the fl list: are
they different?


You still haven’t given an example of the results you’re seeing that 
are unexpected. And my assumption is that you are seeing odd results 
when you call this query again with a cursorMark returned by a 
previous call. Or are you saying that you don’t think facet.query is 
returning the correct count? Be aware that Solr doesn’t support true 
Boolean logic, see: 
https://lucidworks.com/post/why-not-and-or-and-not/


There’s special handling for the form "fq=NOT something” to change 
it to "fq=*:* NOT something” that’s not present in something like 
"q=NOT something”. How that plays in facet.query I’m not sure, but 
try “facet.query=*:* NOT something” if the facet count is what the 
problem is.


I have no idea what you're trying to accomplish with (* AND *)
unless those are just placeholders and you put real text in them.
That's rather odd. *:* is "select everything"...


BTW, returning 10,000 docs is somewhat of an anti-pattern; if you
really require that many documents, consider streaming.
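As a sketch of that alternative, a streaming expression over the /export handler avoids deep paging entirely. The collection and field names below are borrowed from the query earlier in the thread and may not match the real schema; note that /export requires docValues on the fl and sort fields:

```
search(mycollection,
       q="NOT SERIES_ID:0",
       fl="SERIES_ID",
       sort="SERIES_ID asc",
       qt="/export")
```

Sent to the /stream endpoint, this returns every matching SERIES_ID in one sorted stream, with no cursorMark bookkeeping on the client.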



On Sep 28, 2020, at 10:21 AM, vmakov...@xbsoftware.by wrote:

Hi, Erick

I have a python script that sends requests with CursorMark. This 
script checks data against the following Expected series criteria:

Collected series:
Number of requests:
Collected unique series:
The request looks like this: 
select?indent=off=edismax=json={!key=NUM_DOCS}NOT 
SERIES_ID:0=NOT 
SERIES_ID:0=true=true=true=-1=(* 
AND *)=all_text_stemming all_text=facet_db_code:( "CN" 
)=-SERIES_CODE:( "TEST" )=SERIES_ID=score desc,docId 
asc=SERIES_STATUS:T^5=KEY_SERIES_FLAG:1^5=accuracy_name:0=SERIES_STATUS:C^-30=1=*


DocId does not change during the data update. During the data updating
process in SolrCloud, the script returned an incorrect number of requests
and collected series.


Best,
Vlad


Mon, 28 Sep 2020 08:54:57 -0400, Erick Erickson 
 писал(а):


Define “incorrect” please. Also, showing the exact query you use 
would be helpful.
That said, indexing data at the same time you are using cursorMark
is not guaranteed to find all documents. Consider a sort with date
asc, id asc: doc53 has a date of 2001 and has already been returned.
Next, you update doc53 to 2020. It now appears sometime later in the
results due to the changed data. Or the other way around: doc53 starts
with 2020, and while your cursorMark position is in 2010, you change
doc53 to have a date of 2001. It will never be returned.
Similarly for anything else you change that’s relevant to the sort 
criteria you’re using.
CursorMark doesn’t remember _documents_, just, well, call it the 
fingerprint (i.e. sort criteria values) of the last document returned 
so far.
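That failure mode can be simulated without Solr. The toy model below (plain Python, not Solr code; the documents are invented) treats the cursor as just the (sort value, id) pair of the last returned document, exactly the "fingerprint" described above; moving a document's sort value behind the cursor makes it vanish from the stream:

```python
# Toy model of cursorMark: the cursor is the (sort_value, id) pair of the
# last document returned, not a snapshot of the result set.
docs = {1: 2005, 2: 2008, 53: 2020, 4: 2012}  # id -> date (the sort value)

def page(docs, cursor, rows=2):
    """Return the next `rows` docs strictly after `cursor` in (date, id) order."""
    ordered = sorted(docs.items(), key=lambda kv: (kv[1], kv[0]))
    after = [(i, d) for i, d in ordered if (d, i) > cursor]
    return after[:rows]

cursor = (float("-inf"), -1)          # cursorMark=* equivalent
seen = []

batch = page(docs, cursor)            # first page returns ids 1 and 2
seen += [i for i, _ in batch]
cursor = (batch[-1][1], batch[-1][0]) # cursor now points at (2008, 2)

docs[53] = 2001                       # doc 53 updated to sort *before* the cursor

batch = page(docs, cursor)            # second page: doc 53 is already "behind"
seen += [i for i, _ in batch]

print(seen)  # doc 53 never appears in the stream
```

Running it shows ids 1, 2 and 4 collected while 53 is silently skipped, which is the same class of discrepancy as the "Collected series" counts above.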

Best,
Erick

On Sep 28, 2020, at 3:32 AM, vmakov...@xbsoftware.by wrote:
Good afternoon,
Could you please suggest a solution: during the data updating process
in SolrCloud, requests with cursorMark return incorrect data. I
suppose the results do not follow each other during the indexing
process because the data doesn't have enough time to be replicated
between the nodes.

Kind regards,
Vladislav Makovski

Vladislav Makovski
Developer
XB Software Ltd. | Minsk, Belarus
Site: https://xbsoftware.com
Skype: vlad__makovski
Cell:  +37529 6484100






Re: Returning fields a specific order

2020-09-29 Thread Dominique Bejean
Hi,

If the data are in JSON format, you should use jq -S:
https://stackoverflow.com/a/38210345/5998915
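If jq is not available, the same key-sorted normalization can be done in Python before diffing; the sample documents below are placeholders for the stored fields of the two Solr responses:

```python
import json

def normalized(doc: dict) -> str:
    """Serialize with sorted keys and fixed indentation so two Solr
    documents can be compared line-by-line with a plain diff tool."""
    return json.dumps(doc, sort_keys=True, indent=2)

# Same content, different key order (as two Solr instances might return):
a = {"id": "1", "title": "foo", "score": 1.2}
b = {"score": 1.2, "title": "foo", "id": "1"}

# Key order in the source no longer matters after normalization.
print(normalized(a) == normalized(b))  # True
```

Writing `normalized(...)` output for each instance to a file and diffing the files sidesteps the fl-ordering limitation entirely.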

Regards

Dominique


Le lun. 28 sept. 2020 à 18:30, gnandre  a écrit :

> Hi,
>
> I have a use-case where I want to compare the stored field values of Solr
> documents from two different Solr instances. I can use a diff tool to
> compare them, but only if the fields are returned in a specific order in the
> response. I tried setting the fl param with all the fields specified in a
> particular order. However, the returned results do not follow the
> specific order given in the fl param. Is there any way to achieve this
> behavior in Solr?
>