Re: Solr 7.7 - Few Questions

2020-10-01 Thread Rahul Goswami
Manisha,
In addition to what Shawn has mentioned above, I would also ask you to
reevaluate your use case. Do you *need to* index the whole document? E.g.
if it's an email, the body of the email *might* be more important than any
attachments, in which case you could choose to index only the email body
and ignore (or only partially index) the text from the attachments. If you
can afford to index the documents partially, you could consider Solr's
"Limit token count filter"; see the link below.

https://lucene.apache.org/solr/guide/7_7/filter-descriptions.html#limit-token-count-filter

You'll need to configure it in the schema, in the "index" analyzer of the
field type used by the field with large text.
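A minimal sketch of what that could look like in the schema (the field type
name, tokenizer, and maxTokenCount value below are placeholders to adapt,
not a drop-in config):

<fieldType name="text_limited" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stop indexing after the first 10,000 tokens of the field value -->
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10000"
            consumeAllTokens="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>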
Indexing documents on the order of half a GB will definitely come back to
hurt your operations, if not now then later (think OOM, extremely slow
atomic updates, long-running merges, etc.).

- Rahul



On Thu, Oct 1, 2020 at 7:06 PM Shawn Heisey  wrote:

> On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:
> > We are using Apache Solr 7.7 on the Windows platform. The data is synced to
> Solr using a Solr.Net commit. The data is being synced to SOLR in batches.
> The documents are very large (~0.5GB on average) and Solr indexing is taking
> a long time. The total document size is ~200GB. As the Solr commit is done as
> part of an API call, the API calls are failing because document indexing has
> not completed.
>
> A single document is five hundred megabytes?  What kind of documents do
> you have?  You can't even index something that big without tweaking
> configuration parameters that most people don't even know about.
> Assuming you can even get it working, there's no way that indexing a
> document like that is going to be fast.
>
> >    1.  What is your advice on syncing such a large volume of data to
> Solr KB.
>
> What is "KB"?  I have never heard of this in relation to Solr.
>
> >2.  Because of the search requirements, almost 8 fields are defined
> as Text fields.
>
> I can't figure out what you are trying to say with this statement.
>
> >    3.  Currently SOLR_JAVA_MEM is set to 2gb. Is that enough for such a
> large volume of data?
>
> If just one of the documents you're sending to Solr really is five
> hundred megabytes, then 2 gigabytes would probably be just barely enough
> to index one document into an empty index ... and it would probably be
> doing garbage collection so frequently that it would make things REALLY
> slow.  I have no way to predict how much heap you will need.  That will
> require experimentation.  I can tell you that 2GB is definitely not enough.
>
> >    4.  How to set up Solr in production on Windows? Currently it's set
> up as a standalone engine and the client is requested to take a backup of the
> drive. Is there any better way to do this? How to set up for disaster
> recovery?
>
> I would suggest NOT doing it on Windows.  My reasons for that come down
> to costs -- a Windows Server license isn't cheap.
>
> That said, there's nothing wrong with running on Windows, but you're on
> your own as far as running it as a service.  We only have a service
> installer for UNIX-type systems.  Most of the testing for that is done
> on Linux.
>
> >    5.  How to benchmark the system requirements for such a large data set
>
> I do not know what all your needs are, so I have no way to answer this.
> You're going to know a lot more about that than any of us do.
>
> Thanks,
> Shawn
>


Re: Authentication for each collection

2020-10-01 Thread Chris Hostetter


https://lucene.apache.org/solr/guide/8_6/authentication-and-authorization-plugins.html

*Authentication* is global, but *Authorization* can be configured to use 
rules that restrict permissions on a per collection basis...

https://lucene.apache.org/solr/guide/8_6/rule-based-authorization-plugin.html#permissions-2

In concrete terms, the specific example you asked about is supported:

: Example ; user1:password1 for collection A
:  user2:password2 for collection B

What would *NOT* be supported is to have a distinct set of users for each 
collection, such that there could be two different "user1" instances, each 
with its own password, where each "user1" had access to only one collection.
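As a rough sketch, a security.json along these lines should give each user 
access to a different collection (the role and permission names are made up, 
and the password hashes are placeholders):

{
  "authentication": {
    "class": "solr.BasicAuthPlugin",
    "credentials": {
      "user1": "<hash+salt for password1>",
      "user2": "<hash+salt for password2>"
    }
  },
  "authorization": {
    "class": "solr.RuleBasedAuthorizationPlugin",
    "permissions": [
      { "name": "collA-access", "collection": "collectionA", "role": "roleA" },
      { "name": "collB-access", "collection": "collectionB", "role": "roleB" }
    ],
    "user-role": {
      "user1": "roleA",
      "user2": "roleB"
    }
  }
}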



: Date: Thu, 1 Oct 2020 13:45:14 -0700
: From: sambasivarao giddaluri 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Authentication for each collection
: 
: Hi All,
: We have 2 collections, and we are using basic authentication against Solr,
: configured in security.json. Is it possible to configure it in such a way
: that we have different credentials for each collection? Please advise if
: there is any other approach I can look into.
: 
: Example ; user1:password1 for collection A
:  user2:password2 for collection B
: 

-Hoss
http://www.lucidworks.com/


Re: Non Deterministic Results from /admin/luke

2020-10-01 Thread Shawn Heisey

On 10/1/2020 4:24 AM, Nussbaum, Ronen wrote:

We are using the Luke API in order to get all dynamic field names from our 
collection:
/solr/collection/admin/luke?wt=csv&numTerms=0

This worked fine in 6.2.1 but it's non-deterministic in 8.6.1 -- it looks 
like it queries a single random shard.

I've tried using /solr/collection/select?q=*:*&wt=csv&rows=0 but it 
behaves the same.

Can it be configured to query all shards?
Is there another way to achieve this?


The Luke handler (usually at /admin/luke) is not SolrCloud aware.  It is 
designed to operate on a single core.  So if you send the request to the 
collection and not a specific core, Solr must forward the request to a 
core in order for you to get ANY result.  The core selection will be random.


The software called Luke (which is where the Luke handler gets its name) 
operates on a Lucene index -- each Solr core is based around a Lucene 
index.  It would be a LOT of work to make the handler SolrCloud aware.


Depending on how your collection is set up, you may need to query the 
Luke handler on multiple cores in order to get a full picture of all 
fields present in the Lucene indexes.  I am not aware of any other way 
to do it.
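For example (hosts and core names below are illustrative; the actual replica 
core names can be read from the CoreAdmin STATUS output or the Cloud UI):

curl 'http://host1:8983/solr/admin/cores?action=STATUS&wt=json'
curl 'http://host1:8983/solr/mycoll_shard1_replica_n1/admin/luke?numTerms=0'
curl 'http://host2:8983/solr/mycoll_shard2_replica_n4/admin/luke?numTerms=0'

The union of the field lists returned for each shard then gives the full 
picture mentioned above.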


Thanks,
Shawn


Re: Solr 7.7 - Few Questions

2020-10-01 Thread Shawn Heisey

On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:

We are using Apache Solr 7.7 on the Windows platform. The data is synced to 
Solr using a Solr.Net commit. The data is being synced to SOLR in batches. 
The documents are very large (~0.5GB on average) and Solr indexing is taking 
a long time. The total document size is ~200GB. As the Solr commit is done as 
part of an API call, the API calls are failing because document indexing has 
not completed.


A single document is five hundred megabytes?  What kind of documents do 
you have?  You can't even index something that big without tweaking 
configuration parameters that most people don't even know about. 
Assuming you can even get it working, there's no way that indexing a 
document like that is going to be fast.



   1.  What is your advice on syncing such a large volume of data to Solr KB.


What is "KB"?  I have never heard of this in relation to Solr.


   2.  Because of the search requirements, almost 8 fields are defined as Text 
fields.


I can't figure out what you are trying to say with this statement.


   3.  Currently SOLR_JAVA_MEM is set to 2gb. Is that enough for such a large 
volume of data?


If just one of the documents you're sending to Solr really is five 
hundred megabytes, then 2 gigabytes would probably be just barely enough 
to index one document into an empty index ... and it would probably be 
doing garbage collection so frequently that it would make things REALLY 
slow.  I have no way to predict how much heap you will need.  That will 
require experimentation.  I can tell you that 2GB is definitely not enough.
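For reference, on Windows the heap is set in solr.in.cmd; something along 
these lines (the 8g figure is only a starting point to experiment with, not 
a recommendation):

REM in solr.in.cmd
set SOLR_JAVA_MEM=-Xms8g -Xmx8g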



   4.  How to set up Solr in production on Windows? Currently it's set up as a 
standalone engine and the client is requested to take a backup of the drive. Is 
there any better way to do this? How to set up for disaster recovery?


I would suggest NOT doing it on Windows.  My reasons for that come down 
to costs -- a Windows Server license isn't cheap.


That said, there's nothing wrong with running on Windows, but you're on 
your own as far as running it as a service.  We only have a service 
installer for UNIX-type systems.  Most of the testing for that is done 
on Linux.



   5.  How to benchmark the system requirements for such a large data set


I do not know what all your needs are, so I have no way to answer this. 
You're going to know a lot more about that than any of us do.


Thanks,
Shawn


Re: Solr client in JavaScript

2020-10-01 Thread Shawn Heisey

On 10/1/2020 3:55 AM, Sunil Dash wrote:

This is my JavaScript code, from which I am calling Solr, which has a
loaded nutch core (index).
My JavaScript client (which runs on a Tomcat server) and the Solr
server are on the same machine (10.21.6.100). Maybe it is due to
cross-domain reference issues, or something is missing -- I don't know.
I expected the response from the Solr server (the search result) as a raw
JSON object. Kindly help me fix it. Thanks in advance.


As far as I can tell, your message doesn't tell us what the problem is. 
So I'm having a hard time coming up with a useful response.


If the problem is that the response isn't JSON, then either you need to 
tell Solr that you want JSON, or run a new enough version that the 
default response format *IS* JSON.  I do not recall in which version we 
changed the default from XML to JSON.
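For example, asking for JSON explicitly is just the wt parameter (the query 
value here is illustrative):

curl 'http://10.21.6.100:8983/solr/nutch/select?q=test&fl=content&wt=json'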


One thing you should be aware of ... if the javascript is running in the 
end user's browser, then the end user has direct access to your Solr 
install.  That is a bad idea.


Thanks,
Shawn


Authentication for each collection

2020-10-01 Thread sambasivarao giddaluri
Hi All,
We have 2 collections, and we are using basic authentication against Solr,
configured in security.json. Is it possible to configure it in such a way
that we have different credentials for each collection? Please advise if
there is any other approach I can look into.

Example ; user1:password1 for collection A
 user2:password2 for collection B


RE: advice on whether to use stopwords for use case

2020-10-01 Thread Markus Jelsma
Well, when not splitting on whitespace you can use the CharFilter for regex 
replacements [1] to clear the entire search string if a banned word is found 
anywhere in the string: 

.*(cigarette|tobacco).*

[1] 
https://lucene.apache.org/solr/guide/6_6/charfilterfactories.html#CharFilterFactories-solr.PatternReplaceCharFilterFactory
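A rough sketch of how that could sit in the query analyzer (the tokenizer is 
a placeholder, and the word list would come from the business side):

<analyzer type="query">
  <!-- blank out the whole query if it contains a banned word -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern=".*(cigarette|tobacco).*" replacement=""/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>

With the whole string replaced by nothing, the query analyzes to zero tokens 
and should match nothing.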
 
-Original message-
> From:Walter Underwood 
> Sent: Thursday 1st October 2020 18:20
> To: solr-user@lucene.apache.org
> Subject: Re: advice on whether to use stopwords for use case
> 
> I can’t think of an easy way to do this in Solr.
> 
> Do a bunch of string searches on the query on the client side. If any of them 
> match, 
> make a “no hits” result page.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> > On Sep 30, 2020, at 11:56 PM, Derek Poh  wrote:
> > 
> > Yes, the requirement (for now) is not to return any results. I think they 
> > may change the requirements, pending their return from the holidays.
> > 
> >> If so, then check for those words in the query before sending it to Solr.
> > That is what I think so too.
> > 
> > Thinking further: using stopwords for this, there will still be results 
> > returned when the number of words in the search keywords is more than the 
> > stopwords.
> > 
> > On 1/10/2020 2:57 am, Walter Underwood wrote:
> >> I’m not clear on the requirements. It sounds like the query “cigar” or 
> >> “cuban cigar”
> >> should return zero results. Is that right?
> >> 
> >> If so, then check for those words in the query before sending it to Solr.
> >> 
> >> But the stopwords approach seems like the requirement is different. Could 
> >> you give
> >> some examples?
> >> 
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org 
> >> http://observer.wunderwood.org/   (my 
> >> blog)
> >> 
> >>> On Sep 30, 2020, at 11:53 AM, Alexandre Rafalovitch  
> >>>  wrote:
> >>> 
> >>> You may also want to look at something like: 
> >>> https://docs.querqy.org/index.html 
> >>> 
> >>> ApacheCon had (is having..) a presentation on it that seemed quite
> >>> relevant to your needs. The videos should be live in a week or so.
> >>> 
> >>> Regards,
> >>>   Alex.
> >>> 
> >>> On Tue, 29 Sep 2020 at 22:56, Alexandre Rafalovitch  
> >>>  wrote:
>  I am not sure why you think stop words are your first choice. Maybe I
>  misunderstand the question. I read it as that you need to exclude
>  completely a set of documents that include specific keywords when
>  called from specific module.
>  
>  If I wanted to differentiate the searches from specific module, I
>  would give that module a different end-point (Request Query Handler),
>  instead of /select. So, /nocigs or whatever.
>  
>  Then, in that end-point, you could do all sorts of extra things, such
>  as setting appends or even invariants parameters, which would include
>  filter query to exclude any documents matching specific keywords. I
>  assume it is ok to return documents that are matching for other
>  reasons.
>  
>  Ideally, you would mark the cigs documents during indexing with a
>  binary or enumeration flag and then during search you just need to
>  check against that flag. In that case, you could copyField  your text
>  and run it against something like
>  https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
>   
>  
>  combined with Shingles for multiwords. Or similar. And just transform
>  it as index-only so that the result is basically a yes/no flag.
>  Similar thing could be done with UpdateRequestProcessor pipeline if
>  you want to end up with a true boolean flag. The idea is the same,
>  just to have an index-only flag that you force lock into for any
>  request from specific module.
>  
>  Or even with something like ElevationSearchComponent. Same idea.
>  
>  Hope this helps.
>  
>  Regards,
>    Alex.
>  
>  On Tue, 29 Sep 2020 at 22:28, Derek Poh  
>   wrote:
> > Hi
> > 
> > I have read in the mailings list that we should try to avoid using stop
> > words.
> > 
> > I have a use case where I would like to know if there is other
> > alternative solutions beside using stop words.
> > 
> > There is business requirement to return zero result when the search is
> > cigarette related words and the search is coming from a particular
> > module on our site. It does not apply to all searches from our site.
> > There is a list of these cigarette related words. This list contains
> > single word, multiple words (Electronic cigar), multiple words with
> 

Re: advice on whether to use stopwords for use case

2020-10-01 Thread Walter Underwood
I can’t think of an easy way to do this in Solr.

Do a bunch of string searches on the query on the client side. If any of them 
match, 
make a “no hits” result page.
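Something like this minimal sketch, assuming the client is Java (the word 
list and class name are made up for illustration):

import java.util.List;

public class BannedQueryCheck {
    // Words that must short-circuit to a "no hits" page; in practice
    // this list would come from the business team.
    private static final List<String> BANNED =
        List.of("cigarette", "e-cigarette", "tobacco", "cigar");

    // True if the raw query should never be sent to Solr.
    public static boolean isBanned(String rawQuery) {
        String q = rawQuery.toLowerCase();
        return BANNED.stream().anyMatch(q::contains);
    }
}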

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 30, 2020, at 11:56 PM, Derek Poh  wrote:
> 
> Yes, the requirement (for now) is not to return any results. I think they 
> may change the requirements, pending their return from the holidays.
> 
>> If so, then check for those words in the query before sending it to Solr.
> That is what I think so too.
> 
> Thinking further: using stopwords for this, there will still be results 
> returned when the number of words in the search keywords is more than the 
> stopwords.
> 
> On 1/10/2020 2:57 am, Walter Underwood wrote:
>> I’m not clear on the requirements. It sounds like the query “cigar” or 
>> “cuban cigar”
>> should return zero results. Is that right?
>> 
>> If so, then check for those words in the query before sending it to Solr.
>> 
>> But the stopwords approach seems like the requirement is different. Could 
>> you give
>> some examples?
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org 
>> http://observer.wunderwood.org/   (my blog)
>> 
>>> On Sep 30, 2020, at 11:53 AM, Alexandre Rafalovitch  
>>>  wrote:
>>> 
>>> You may also want to look at something like: 
>>> https://docs.querqy.org/index.html 
>>> 
>>> ApacheCon had (is having..) a presentation on it that seemed quite
>>> relevant to your needs. The videos should be live in a week or so.
>>> 
>>> Regards,
>>>   Alex.
>>> 
>>> On Tue, 29 Sep 2020 at 22:56, Alexandre Rafalovitch  
>>>  wrote:
 I am not sure why you think stop words are your first choice. Maybe I
 misunderstand the question. I read it as that you need to exclude
 completely a set of documents that include specific keywords when
 called from specific module.
 
 If I wanted to differentiate the searches from specific module, I
 would give that module a different end-point (Request Query Handler),
 instead of /select. So, /nocigs or whatever.
 
 Then, in that end-point, you could do all sorts of extra things, such
 as setting appends or even invariants parameters, which would include
 filter query to exclude any documents matching specific keywords. I
 assume it is ok to return documents that are matching for other
 reasons.
 
 Ideally, you would mark the cigs documents during indexing with a
 binary or enumeration flag and then during search you just need to
 check against that flag. In that case, you could copyField  your text
 and run it against something like
 https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
  
 
 combined with Shingles for multiwords. Or similar. And just transform
 it as index-only so that the result is basically a yes/no flag.
 Similar thing could be done with UpdateRequestProcessor pipeline if
 you want to end up with a true boolean flag. The idea is the same,
 just to have an index-only flag that you force lock into for any
 request from specific module.
 
 Or even with something like ElevationSearchComponent. Same idea.
 
 Hope this helps.
 
 Regards,
   Alex.
 
 On Tue, 29 Sep 2020 at 22:28, Derek Poh  
  wrote:
> Hi
> 
> I have read in the mailings list that we should try to avoid using stop
> words.
> 
> I have a use case where I would like to know if there is other
> alternative solutions beside using stop words.
> 
> There is business requirement to return zero result when the search is
> cigarette related words and the search is coming from a particular
> module on our site. It does not apply to all searches from our site.
> There is a list of these cigarette related words. This list contains
> single word, multiple words (Electronic cigar), multiple words with
> punctuation (e-cigarette case).
> I am planning to copy a different set of search fields, that will
> include the stopword filter in the index and query stage, for this
> module to use.
> 
> For this use case, other than using stop words to handle it, is there
> any alternative solution?
> 
> Derek
> 
> --
> CONFIDENTIALITY NOTICE
> 
> This e-mail (including any attachments) may contain confidential and/or 
> privileged information. If you are not the intended recipient or have 
> received this e-mail in error, please inform the sender immediately and 
> delete this e-mail (including any attachments) from your computer, and 
> you must not 

Using streaming expressions with shards filter

2020-10-01 Thread Gael Jourdan-Weil
Hello,

I am trying to use a Streaming Expression to query only a subset of the shards 
of a collection.
I expected to be able to use the "shards" parameter, as on a regular query on 
"/select" for instance, but this appears not to work, or I don't know how to 
do it.

Is this somehow a feature/restriction of Streaming Expressions?
Or am I missing something?

Note that the Streaming Expression I use is actually using the "/export" 
request handler.

Example of the streaming expression:
curl -X POST -v --data-urlencode 
'expr=search(myCollection,q="*:*",fl="id",sort="id asc",qt="/export")' 
'http://myserver/solr/myCollection/stream'
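And, for illustration, the kind of request I expected to work (the shard 
names are made up):

curl -X POST -v --data-urlencode \
  'expr=search(myCollection,q="*:*",fl="id",sort="id asc",qt="/export")' \
  'http://myserver/solr/myCollection/stream?shards=shard1,shard2'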

Solr version: 8.4

Best regards,
Gaël

Luke admin API returns inconsistent results

2020-10-01 Thread Raboah, Avi
Hi All,

We are using the Luke API in order to get all dynamic field names from our 
collection:
/solr/collection/admin/luke?wt=csv&numTerms=0

This worked fine in 6.2.1 but it's non-deterministic in 8.6.1 -- it looks 
like it queries a single random shard.

I've tried using /solr/collection/select?q=*:*&wt=csv&rows=0 but it 
behaves the same.

Can it be configured to query all shards?
Is there another way to achieve this?

Thanks a lot.

Have a great weekend.

Avi.



This electronic message may contain proprietary and confidential information of 
Verint Systems Inc., its affiliates and/or subsidiaries. The information is 
intended to be for the use of the individual(s) or entity(ies) named above. If 
you are not the intended recipient (or authorized to receive this e-mail for 
the intended recipient), you may not use, copy, disclose or distribute to 
anyone this message or any information contained in this message. If you have 
received this electronic message in error, please notify us by replying to this 
e-mail.


HealthCheck not working when network problems between the client and a Solr node

2020-10-01 Thread Elizaveta Golova
Hello,
 
We are using Solr 8.5.2.

We are having trouble dealing with network errors between a Solr node and 
a client.
In our situation, our Solr nodes and ZK hosts are healthy and can communicate 
with each other, and all our collections are up and healthy.
 
When we simulate a network problem between a client and a Solr node (whilst 
maintaining the connections and healthy status of everything else), our admin 
health check (HealthCheckRequest) fails with this type of network issue, as 
we get an
"org.apache.solr.client.solrj.SolrServerException: IOException occurred when 
talking to server at: https://solr2:8984/solr "
with the root cause being a 
"java.net.SocketTimeoutException: connect timed out"
(seen in LBSolrClient).
 
In admin commands, it appears that the client's zombie list is only updated, 
and the operation only continues, when the root cause is a ConnectException. 
We can confirm that a ConnectException (substituted manually in the debugger) 
works as we would like: the operation succeeds, and subsequent calls through 
the client treat our blocked node as a zombie.

A SocketTimeoutException does not update the client's zombie list or let the 
operation continue; instead an overall exception is thrown. 
And because the zombie list is not updated, the next time we try with the 
same client we have the same problem: the node that has been blocked is still 
the first one returned in the live-nodes list, and the first one the request 
is sent to.
 
How can we work around this?
 
We have drilled down into the LBSolrClient to have a look.
 
Our main concern is that we believe that this will also be a problem for us 
with Updates.
 
 
An example scenario:
Solr1 on server Solr1
Solr2 on server Solr2
A collection with replication factor 2 with replicas for each shard being 
hosted on both Solr nodes.
An application server is on ApplicationServer1.
Another application server is on ApplicationServer2.
 
The Solr Nodes are up and the collection is healthy.
 
(Depending on the order of the live nodes)
If access is blocked to Solr2 from ApplicationServer1, update from 
ApplicationServer1 should succeed and a health check/ping from 
ApplicationServer1 should return "healthy".
Update from ApplicationServer2 should succeed and health check/ping from 
ApplicationServer2 should return "healthy".
 
If access is then unblocked to Solr2 from ApplicationServer1 but blocked to 
Solr1, then update from ApplicationServer1 fails and a health check/ping from 
ApplicationServer1 throws an exception.
Update from ApplicationServer2 should succeed and health check/ping from 
ApplicationServer2 should return "healthy".
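One possible client-side workaround is to stop relying on the zombie handling 
and probe each known node directly, treating any I/O failure (including 
socket timeouts) as "node unreachable". A minimal sketch, assuming SolrJ 8.x 
(URLs and timeouts are illustrative):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.SolrPing;

public class PingWithFailover {
    // Returns true as soon as any node answers a ping for the collection.
    public static boolean anyNodeHealthy(String[] baseUrls, String collection) {
        for (String url : baseUrls) {
            try (SolrClient client = new HttpSolrClient.Builder(url)
                    .withConnectionTimeout(2000)  // fail fast on blocked nodes
                    .withSocketTimeout(5000)
                    .build()) {
                new SolrPing().process(client, collection);
                return true;  // this node is reachable and healthy
            } catch (Exception e) {
                // SocketTimeoutException lands here too; try the next node
            }
        }
        return false;
    }
}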
 
Redacted stacktrace:

[err] org.apache.solr.client.solrj.SolrServerException: IOException occurred 
when talking to server at: https://solr2:8984/solr
[err] at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:695)
[err] at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:266)
[err] at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
[err] at 
org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:370)
[err] at 
org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:298)
[err] at 
org.apache.solr.client.solrj.impl.BaseCloudSolrClient.sendRequest(BaseCloudSolrClient.java:1157)
[err] at 
org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:918)
[err] at 
org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:850)
[err] at  
(SolrClientProxy.java:136)
[err] at 
[err] at 
[err] at 
[err] at 
[err] at 
[err] at 
[err] at 
[err] Caused by: 
[err] org.apache.http.conn.ConnectTimeoutException: Connect to solr2:8984 
[solr2/172.18.0.6] failed: connect timed out
[err] at 
org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151)
[err] at 
org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374)
[err] at 
org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
[err] at 
org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
[err] at 
org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
[err] at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
[err] at 
org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
[err] at 
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
[err] at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
[err] at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
[err] at 

RE: Solr 7.7 - Few Questions

2020-10-01 Thread Manisha Rahatadkar
I apologize for sending this email again; I don't mean to spam the mailbox, 
but I am looking for urgent help.

We are using Apache Solr 7.7 on the Windows platform. The data is synced to 
Solr using a Solr.Net commit. The data is being synced to SOLR in batches. 
The documents are very large (~0.5GB on average) and Solr indexing is taking 
a long time. The total document size is ~200GB. As the Solr commit is done as 
part of an API call, the API calls are failing because document indexing has 
not completed.


  1.  What is your advice on syncing such a large volume of data to Solr KB.
  2.  Because of the search requirements, almost 8 fields are defined as Text 
fields.
  3.  Currently SOLR_JAVA_MEM is set to 2gb. Is that enough for such a large 
volume of data?
  4.  How to set up Solr in production on Windows? Currently it's set up as a 
standalone engine and the client is requested to take a backup of the drive. Is 
there any better way to do this? How to set up for disaster recovery?
  5.  How to benchmark the system requirements for such a large data set

Thanks in advance.

Regards
Manisha Rahatadkar


Confidentiality Notice

This email message, including any attachments, is for the sole use of the 
intended recipient and may contain confidential and privileged information. Any 
unauthorized view, use, disclosure or distribution is prohibited. If you are 
not the intended recipient, please contact the sender by reply email and 
destroy all copies of the original message. Anju Software, Inc. 4500 S. 
Lakeshore Drive, Suite 620, Tempe, AZ USA 85282.


Re: Solr 7.6 query performance question

2020-10-01 Thread raj.yadav
harjags wrote
> The errors below are very common in 7.6, and we have Solr nodes failing with
> tanking memory.
> 
> The request took too long to iterate over terms. Timeout: timeoutAt:
> 162874656583645 (System.nanoTime(): 162874701942020),
> TermsEnum=org.apache.lucene.codecs.blocktree.SegmentTermsEnum@74507f4a
> 
> or 
> 
> #*BitSetDocTopFilter*]; The request took too long to iterate over terms.
> Timeout: timeoutAt: 33288640223586 (System.nanoTime(): 33288700895778),
> TermsEnum=org.apache.lucene.codecs.blocktree.SegmentTermsEnum@5e458644
> 
> 
> or 
> 
> #SortedIntDocSetTopFilter]; The request took too long to iterate over
> terms.
> Timeout: timeoutAt: 552497919389297 (System.nanoTime(): 552508251053558),
> TermsEnum=org.apache.lucene.codecs.blocktree.SegmentTermsEnum@60b7186e
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



We are also seeing such errors in our log, but our nodes are not failing,
and the frequency of such warnings is less than 5% of overall traffic.
What does this error mean?
Can someone elaborate on the following:
1. What does `The request took too long to iterate over terms` mean? 
2. What are `BitSetDocTopFilter` and `SortedIntDocSetTopFilter`?

Regards,
Raj



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Transaction not closed on ms sql

2020-10-01 Thread Erick Erickson
First of all, I’d just use a stand-alone program to do your 
processing for a number of reasons, see:

https://lucidworks.com/post/indexing-with-solrj/

1- I suspect your connection will be closed eventually. Since it’s expensive to
open one of these, the driver may keep it open for a while.

2 - This is one of the reasons I'd go with something outside Solr. The
link above gives you a skeletal program that'll show you how. It
has the usual problem of demo code: it needs more error checking
and the like.

3 - see TolerantUpdateProcessor(Factory).
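A sketch of wiring it into solrconfig.xml (the chain name and maxErrors 
value are illustrative); batches then report per-document failures instead 
of aborting the whole update:

<updateRequestProcessorChain name="tolerant-chain">
  <processor class="solr.TolerantUpdateProcessorFactory">
    <int name="maxErrors">10</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>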

Best,
Erick

> On Sep 30, 2020, at 10:43 PM, yaswanth kumar  wrote:
> 
> Can someone help in troubleshooting some issues that are happening with DIH?
> 
> Solr version: 8.2; ZooKeeper 3.4
> Solr cloud with 4 nodes and 3 ZooKeepers
> 
> 1. Configured DIH for MS SQL with the MSSQL JDBC driver. When trying to pull 
> the data from MSSQL it connects and fetches records, but we see that the 
> connection opened on the MSSQL side is not closed even after the full import 
> has completed. Need some help troubleshooting why it leaves connections open.
> 
> 2. I have scheduled this import API call as a util that hits the DIH API 
> every minute via a Solr pool URL. With this it looks like multiple calls go 
> out from different Solr nodes, which I don't want; I always need the call to 
> be handled by only one node. Can we control this with any config? Or is this 
> happening because I have three ZooKeepers? Please suggest the best approach.
> 
> 3. I do see some records shown as failed while doing the import. Is there a 
> way to track these failures, i.e. why a small number of records are failing?
> 
> 
> 
> Sent from my iPhone



Daylight savings time issue using NOW in Solr 6.1.0

2020-10-01 Thread vishal patel

Hi

I am using Solr 6.1.0. My SOLR_TIMEZONE=UTC in solr.in.cmd.
My current Solr server machine's time zone is also UTC.

One of my collections has the field below in its schema.


Suppose my current Solr server machine time is 2020-10-01 10:00:00.000. I have 
one document in that collection, and in that document action_date is 
2020-10-01T09:45:46Z.
When I search in Solr with action_date:[2020-10-01T08:00:00Z TO NOW], that 
record is not returned. I checked my Solr log and found that the time differed 
between the Solr log time and the Solr server machine time (almost 1 hour 
difference).

Why do I not get the result? Why is NOW not taking 2020-10-01T10:00:00Z?
Which time does "NOW" use? Is the difference due to daylight saving time? 
How can I configure or change the timezone to account for daylight saving 
time?

Regards,
Vishal



Solr client in JavaScript

2020-10-01 Thread Sunil Dash
This is my JavaScript code, from which I am calling Solr, which has a
loaded nutch core (index).
My JavaScript client (which runs on a Tomcat server) and the Solr
server are on the same machine (10.21.6.100). Maybe it is due to
cross-domain reference issues, or something is missing -- I don't know.
I expected the response from the Solr server (the search result) as a raw
JSON object. Kindly help me fix it. Thanks in advance.

Rgds
Sunil Kumar



<html>
<head>
  <title>Solr Search</title>
  <script type="text/javascript">
  function search()
  {
    var xmlHttpReq = false;
    var xmlHttpClient = this;

    var hostURL = 'http://10.21.6.100:8983/solr/nutch/select';
    var querystring = document.getElementById("querystring").value;
    // encodeURIComponent instead of the deprecated escape();
    // append '&wt=json' here if a JSON response is wanted
    var qstr = 'q=' + encodeURIComponent(querystring) + "&fl=content";

    if (window.XMLHttpRequest) { xmlHttpClient.xmlHttpReq = new XMLHttpRequest(); }

    xmlHttpClient.xmlHttpReq.open('POST', hostURL, true);

    xmlHttpClient.xmlHttpReq.setRequestHeader('Content-Type',
        'application/x-www-form-urlencoded');

    // Register the handler before send(), and note that JavaScript is
    // case-sensitive: the call must match showResponse exactly
    xmlHttpClient.xmlHttpReq.onreadystatechange = function()
    {
      if (xmlHttpClient.xmlHttpReq.readyState == 4)
        { showResponse(xmlHttpClient.xmlHttpReq.responseText); }
    }

    xmlHttpClient.xmlHttpReq.send(qstr);
  }

  function showResponse(str)
  {
    document.getElementById("responsestring").innerHTML = str;
  }
  </script>
</head>
<body>
  <h2>Solr Search [ From Javascript ]</h2>
  <input type="text" id="querystring"/>
  <button type="button" onclick="search()">Search</button>
  <div id="responsestring"></div>
</body>
</html>



Non Deterministic Results from /admin/luke

2020-10-01 Thread Nussbaum, Ronen
Hi All,

We are using the Luke API in order to get all dynamic field names from our 
collection:
/solr/collection/admin/luke?wt=csv&numTerms=0

This worked fine in 6.2.1 but it's non-deterministic in 8.6.1 -- it looks 
like it queries a single random shard.

I've tried using /solr/collection/select?q=*:*&wt=csv&rows=0 but it 
behaves the same.

Can it be configured to query all shards?
Is there another way to achieve this?

Thanks in advance,
Ronen.






Proposals for health checks new parameters

2020-10-01 Thread Taisuke Miyazaki
Hi,
I want to add a parameter to the handler for health checks.
In our case, we want to add a parameter like "failIfEmptyCores", because we
want a node with no cores to report an error.
I don't think it will change the existing behavior, since the default would
be false.
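As a sketch of the intended use (failIfEmptyCores is the parameter proposed 
here, not something that exists today; the endpoint path assumes the health 
check handler of Solr 7.3+):

curl 'http://localhost:8983/solr/admin/info/health?failIfEmptyCores=true'

A node hosting no cores would then answer with an error status instead of 
"healthy".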

What do you think about this change?

I have code in hand that appears to work.
If that sounds OK, I'd like to write a patch for it.

Best regards,
taisuke


Re: advice on whether to use stopwords for use case

2020-10-01 Thread Derek Poh
Yes, the requirement (for now) is not to return any results. I think 
they may change the requirements, pending their return from the holidays.



If so, then check for those words in the query before sending it to Solr.

That is what I think too.

Thinking further: using stopwords for this, there will still be results 
returned when the number of words in the search keywords is more than the 
stopwords.


On 1/10/2020 2:57 am, Walter Underwood wrote:

I’m not clear on the requirements. It sounds like the query “cigar” or “cuban 
cigar”
should return zero results. Is that right?

If so, then check for those words in the query before sending it to Solr.

But the stopwords approach seems like the requirement is different. Could you 
give
some examples?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Sep 30, 2020, at 11:53 AM, Alexandre Rafalovitch  wrote:

You may also want to look at something like: https://docs.querqy.org/index.html

ApacheCon had (is having..) a presentation on it that seemed quite
relevant to your needs. The videos should be live in a week or so.

Regards,
   Alex.

On Tue, 29 Sep 2020 at 22:56, Alexandre Rafalovitch  wrote:

I am not sure why you think stop words are your first choice. Maybe I
misunderstand the question. I read it as that you need to exclude
completely a set of documents that include specific keywords when
called from specific module.

If I wanted to differentiate the searches from specific module, I
would give that module a different end-point (Request Query Handler),
instead of /select. So, /nocigs or whatever.

Then, in that end-point, you could do all sorts of extra things, such
as setting appends or even invariants parameters, which would include
filter query to exclude any documents matching specific keywords. I
assume it is ok to return documents that are matching for other
reasons.

Ideally, you would mark the cigs documents during indexing with a
binary or enumeration flag and then during search you just need to
check against that flag. In that case, you could copyField  your text
and run it against something like
https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
combined with Shingles for multiwords. Or similar. And just transform
it as index-only so that the result is basically a yes/no flag.
Similar thing could be done with UpdateRequestProcessor pipeline if
you want to end up with a true boolean flag. The idea is the same,
just to have an index-only flag that you force lock into for any
request from specific module.

Or even with something like ElevationSearchComponent. Same idea.

Hope this helps.

Regards,
   Alex.

On Tue, 29 Sep 2020 at 22:28, Derek Poh  wrote:

Hi

I have read in the mailings list that we should try to avoid using stop
words.

I have a use case where I would like to know if there is other
alternative solutions beside using stop words.

There is business requirement to return zero result when the search is
cigarette related words and the search is coming from a particular
module on our site. It does not apply to all searches from our site.
There is a list of these cigarette related words. This list contains
single word, multiple words (Electronic cigar), multiple words with
punctuation (e-cigarette case).
I am planning to copy a different set of search fields, that will
include the stopword filter in the index and query stage, for this
module to use.

For this use case, other than using stop words to handle it, is there
any alternative solution?

Derek

--
CONFIDENTIALITY NOTICE

This e-mail (including any attachments) may contain confidential and/or 
privileged information. If you are not the intended recipient or have received 
this e-mail in error, please inform the sender immediately and delete this 
e-mail (including any attachments) from your computer, and you must not use, 
disclose to anyone else or copy this e-mail (including any attachments), 
whether in whole or in part.

This e-mail and any reply to it may be monitored for security, legal, 
regulatory compliance and/or other appropriate reasons.






Re: advice on whether to use stopwords for use case

2020-10-01 Thread Derek Poh

Hi Alex

The business requirement (for now) is not to return any result when the 
search keywords are cigarette related. The business user team will 
provide the list of cigarette-related keywords.


I will digest, explore and research your suggestions. Thank you.

On 30/9/2020 10:56 am, Alexandre Rafalovitch wrote:

I am not sure why you think stop words are your first choice. Maybe I
misunderstand the question. I read it as that you need to exclude
completely a set of documents that include specific keywords when
called from specific module.

If I wanted to differentiate the searches from specific module, I
would give that module a different end-point (Request Query Handler),
instead of /select. So, /nocigs or whatever.

Then, in that end-point, you could do all sorts of extra things, such
as setting appends or even invariants parameters, which would include
filter query to exclude any documents matching specific keywords. I
assume it is ok to return documents that are matching for other
reasons.

Ideally, you would mark the cigs documents during indexing with a
binary or enumeration flag and then during search you just need to
check against that flag. In that case, you could copyField  your text
and run it against something like
https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
combined with Shingles for multiwords. Or similar. And just transform
it as index-only so that the result is basically a yes/no flag.
Similar thing could be done with UpdateRequestProcessor pipeline if
you want to end up with a true boolean flag. The idea is the same,
just to have an index-only flag that you force lock into for any
request from specific module.

Or even with something like ElevationSearchComponent. Same idea.

Hope this helps.

Regards,
Alex.
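A rough sketch of Alex's flag-field idea in schema terms (all names and the 
keyword file are placeholders; the shingles let multi-word entries like 
"electronic cigar" survive as single tokens for the keep-word check):

<fieldType name="cig_flag_text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- build up to 3-word phrases so multi-word keywords can match -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
            outputUnigrams="true"/>
    <!-- keep only the cigarette-related terms; everything else is dropped -->
    <filter class="solr.KeepWordFilterFactory" words="cig_keywords.txt"
            ignoreCase="true"/>
  </analyzer>
</fieldType>
<field name="cig_terms" type="cig_flag_text" indexed="true" stored="false"/>
<copyField source="product_name" dest="cig_terms"/>

The no-cigs endpoint could then append something like fq=-cig_terms:[* TO *] 
to drop any document that matched one of the keywords.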

On Tue, 29 Sep 2020 at 22:28, Derek Poh  wrote:

Hi

I have read in the mailings list that we should try to avoid using stop
words.

I have a use case where I would like to know if there is other
alternative solutions beside using stop words.

There is business requirement to return zero result when the search is
cigarette related words and the search is coming from a particular
module on our site. It does not apply to all searches from our site.
There is a list of these cigarette related words. This list contains
single word, multiple words (Electronic cigar), multiple words with
punctuation (e-cigarette case).
I am planning to copy a different set of search fields, that will
include the stopword filter in the index and query stage, for this
module to use.

For this use case, other than using stop words to handle it, is there
any alternative solution?

Derek

--
CONFIDENTIALITY NOTICE

This e-mail (including any attachments) may contain confidential and/or 
privileged information. If you are not the intended recipient or have received 
this e-mail in error, please inform the sender immediately and delete this 
e-mail (including any attachments) from your computer, and you must not use, 
disclose to anyone else or copy this e-mail (including any attachments), 
whether in whole or in part.

This e-mail and any reply to it may be monitored for security, legal, 
regulatory compliance and/or other appropriate reasons.


