Re: SolrClient.query take a 'collection' argument

2020-06-05 Thread Shawn Heisey

On 6/5/2020 4:24 PM, Jim Anderson wrote:

I am running my first solrj program and it is crashing when I call the
method

client.query("coreName",queryParms)

The API doc says the string should be a collection. I'm still not sure
about the difference between a collection and a core, so what I am doing is
likely illegal. Given that I have created a core, should I create a collection from
it so that I can truly pass a collection name to the query function?


The concept of a collection comes from SolrCloud.  A collection is made 
up of one or more shards.  A shard is made up of one or more replicas. 
Each replica is a core.  If you're not running SolrCloud, then you do 
not have collections.


Wherever the SolrJ docs say "collection" as a parameter for a request, it 
is likely that you can think "core" instead and have it still be 
correct.  If you're running SolrCloud, you'll want to be very careful to 
know the difference.


It seems very odd for a SolrJ query to cause the program to crash.  It 
would be pretty common for it to throw an exception, but that's not the 
same as a crash, unless exception handling is incorrect or missing.
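
As a minimal SolrJ 8.x sketch (the URLs and the core/collection names below are 
placeholders), the two cases look like this, with the exception caught rather 
than left to take the program down:

    import java.util.Collections;
    import java.util.Optional;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class QueryExample {
        public static void main(String[] args) {
            SolrQuery params = new SolrQuery("*:*");

            // Standalone Solr: the String argument to query() is a core name.
            try (SolrClient core = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
                QueryResponse rsp = core.query("coreName", params);
                System.out.println("standalone numFound: " + rsp.getResults().getNumFound());
            } catch (Exception e) {
                // An uncaught exception here is what usually looks like a "crash".
                e.printStackTrace();
            }

            // SolrCloud: the String argument to query() is a collection name.
            try (SolrClient cloud = new CloudSolrClient.Builder(
                    Collections.singletonList("localhost:9983"), Optional.empty()).build()) {
                QueryResponse rsp = cloud.query("collectionName", params);
                System.out.println("cloud numFound: " + rsp.getResults().getNumFound());
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }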


Thanks,
Shawn


SolrClient.query take a 'collection' argument

2020-06-05 Thread Jim Anderson
I am running my first solrj program and it is crashing when I call the
method

client.query("coreName",queryParms)

The API doc says the string should be a collection. I'm still not sure
about the difference between a collection and a core, so what I am doing is
likely illegal. Given that I have created a core, create a collection from
it so that I can truly pass a collection name to the query function?

Jim A.


Getting to grips with auto-scaling

2020-06-05 Thread Tom Evans
Hi

I'm trying to get a handle on the newer auto-scaling features in Solr.
We're in the process of upgrading an older SolrCloud cluster from 5.5
to 8.5, and re-architect it slightly to improve performance and
automate operations.

If I boil it down slightly, currently we have two collections, "items"
and "lists". Both collections have just one shard. We publish new data
to "items" once each day, and our users search and do analysis on
them, whilst "lists" contains NRT user-specified collections of ids
from items, which we join to from "items" in order to allow them to
restrict their searches/analysis to just docs in their curated lists.

Most of our searches have specific date ranges in them, usually only
from the last 3 years or so, but sometimes we need to do searches
across all the data. With the new setup, we want to:

* shard by date (year) to make the hottest data available in smaller shards
* have more nodes with these shards than we do of the older data.
* be able to add/remove nodes predictably based upon our clients
(predictable) query load
* use TLOG for "items" and NRT for "lists", to avoid unnecessary
indexing load for "items" and have NRT for "lists".
* spread cores across two AZ

With that in mind, I came up with a bunch of simplified rules for
testing, with just 4 shards for "items":

* "lists" collection has one NRT replica on each node
* "items" collection shard 2020 has one TLOG replica on each node
* "items" collection shard 2019 has one TLOG replica on 75% of nodes
* "items" collection shards 2018 and 2017 each have one TLOG replica
on 50% of nodes
* all shards have at least 2 replicas if number of nodes > 1
* no node should have 2 replicas of the same shard
* number of cores should be balanced across nodes

Eg, with 1 node, I want to see this topology:
A: items: 2020, 2019, 2018, 2017 + lists

with 2 nodes:
A: items: 2020, 2019, 2018, 2017 + lists
B: items: 2020, 2019, 2018, 2017 + lists

and if I add two more nodes:
A: items: 2020, 2019, 2018 + lists
B: items: 2020, 2019, 2017 + lists
C: items: 2020, 2019, 2017 + lists
D: items: 2020, 2018 + lists

To the questions:

* The type of replica created when nodeAdded is triggered can't be set
per collection. Either everything gets NRT or everything gets TLOG.
Even if I specify nrtReplicas=0 when creating a collection, nodeAdded
will add NRT replicas if configured that way.
* I'm having difficulty expressing these rules in terms of a policy -
I can't seem to figure out a way to specify the number of replicas for
a shard based upon the total number of nodes.
* Is this beyond the current scope of autoscaling triggers/policies?
Should I instead use the trigger with a custom plugin action (or to
trigger a web hook) to be a bit more intelligent?
* Am I wasting my time trying to ensure there are more replicas of the
hotter shards than of the colder shards? It adds a lot of complexity -
should I instead just accept that the cold shards aren't queried much,
so they won't be using up cache space that the hot shards need? Disk
space is pretty cheap after all (total size for "items" + "lists" is
under 60GB).

Cheers

Tom


Re: Edismax query using different strings for different fields

2020-06-05 Thread Erick Erickson
Let’s see the results of adding debug=query to the query, in particular the
parsed version.

Because what you’re reporting doesn’t really make sense. edismax should be
totally ignoring the “qf” parameter since you’re specifically qualifying all
the clauses with a field. Unless you’re not really enclosing the search text
in parentheses (or quotes if they should be phrases).

Also, if you’re willing to form separate clauses like this, there’s no reason
to even use edismax, since its purpose is to automatically distribute search
terms over multiple fields and you’re explicitly specifying the fields.
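
As a rough sketch (field names taken from your message; the URL and collection
name are placeholders), qualifying every clause and turning on debug would look
something like:

    // imports: org.apache.solr.client.solrj.SolrClient, org.apache.solr.client.solrj.SolrQuery,
    //          org.apache.solr.client.solrj.impl.HttpSolrClient,
    //          org.apache.solr.client.solrj.response.QueryResponse
    SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build();  // placeholder URL

    SolrQuery q = new SolrQuery();
    q.setQuery("(text_part1_en:(a sentence in english) OR text_part1_de:(ein satz auf deutsch))"
            + " OR (text_part2_en:(a sentence in english) OR text_part2_de:(ein satz auf deutsch))");
    q.set("defType", "lucene");  // every clause names its field, so no qf is needed
    q.set("debug", "query");     // inspect "parsedquery" in the debug section of the response
    QueryResponse rsp = client.query("collectionName", q);   // "collectionName" is a placeholder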

Best,
Erick

> On Jun 5, 2020, at 10:10 AM, David Zimmermann  wrote:
> 
> I could use some advice on how to handle a particular cross-language search 
> with Solr. I posted it on Stackoverflow 2 months ago, but could not find a 
> solution.
> I have documents in 3 languages (English, German, French). For simplicity 
> let's assume it's just two languages (English and German). The documents are 
> standardised in the sense that they contain the same parts (text_part1 and 
> text_part2), just the language they are written in is different. The language 
> of the documents is known. In my index schema I use one core with different 
> fields for each language.
> 
> For a German document the index will look something like this:
> 
>  *   text_part1_en: empty
>  *   text_part2_en: empty
>  *   text_part1_de: German text
>  *   text_part2_de: Another German text
> 
> For an English document it will be the other way around.
> 
> What I want to achieve: A user entering a query in English should receive 
> both English and German documents that are relevant to their search. Further 
> conditions are:
> 
>  *   I want results with hits in text_part1 and text_part2 to be higher 
> ranked than results with hits only in one field (tie value > 0).
>  *   The queries will not be single words, but full sentences (stop word 
> removal needed and partial hits [only a few words out of the sentences] must 
> be valid).
>  *   English and German documents must output into one ranking. I need to be 
> able to compare the relevance of an English document to the relevance of a 
> German document.
>  *   The text parts need to stay separate; I want to be able to boost the importance of 
> one (let's say part1) over the other.
> 
> My general approach so far has been to get a German translation of the user's 
> query by sending it to a translation API. Then I want to use an edismax query, 
> since it seems to fulfill all of my requirements. The problem is that I 
> cannot manage to search for the German query in the German fields and the 
> English query in the English fields only. The Solr edismax 
> documentation
>  states that it supports the full Lucene query parser syntax, but I can't 
> find a way to address different fields with different inputs. I tried:
> 
> q=text_part1_en: (A sentence in English) text_part1_de: (Ein Satz auf 
> Deutsch) text_part2_en: (A sentence in English) text_part2_de: (Ein Satz auf 
> Deutsch)
> qf=text_part1_en text_part2_en text_part1_de text_part2_de
> 
> 
> This syntax should be in line with what MatsLindh wrote in this 
> thread.
>  I tried different versions of writing this q, but whatever I do Solr always 
> searches for the full q string in all four fields given by qf, which totally 
> messes up the result. Am I just making mistakes in the query syntax or is it 
> even possible to do what I'm trying to do using edismax?
> 
> Any help would be highly appreciated.



Re: Solr takes time to warm up core with huge data

2020-06-05 Thread Shawn Heisey

On 6/5/2020 12:17 AM, Srinivas Kashyap wrote:

q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000
&sort=MODIFY_TS desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc

This was the original query. Since there were a lot of sort fields, we decided
not to sort on the Solr side, and instead fetch the query response and do the
sorting outside Solr. This eliminated the need for the extra JVM memory that
had been allocated. Every time we ran the original query, Solr would crash by
exceeding the JVM memory. Now we are only running filter queries.


What Solr version, and what is the definition of each of the fields 
you're sorting on?  If the definition doesn't include docValues, then a 
large on-heap memory structure will be created for sorting (VERY large 
with 500 million docs), and I wouldn't be surprised if it's created even 
if it is never used.  The definition for any field you use for sorting 
should definitely include docValues.  In recent Solr versions, docValues 
defaults to true for most field types.  Some field classes, TextField in 
particular, cannot have docValues.


There's something else to discuss about sort params -- each sort field 
will only be used if ALL of the previous sort fields are identical for 
two documents in the full numFound result set.  Having more than two or 
three sort fields is usually pointless.  My guess (which I know could be 
wrong) is that most queries with this HUGE sort parameter will never use 
anything beyond TRACK_ID.



And regarding the filter cache, it is in default setup: (we are using default 
solrconfig.xml, and we have only added the request handler for DIH)

<filterCache size="512" initialSize="512" autowarmCount="0"/>


This is way too big for your index, and a prime candidate for why your 
heap requirements are so high.  Like I said before, if the filterCache 
on your system actually reaches this max size, it will require 30GB of 
memory JUST for the filterCache on this core.  Can you check the admin 
UI to determine what the size is and what hit ratio it's getting? (1.0 
is 100% on the hit ratio).  I'd probably start with a size of 32 or 64 
on this cache.  With a size of 64, a little less than 4GB would be the 
max heap allocated for the cache.  You can experiment... but with 500 
million docs, the filterCache size should be pretty small.
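
For reference, the back-of-the-envelope arithmetic behind those numbers (each 
filterCache entry is a bitset with one bit per document in the core):

  497,767,038 docs / 8 bits per byte ~= 59 MiB per cached filter
  512 entries x 59 MiB ~= 30 GiB if the cache ever fills
   64 entries x 59 MiB ~= 3.7 GiB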


You're going to want to carefully digest this part of that wiki page 
that I linked earlier.  Hopefully email will preserve this link completely:


https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems#SolrPerformanceProblems-Reducingheaprequirements

Thanks,
Shawn


RE: Script to check if solr is running

2020-06-05 Thread Dunigan, Craig A.
A simple cronjob running bin/solr status and bin/solr start should do the trick.  There must be a Windows 
equivalent if that’s what you’re using.

From: Ryan W 
Sent: Thursday, June 4, 2020 11:39 AM
To: solr-user@lucene.apache.org
Subject: Re: Script to check if solr is running


Or is it not much overhead to give the command to start solr if it is
already running? Maybe it's not necessary to check if it's running? Is
there any downside to giving the start command every 15 minutes or so
whether it is running or not?

Thanks.

On Thu, Jun 4, 2020 at 12:36 PM Ryan W <rya...@gmail.com> wrote:

> Does anyone have a script that checks if solr is running and then starts
> it if it isn't running? Occasionally my solr stops running even if there
> has been no Apache restart. I haven't been able to determine the root
> cause, so the next best thing might be to check every 15 minutes or so if
> it's running and run it if it has stopped.
>
> Thanks.
>


Re: Script to check if solr is running

2020-06-05 Thread Walter Underwood
Most Linux distros are using systemd to manage server processes.

https://en.wikipedia.org/wiki/Systemd 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 5, 2020, at 8:08 AM, Mark H. Wood  wrote:
> 
> On Thu, Jun 04, 2020 at 12:36:30PM -0400, Ryan W wrote:
>> Does anyone have a script that checks if solr is running and then starts it
>> if it isn't running?  Occasionally my solr stops running even if there has
>> been no Apache restart.  I haven't been able to determine the root cause,
>> so the next best thing might be to check every 15 minutes or so if it's
>> running and run it if it has stopped.
> 
> I've used Monit for things that must be kept running:
> 
>  https://mmonit.com/monit/
> 
> -- 
> Mark H. Wood
> Lead Technology Analyst
> 
> University Library
> Indiana University - Purdue University Indianapolis
> 755 W. Michigan Street
> Indianapolis, IN 46202
> 317-274-0749
> www.ulib.iupui.edu



Re: Script to check if solr is running

2020-06-05 Thread Mark H. Wood
On Thu, Jun 04, 2020 at 12:36:30PM -0400, Ryan W wrote:
> Does anyone have a script that checks if solr is running and then starts it
> if it isn't running?  Occasionally my solr stops running even if there has
> been no Apache restart.  I haven't been able to determine the root cause,
> so the next best thing might be to check every 15 minutes or so if it's
> running and run it if it has stopped.

I've used Monit for things that must be kept running:

  https://mmonit.com/monit/

-- 
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu




Edismax query using different strings for different fields

2020-06-05 Thread David Zimmermann
I could use some advice on how to handle a particular cross-language search 
with Solr. I posted it on Stackoverflow 2 months ago, but could not find a 
solution.
I have documents in 3 languages (English, German, French). For simplicity let's 
assume it's just two languages (English and German). The documents are 
standardised in the sense that they contain the same parts (text_part1 and 
text_part2), just the language they are written in is different. The language 
of the documents is known. In my index schema I use one core with different 
fields for each language.

For a German document the index will look something like this:

  *   text_part1_en: empty
  *   text_part2_en: empty
  *   text_part1_de: German text
  *   text_part2_de: Another German text

For an English document it will be the other way around.

What I want to achieve: a user entering a query in English should receive both
English and German documents that are relevant to their search. Further 
conditions are:

  *   I want results with hits in text_part1 and text_part2 to be higher ranked 
than results with hits only in one field (tie value > 0).
  *   The queries will not be single words, but full sentences (stop word 
removal needed and partial hits [only a few words out of the sentences] must be 
valid).
  *   English and German documents must output into one ranking. I need to be 
able to compare the relevance of an English document to the relevance of a 
German document.
  *   The text parts need to stay separate; I want to be able to boost the importance of 
one (let's say part1) over the other.

My general approach so far has been to get a German translation of the user's 
query by sending it to a translation API. Then I want to use an edismax query, 
since it seems to fulfill all of my requirements. The problem is that I cannot 
manage to search for the German query in the German fields and the English 
query in the English fields only. The Solr edismax 
documentation
 states that it supports the full Lucene query parser syntax, but I can't find 
a way to address different fields with different inputs. I tried:

q=text_part1_en: (A sentence in English) text_part1_de: (Ein Satz auf Deutsch) 
text_part2_en: (A sentence in English) text_part2_de: (Ein Satz auf Deutsch)
qf=text_part1_en text_part2_en text_part1_de text_part2_de


This syntax should be in line with what MatsLindh wrote in this 
thread.
 I tried different versions of writing this q, but whatever I do Solr always 
searches for the full q string in all four fields given by qf, which totally 
messes up the result. Am I just making mistakes in the query syntax or is it 
even possible to do what I'm trying to do using edismax?

Any help would be highly appreciated.


Re: Limit Solr Disk IO

2020-06-05 Thread Erick Erickson
Have you considered TLOG/PULL replicas rather than NRT replicas? 
That way, all the indexing happens on a single machine and you can
use shards.preference to confine the searches to the PULL replicas,
see:  https://lucene.apache.org/solr/guide/7_7/distributed-requests.html
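
For the query side, a rough SolrJ sketch (the ZooKeeper address and collection
name are placeholders) of steering searches toward the PULL replicas:

    // imports: java.util.Collections, java.util.Optional,
    //          org.apache.solr.client.solrj.SolrQuery,
    //          org.apache.solr.client.solrj.impl.CloudSolrClient,
    //          org.apache.solr.client.solrj.response.QueryResponse
    CloudSolrClient client = new CloudSolrClient.Builder(
            Collections.singletonList("zk1:2181"), Optional.empty()).build();   // placeholder ZK host

    SolrQuery q = new SolrQuery("*:*");
    // Prefer PULL replicas and fall back to TLOG, so searches stay off the replicas doing the indexing.
    q.set("shards.preference", "replica.type:PULL,replica.type:TLOG");
    QueryResponse rsp = client.query("collection1", q);   // "collection1" is a placeholder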

No, you can’t really limit the number of segments. While that seems like a
good idea, it quickly becomes counter-productive. Say you require that you
have 10 segments. Say each one becomes 10G. What happens when the 11th
segment is created and it’s 100M? Do you rewrite one of the 10G segments just
to add 100M? Your problem gets worse, not better.


Best,
Erick

> On Jun 5, 2020, at 1:41 AM, Anshuman Singh  wrote:
> 
> Hi Nicolas,
> 
> Commit happens automatically at 100k documents. We don't commit explicitly.
> We didn't limit the number of segments. There are 35+ segments in each core.
> But unrelated to the question, I would like to know if we can limit the
> number of segments in the core. I tried it in the past but the merge
> policies don't allow that.
> The TieredMergePolicy has two parameters, maxMergeAtOnce and
> segmentsPerTier. It seems like we cannot control the total number of
> segments but only the segments per tier.(
> http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
> )
> 
> 
> On Thu, Jun 4, 2020 at 5:48 PM Nicolas Franck 
> wrote:
> 
>> The real questions are:
>> 
>> * how often do you commit (either explicitly or automatically)?
>> * how many segments do you allow? If you only allow 1 segment,
>>  then that whole segment is recreated using the old documents and the
>> updates.
>>  And yes, that requires reading the old segment.
>>  It is common to allow multiple segments when you update often,
>>  so updating does not interfere with reading the index too often.
>> 
>> 
>>> On 4 Jun 2020, at 14:08, Anshuman Singh 
>> wrote:
>>> 
>>> I noticed that while indexing, when commit happens, there is high disk
>> read
>>> by Solr. The problem is that it is impacting search performance when the
>>> index is loaded from the disk with respect to the query, as the disk read
>>> speed is not quite good and the whole index is not cached in RAM.
>>> 
>>> When no searching is performed, I noticed that disk is usually read
>> during
>>> commit operations and sometimes even without commit at low rate. I guess
>> it
>>> is read due to segment merge operations. Can it be something else?
>>> If it is merging, can we limit disk IO during merging?
>> 
>> 



Atomic updates with nested documents

2020-06-05 Thread Ludger Steens
Dear Community,



I am using Solr 7.7 and I am wondering how it is possible to do a partial
update on nested documents / child documents.

Suppose I have committed the following documents to the index:

[
  {
    "id": "1",
    "testString": "1",
    "testInt": "1",
    "_childDocuments_": [
      {
        "id": "1.1",
        "child_type": "child_a",
        "testString": "1.1",
        "testInt": "1"
      },
      {
        "id": "1.2",
        "child_type": "child_a",
        "testString": "1.1",
        "testInt": "1"
      }
    ]
  }
]

The uniqueKey is "id", all fields are indexed.



Now I want to update testInt to 2 on the parent document without losing the
parent child relation (ChildDocTransformerFactory should still produce
correct results).

I tried the following variants, both not successful:



*Variant 1:*

Sending the following update document to the update endpoint:

[
  {
    "id": "1",
    "testInt": {
      "set": "2"
    }
  }
]

The parent document is updated, but the ChildDocTransformerFactory does not
return any child documents



*Variant 2:*

Sending the following update document to the update endpoint:

[
  {
    "id": "1",
    "testInt": {
      "set": "2"
    },
    "_childDocuments_": [
      {
        "id": {
          "set": "1.1"
        }
      },
      {
        "id": {
          "set": "1.2"
        }
      }
    ]
  }
]

Same result: Parent is updated, but ChildDocTransformerFactory does not
return any child documents





Is there any other way of doing a partial update without losing the parent
child relation?

Resending the complete document with all attributes and children would work,
but is inefficient for us (we would have to load all documents from Solr before
resending them).
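
For clarity, this is the re-send workaround I mean, as a SolrJ sketch (values
taken from the example above; "client" and "collectionName" are placeholders):

    // import org.apache.solr.common.SolrInputDocument;
    // Re-send the whole block: the parent with the changed value, plus all children.
    SolrInputDocument parent = new SolrInputDocument();
    parent.addField("id", "1");
    parent.addField("testString", "1");
    parent.addField("testInt", "2");                  // the updated value

    SolrInputDocument child1 = new SolrInputDocument();
    child1.addField("id", "1.1");
    child1.addField("child_type", "child_a");
    child1.addField("testString", "1.1");
    child1.addField("testInt", "1");
    parent.addChildDocument(child1);

    SolrInputDocument child2 = new SolrInputDocument();
    child2.addField("id", "1.2");
    child2.addField("child_type", "child_a");
    child2.addField("testString", "1.1");
    child2.addField("testInt", "1");
    parent.addChildDocument(child2);

    client.add("collectionName", parent);             // "client" is an existing SolrClient
    client.commit("collectionName");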



Thanks in advance for your help



Ludger


--

*„Beste Arbeitgeber ITK 2020“ - 1st place for QAware*
awarded by Great Place to Work

--

Ludger Steens
Software Architect

QAware GmbH
Aschauer Straße 32
81549 München, Germany
Mobil +49 175 7973969
ludger.ste...@qaware.de
www.qaware.de
--

Managing Directors: Christian Kamm, Johannes Weigend, Dr. Josef Adersberger
Register court: Munich
Commercial register number: HRB 163761


Re: Solr takes time to warm up core with huge data

2020-06-05 Thread Erick Erickson
My suspicion, as others have said, is that you simply have too much data on
too little hardware. Solr definitely should not be taking this long. Or rather,
if Solr is taking this long to start up you have a badly undersized system and
until you address that you’ll just be going ‘round in circles.

Lucene uses MMapDirectory to use OS memory space for almost all of the
actual index, see: 
https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
and you have 82G of index, and you only have 8G of OS memory space to hold it.

It’s certainly worth looking at how you use your index and whether you can 
make it smaller, but I’d say you simply won’t get satisfactory performance on 
such
constrained hardware.

You really need to go through “the sizing exercise” to see what your hardware 
and
usage patterns are, see: 
https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick

> On Jun 5, 2020, at 3:48 AM, Srinivas Kashyap 
>  wrote:
> 
> Hi Jörn,
> 
> I think, you missed my explanation. We are not using sorting now:
> 
> The original query:
> 
> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000
> &sort=MODIFY_TS desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
> 
> But now, I have removed sorting as shown below. The sorting is being done 
> outside solr:
> 
> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000
> 
> Also, we are writing custom code to index by discarding DIH too. When I 
> restart the solr, this core with huge data takes time to even show up the 
> query admin GUI console. It takes around 2 hours to show.
> 
> My question is, even for the simple query with filter query mentioned as 
> shown above, it is consuming JVM memory. So, how much memory or what 
> configuration should I be doing on solrconfig.xml to make it work.
> 
> Thanks,
> Srinivas
> 
> From: Jörn Franke 
> Sent: 05 June 2020 12:30
> To: solr-user@lucene.apache.org
> Subject: Re: Solr takes time to warm up core with huge data
> 
> I think DIH is the wrong solution for this. If you do an external custom load 
> you will be probably much faster.
> 
> You have too much JVM memory from my point of view. Reduce it to eight or 
> similar.
> 
> It seems you are just exporting data, so you are better off using the export 
> handler.
> Add docvalues to the fields for this. It looks like you have no text field to 
> be searched but only simple fields (string, date etc).
> 
> You should not use the normal handler to return many results at once. If you 
> cannot use the Export handler then use cursors :
> 
> https://lucene.apache.org/solr/guide/8_4/pagination-of-results.html#using-cursors
> 
> Both work to sort large result sets without consuming the whole memory
> 
> On 05.06.2020 at 08:18, Srinivas Kashyap <srini...@bamboorose.com.invalid> wrote:
>> 
>> Thanks Shawn,
>> 
>> The filter queries are not complex. Below are the filter queries I’m running 
>> for the corresponding schema entry:
>> 
>> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000
>> &sort=MODIFY_TS desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
>> 
>> This was the original query. Since there were lot of sorting fields, we 
>> decided to not do on the solr side, instead fetch the query response and do 
>> the sorting outside solr. This eliminated the need of more JVM memory which 
>> was allocated. Every time we ran this query, solr would crash exceeding the 
>> JVM memory. Now we are only running filter queries.
>> 
>> And regarding the filter cache, it is in default setup: (we are using 
>> default solrconfig.xml, and we have only added the request handler for DIH)
>> 
>> <filterCache size="512" initialSize="512" autowarmCount="0"/>
>> 
>> Now that you’re aware of the size and numbers, can you please let me know 
>> what values/size that I need to increase? Is there an advantage of moving 
>> this single core to solr cloud? If yes, can you let us know, how many 
>> shards/replica do we require for this core considering we allow it to grow 
>> as users transact. The updates to this core is not thru DIH delta import 
>> rather, we are using SolrJ to push the changes.
>> 
>> 
>> <field name="..." type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
>> <field name="..." type="date" indexed="true" stored="true" omitTermFreqAndPositions="true" />
>> ...

Re: CDCR behaviour

2020-06-05 Thread Jason Gerlowski
Hi Daniel,

Just a heads up that attachments and images are stripped pretty
aggressively by the mailing list - none of your images made it through.
You might more success linking to the images in Dropbox or some other
online storage medium.

Best,

Jason

On Thu, Jun 4, 2020 at 10:55 AM Gell-Holleron, Daniel <
daniel.gell-holle...@gb.unisys.com> wrote:

> Hi,
>
>
>
> Looking for some advice; I sent a few questions on CDCR over the last couple of
> days.
>
>
>
> I just want to see if this is expected behavior from Solr or not?
>
>
>
> When a document is added to Site A, it is then supposed to replicate
> across, however in the statistics page I see the following:
>
>
>
> Site A
>
>
>
>
> Site B
>
>
>
>
>
> When I perform a search on Site B through the Solr admin page, I do get
> results (which I find strange). The only way to get the num docs values to
> match is to restart Solr; I then get the below:
>
>
>
>
>
> I just want to know whether this behavior is expected or is a bug? My
> expectation is that the data will always be current between the two sites.
>
>
>
> Thanks,
>
> Daniel
>
>
>


SolrResourceLoader is not closed when SolrCore is closed

2020-06-05 Thread Michael Aleythe, Sternwald
Hi everybody,

I'm currently using solr 8.3.1. While working on tests for a solr plugin I'm 
writing, I realized that the SolrResourceLoader used by a SolrCore is not 
closed when calling SolrCore.close().
This leads to open file handles. Can anybody explain this behavior?

Best Regards,

Michael Aleythe
Team --(sr)^(ch)--
Java Developer | STERNWALD SYSTEMS GMBH

Fon +49 351 31 40 6010
Fax +49 351 31 40 6001

E-Mail michael.aley...@sternwald.com
Skype michael.aley...@sternwald.com
Web www.sternwald.com

STERNWALD SYSTEMS GMBH
Pohlandstraße 19, 01309 Dresden, Germany
Managing Director Ard Meier
Register court: Commercial Register Dresden, HRB 33480
VAT ID DE157125091

SUPPORT / HOTLINE
Fon +49 173 38 54 752
E-Mail hotl...@sternwald.com
Web support.sternwald.net

STERNWALD Offices
Berlin | Dresden | Düsseldorf | Hamburg | Sofia | Würzburg



RE: Solr takes time to warm up core with huge data

2020-06-05 Thread Srinivas Kashyap
Hi Jörn,

I think you missed my explanation. We are not using sorting now:

The original query:

q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000
&sort=MODIFY_TS desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc

But now, I have removed sorting as shown below. The sorting is being done 
outside solr:

q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000

Also, we are writing custom code to index and discarding DIH. When I restart
Solr, this core with huge data takes time to even show up in the query admin
GUI console. It takes around 2 hours to appear.

My question is: even the simple query with the filter queries shown above
consumes JVM memory. How much memory, or what configuration in solrconfig.xml,
do I need to make it work?

Thanks,
Srinivas

From: Jörn Franke 
Sent: 05 June 2020 12:30
To: solr-user@lucene.apache.org
Subject: Re: Solr takes time to warm up core with huge data

I think DIH is the wrong solution for this. If you do an external custom load 
you will be probably much faster.

You have too much JVM memory from my point of view. Reduce it to eight or 
similar.

It seems you are just exporting data, so you are better off using the export 
handler.
Add docvalues to the fields for this. It looks like you have no text field to 
be searched but only simple fields (string, date etc).

You should not use the normal handler to return many results at once. If you 
cannot use the Export handler then use cursors :

https://lucene.apache.org/solr/guide/8_4/pagination-of-results.html#using-cursors

Both work to sort large result sets without consuming the whole memory

> On 05.06.2020 at 08:18, Srinivas Kashyap <srini...@bamboorose.com.invalid> wrote:
>
> Thanks Shawn,
>
> The filter queries are not complex. Below are the filter queries I’m running 
> for the corresponding schema entry:
>
> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000
> &sort=MODIFY_TS desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
>
> This was the original query. Since there were lot of sorting fields, we 
> decided to not do on the solr side, instead fetch the query response and do 
> the sorting outside solr. This eliminated the need of more JVM memory which 
> was allocated. Every time we ran this query, solr would crash exceeding the 
> JVM memory. Now we are only running filter queries.
>
> And regarding the filter cache, it is in default setup: (we are using default 
> solrconfig.xml, and we have only added the request handler for DIH)
>
> <filterCache size="512" initialSize="512" autowarmCount="0"/>
>
> Now that you’re aware of the size and numbers, can you please let me know 
> what values/size that I need to increase? Is there an advantage of moving 
> this single core to solr cloud? If yes, can you let us know, how many 
> shards/replica do we require for this core considering we allow it to grow as 
> users transact. The updates to this core is not thru DIH delta import rather, 
> we are using SolrJ to push the changes.
>
> 
> <field name="..." type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
> <field name="..." type="date" indexed="true" stored="true" omitTermFreqAndPositions="true" />
> ...
>
>
> Thanks,
> Srinivas
>
>
>
>> On 6/4/2020 9:51 PM, Srinivas Kashyap wrote:
>> We are on solr 8.4.1 and In standalone server mode. We have a core with 
>> 497,767,038 Records indexed. It took around 32Hours to load data through DIH.
>>
>> The disk occupancy is shown below:
>>
>> 82G /var/solr/data//data/index
>>
>> When I restarted solr instance and went to this core to query on solr admin 
>> GUI, it is hanging and is showing "Connection to Solr lost. Please check the 
>> Solr instance". But when I go back to dashboard, instance is up and I'm able 
>> to query other cores.
>>
>> Also, querying on this core is eating up JVM memory allocated(24GB)/(32GB 
>> RAM). A query(*:*) with filterqueries is overshooting the memory with OOM.
>
> You're going to want to have a lot more than 8GB available memory for
> disk caching with an 82GB index. That's a performance thing... with so
> little caching memory, Solr will be slow, but functional. That aspect
> of your setup 

Re: Solr takes time to warm up core with huge data

2020-06-05 Thread Jörn Franke
I think DIH is the wrong solution for this. If you do an external custom load 
you will be probably much faster.

You have too much JVM memory from my point of view. Reduce it to eight or 
similar.

It seems you are just exporting data, so you are better off using the export 
handler.
Add docvalues to the fields for this. It looks like you have no text field to 
be searched but only simple fields (string, date etc).

 You should not use the normal handler to return many results at once. If you 
cannot use the Export handler then use cursors :

https://lucene.apache.org/solr/guide/8_4/pagination-of-results.html#using-cursors

Both work to sort large result sets without consuming the whole memory
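
Roughly, a cursor fetch with SolrJ looks like this (a sketch only: "client" and
"coreName" are placeholders, the filter is taken from your mail, and it assumes
"id" is the uniqueKey):

    // imports: org.apache.solr.client.solrj.SolrQuery, org.apache.solr.client.solrj.SolrQuery.SortClause,
    //          org.apache.solr.client.solrj.response.QueryResponse,
    //          org.apache.solr.common.params.CursorMarkParams
    SolrQuery q = new SolrQuery("*:*");
    q.addFilterQuery("PARENT_DOC_ID:100");
    q.setRows(1000);
    // Cursors need a deterministic sort that ends on the uniqueKey field.
    q.setSort(SolrQuery.SortClause.desc("MODIFY_TS"));
    q.addSort(SolrQuery.SortClause.asc("id"));        // assumes "id" is the uniqueKey

    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    boolean done = false;
    while (!done) {
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
        QueryResponse rsp = client.query("coreName", q);   // "client" is an existing SolrClient
        // ... process rsp.getResults() here ...
        String next = rsp.getNextCursorMark();
        done = cursorMark.equals(next);
        cursorMark = next;
    }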

> On 05.06.2020 at 08:18, Srinivas Kashyap <srini...@bamboorose.com.invalid> wrote:
> 
> Thanks Shawn,
> 
> The filter queries are not complex. Below are the filter queries I’m running 
> for the corresponding schema entry:
> 
> q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000
> &sort=MODIFY_TS desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc
> 
> This was the original query. Since there were lot of sorting fields, we 
> decided to not do on the solr side, instead fetch the query response and do 
> the sorting outside solr. This eliminated the need of more JVM memory which 
> was allocated. Every time we ran this query, solr would crash exceeding the 
> JVM memory. Now we are only running filter queries.
> 
> And regarding the filter cache, it is in default setup: (we are using default 
> solrconfig.xml, and we have only added the request handler for DIH)
> 
> <filterCache size="512" initialSize="512" autowarmCount="0"/>
> 
> Now that you’re aware of the size and numbers, can you please let me know 
> what values/size that I need to increase? Is there an advantage of moving 
> this single core to solr cloud? If yes, can you let us know, how many 
> shards/replica do we require for this core considering we allow it to grow as 
> users transact. The updates to this core is not thru DIH delta import rather, 
> we are using SolrJ to push the changes.
> 
> 
> <field name="..." type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
> <field name="..." type="date" indexed="true" stored="true" omitTermFreqAndPositions="true" />
> ...
> 
> 
> Thanks,
> Srinivas
> 
> 
> 
>> On 6/4/2020 9:51 PM, Srinivas Kashyap wrote:
>> We are on solr 8.4.1 and In standalone server mode. We have a core with 
>> 497,767,038 Records indexed. It took around 32Hours to load data through DIH.
>> 
>> The disk occupancy is shown below:
>> 
>> 82G /var/solr/data//data/index
>> 
>> When I restarted solr instance and went to this core to query on solr admin 
>> GUI, it is hanging and is showing "Connection to Solr lost. Please check the 
>> Solr instance". But when I go back to dashboard, instance is up and I'm able 
>> to query other cores.
>> 
>> Also, querying on this core is eating up JVM memory allocated(24GB)/(32GB 
>> RAM). A query(*:*) with filterqueries is overshooting the memory with OOM.
> 
> You're going to want to have a lot more than 8GB available memory for
> disk caching with an 82GB index. That's a performance thing... with so
> little caching memory, Solr will be slow, but functional. That aspect
> of your setup will NOT lead to out of memory.
> 
> If you are experiencing Java "OutOfMemoryError" exceptions, you will
> need to figure out what resource is running out. It might be heap
> memory, but it also might 

RE: Solr takes time to warm up core with huge data

2020-06-05 Thread Srinivas Kashyap
Thanks Shawn,

The filter queries are not complex. Below are the filter queries I’m running 
for the corresponding schema entry:

q=*:*&fq=PARENT_DOC_ID:100&fq=MODIFY_TS:[1970-01-01T00:00:00Z TO *]&fq=PHY_KEY2:"HQ012206"&fq=PHY_KEY1:"JACK"&rows=1000
&sort=MODIFY_TS desc,LOGICAL_SECT_NAME asc,TRACK_ID desc,TRACK_INTER_ID asc,PHY_KEY1 asc,PHY_KEY2 asc,PHY_KEY3 asc,PHY_KEY4 asc,PHY_KEY5 asc,PHY_KEY6 asc,PHY_KEY7 asc,PHY_KEY8 asc,PHY_KEY9 asc,PHY_KEY10 asc,FIELD_NAME asc

This was the original query. Since there were a lot of sort fields, we decided
not to sort on the Solr side, and instead fetch the query response and do the
sorting outside Solr. This eliminated the need for the extra JVM memory that
had been allocated. Every time we ran the original query, Solr would crash by
exceeding the JVM memory. Now we are only running filter queries.

And regarding the filter cache, it is in default setup: (we are using default 
solrconfig.xml, and we have only added the request handler for DIH)

<filterCache size="512" initialSize="512" autowarmCount="0"/>

Now that you’re aware of the size and numbers, can you please let me know what
values/sizes I need to increase? Is there an advantage to moving this single
core to SolrCloud? If yes, can you let us know how many shards/replicas we
would require for this core, considering we allow it to grow as users transact?
The updates to this core are not through DIH delta import; rather, we are using
SolrJ to push the changes.

<field name="..." type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" />
<field name="..." type="date" indexed="true" stored="true" omitTermFreqAndPositions="true" />
...

Thanks,
Srinivas



On 6/4/2020 9:51 PM, Srinivas Kashyap wrote:
> We are on solr 8.4.1 and In standalone server mode. We have a core with 
> 497,767,038 Records indexed. It took around 32Hours to load data through DIH.
>
> The disk occupancy is shown below:
>
> 82G /var/solr/data//data/index
>
> When I restarted solr instance and went to this core to query on solr admin 
> GUI, it is hanging and is showing "Connection to Solr lost. Please check the 
> Solr instance". But when I go back to dashboard, instance is up and I'm able 
> to query other cores.
>
> Also, querying on this core is eating up JVM memory allocated(24GB)/(32GB 
> RAM). A query(*:*) with filterqueries is overshooting the memory with OOM.

You're going to want to have a lot more than 8GB available memory for
disk caching with an 82GB index. That's a performance thing... with so
little caching memory, Solr will be slow, but functional. That aspect
of your setup will NOT lead to out of memory.

If you are experiencing Java "OutOfMemoryError" exceptions, you will
need to figure out what resource is running out. It might be heap
memory, but it also might be that you're hitting the process/thread
limit of your operating system. And there are other possible causes for
that exception too. Do you have the text of the exception available?
It will be absolutely critical for you to determine what resource is
running out, or you might focus your efforts on the wrong thing.

If it's heap memory (something that I can't really assume), then Solr is
requiring more than the 24GB heap you've allocated.

Do you have faceting or grouping on those queries? Are any of your
filters really large or complex? These are the things that I would
imagine as requiring lots of heap memory.

What is the size of your filterCache? With about 500 million documents
in the core, each entry in the filterCache will consume nearly 60
megabytes of memory. If your filterCache has the default example size
of 512, and it actually gets that big, then that single cache will
require nearly 30 gigabytes of heap memory (on top of the other things
in Solr that require heap) ... and you only have 24GB. That could cause
OOME exceptions.

Does the server run things other than Solr?

Look here for some valuable info about performance and memory:

https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems

Thanks,
Shawn


RE: Autoscaling using SolrCloud8.5 on AWS EKS - issue with Node Added trigger

2020-06-05 Thread Mangla,Kirti
Hi,

I am working with solr 8.5.2 on AWS EKS.
Autoscaling trigger NodeAddedTrigger is not working and throwing Java error
Caused by: org.apache.solr.common.SolrException: 
org.apache.solr.common.SolrException: Error getting remote info

Looking for some help troubleshooting it.

Thanks,
Kirti Mangla
Software Engineer- Gartner Digital Markets - GetApp
Two Horizon Center, Golf Course Road, Gurgaon, India
Direct:  +91124-4795963

From: Mangla,Kirti
Sent: Wednesday, June 3, 2020 12:29 AM
To: solr-user@lucene.apache.org
Subject: Autoscaling using SolrCloud8.5 on AWS EKS - issue with Node Added 
trigger

Hi,

I have been trying to enable autoscaling on SolrCloud 8.5, with Node Added 
trigger and Node Lost trigger. The SolrCloud is running on AWS EKS pods, with 2 
nodes minimum.
I have added NodeAddedTrigger. My autoscaling API response looks like as in the 
attached file.

Whenever I scale up the SolrCloud replicas on EKS, new nodes are added to the 
cluster but the Node Added trigger throws below error:

org.apache.solr.common.SolrException: Unexpected 
exception while processing event: {
"id":"c889e6ef3b34eTcc9nazth0kbod28rj2zc84n0b",
"source":"node_added_trigger",
"eventTime":3527913768203086,
"eventType":"NODEADDED",
"properties":{
"eventTimes":[3527913768203086],
"preferredOperation":"addreplica",
"_enqueue_time_":3527918773192489,
"nodeNames":["solrcloud-2.solrcluster:8983_solr"],
"replicaType":"NRT"}}
at 
org.apache.solr.cloud.autoscaling.ComputePlanAction.process(ComputePlanAction.java:161)
at 
org.apache.solr.cloud.autoscaling.ScheduledTriggers.lambda$null$3(ScheduledTriggers.java:326)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown 
Source)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.solr.common.SolrException: 
org.apache.solr.common.SolrException: Error getting remote info
at 
org.apache.solr.common.cloud.rule.ImplicitSnitch.getTags(ImplicitSnitch.java:78)
at 
org.apache.solr.client.solrj.impl.SolrClientNodeStateProvider.fetchTagValues(SolrClientNodeStateProvider.java:139)
at 
org.apache.solr.client.solrj.impl.SolrClientNodeStateProvider.getNodeValues(SolrClientNodeStateProvider.java:128)
at 
org.apache.solr.client.solrj.cloud.autoscaling.Row.init(Row.java:71)
at 
org.apache.solr.client.solrj.cloud.autoscaling.Policy$Session.init(Policy.java:575)
at 
org.apache.solr.client.solrj.cloud.autoscaling.Policy.createSession(Policy.java:396)
at 
org.apache.solr.client.solrj.cloud.autoscaling.Policy.createSession(Policy.java:358)
at 
org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper$SessionRef.createSession(PolicyHelper.java:492)
at 
org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper$SessionRef.get(PolicyHelper.java:457)
at 
org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper.getSession(PolicyHelper.java:513)
at 
org.apache.solr.cloud.autoscaling.ComputePlanAction.process(ComputePlanAction.java:90)
... 7 more
Caused by: org.apache.solr.common.SolrException: Error 
getting remote info
at 
org.apache.solr.client.solrj.impl.SolrClientNodeStateProvider$AutoScalingSnitch.getRemoteInfo(SolrClientNodeStateProvider.java:364)
at 
org.apache.solr.common.cloud.rule.ImplicitSnitch.getTags(ImplicitSnitch.java:76)
... 17 more
Caused by: org.apache.solr.common.SolrException: Could not 
get remote info after many retries on NoHttpResponseException
at 
org.apache.solr.client.solrj.impl.SolrClientNodeStateProvider$AutoScalingSnitch.getRemoteInfo(SolrClientNodeStateProvider.java:335)
... 18 more
 

Looking for help on the subject.
Please let me know if you have any doubts.

Thanks,
Kirti Mangla



