Re: unified highlighter performance in solr 8.5.1

2020-07-03 Thread David Smiley
I doubt that WORD mode is impacted much by hl.fragsizeIsMinimum in terms of
quality of the highlight since there are vastly more breaks to pick from.
I think that setting is more useful in SENTENCE mode if you can stand the
perf hit.  If you agree, then why not just let this one default to "true"?

We agree on better documenting the perf trade-off.

Thanks again for working on these settings, BTW.

~ David


On Fri, Jul 3, 2020 at 1:25 PM Nándor Mátravölgyi wrote:

> Since the issue seems to be affecting the highlighter differently
> based on which mode it is using, having different defaults for the
> modes could be explored.
>
> WORD may have the new defaults as it has little effect on performance
> and it creates nicer highlights.
> SENTENCE should have the defaults that produce reasonable performance.
> The docs could document this while also mentioning that the UH's
> performance is highly dependent on the underlying Java String/Text
> Iterator.
>
> One can argue that having different defaults based on mode is
> confusing. In this case I think the defaults should be changed to have
> the SENTENCE mode perform better. Maybe the options for nice
> highlights with WORD mode could be put into the docs in this case as
> some form of an example.
>
> As long as I can use the UH with nicely aligned snippets in WORD mode
> I'm fine with any defaults. I explicitly set them in the config and in
> the queries most of the time anyways.
>


Re: Out of memory errors with Spatial indexing

2020-07-03 Thread David Smiley
Hi Sunil,

Your shape is at a pole, and I'm aware of a bug causing an exponential
explosion of needed grid squares when you have polygons super-close to the
pole.  Might you try S2PrefixTree instead?  I forget if this would fix it
or not by itself.  For indexing non-point data, I recommend
class="solr.RptWithGeometrySpatialField" which internally is based off a
combination of a course grid and storing the original vector geometry for
accurate verification:
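
A sketch of what that field type definition might look like (attribute values
here are illustrative, not tuned; prefixTree="s2" selects the S2PrefixTree
mentioned above, and JTS is needed for non-point WKT like the LINESTRING
below):

    <fieldType name="location_rpt_geom"
               class="solr.RptWithGeometrySpatialField"
               spatialContextFactory="JTS"
               prefixTree="s2" distErrPct="0.15"/>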

The internally coarser grid will lessen the impact of that pole bug.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, Jul 3, 2020 at 7:48 AM Sunil Varma  wrote:

> We are seeing OOM errors when trying to index some spatial data. I believe
> the data itself might not be valid but it shouldn't cause the Server to
> crash. We see this on both Solr 7.6 and Solr 8. Below is the input that is
> causing the error.
>
> {
> "id": "bad_data_1",
> "spatialwkt_srpt": "LINESTRING (-126.86037681029909 -90.0
> 1.000150474662E30, 73.58164711175415 -90.0 1.000150474662E30,
> 74.52836551959528 -90.0 1.000150474662E30, 74.97006811540834 -90.0
> 1.000150474662E30)"
> }
>
> The above dynamic field is mapped to field type "location_rpt" (
> solr.SpatialRecursivePrefixTreeFieldType).
>
>   Any pointers to get around this issue would be highly appreciated.
>
> Thanks!
>


Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-03 Thread Erick Erickson
Oops, I transposed that. If your index is a terabyte and your RAM is 128M, 
_that’s_ a red flag.

> On Jul 3, 2020, at 5:53 PM, Erick Erickson  wrote:
> 
> You haven’t said how many _shards_ are present. Nor how many replicas of the 
> collection you’re hosting per physical machine. Nor how large the indexes are 
> on disk. Those are the numbers that count. The latter is somewhat fuzzy, but 
> if your aggregate index size on a machine with, say, 128G of memory is a 
> terabyte, that’s a red flag.
> 
> Short form, though, is yes. Subject to the questions above, this is what I’d 
> be looking at first.
> 
> And, as I said, if you’ve been steadily increasing the total number of 
> documents, you’ll reach a tipping point sometime.
> 
> Best,
> Erick
> 
>> On Jul 3, 2020, at 5:32 PM, Mad have  wrote:
>> 
>> Hi Eric,
>> 
>> The collection has almost 13 billion documents, each around 5kb in size,
>> with around 150 columns, all of which are indexed. Do you think the number
>> of documents in the collection is causing this issue? Appreciate your response.
>> 
>> Regards,
>> Madhava 
>> 
>> Sent from my iPhone
>> 
>>> On 3 Jul 2020, at 12:42, Erick Erickson  wrote:
>>> 
>>> If you’re seeing low CPU utilization at the same time, you probably
>>> just have too much data on too little hardware. Check your
>>> swapping: how much of your I/O is just because Lucene can’t
>>> hold all the parts of the index it needs in memory at once? Lucene
>>> uses MMapDirectory to hold the index and you may well be
>>> swapping, see:
>>> 
>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>> 
>>> But my guess is that you’ve just reached a tipping point. You say:
>>> 
>>> "From last 2-3 weeks we have been noticing either slow indexing or timeout 
>>> errors while indexing”
>>> 
>>> So have you been continually adding more documents to your
>>> collections for more than the 2-3 weeks? If so you may have just
>>> put so much data on the same boxes that you’ve gone over
>>> the capacity of your hardware. As Toke says, adding physical
>>> memory for the OS to use to hold relevant parts of the index may
>>> alleviate the problem (again, refer to Uwe’s article for why).
>>> 
>>> All that said, if you’re going to keep adding documents you need to
>>> seriously think about adding new machines and moving some of
>>> your replicas to them.
>>> 
>>> Best,
>>> Erick
>>> 
 On Jul 3, 2020, at 7:14 AM, Toke Eskildsen  wrote:
 
> On Thu, 2020-07-02 at 11:16 +0000, Kommu, Vinodh K. wrote:
> We are performing QA performance testing on a couple of collections
> which hold 2 billion and 3.5 billion docs respectively.
 
 How many shards?
 
> 1.  Our performance team noticed that read operations greatly outnumber
> write operations, around a 100:1 ratio; is this expected during
> indexing, or are the solr nodes doing any other operations like syncing?
 
 Are you saying that there are 100 times more read operations when you
 are indexing? That does not sound too unrealistic as the disk cache
 might be filled with the data that the writers are flushing.
 
 In that case, more RAM would help. Okay, more RAM nearly always helps,
 but such massive difference in IO-utilization does indicate that you
 are starved for cache.
 
 I noticed you have at least 18 replicas. That's a lot. Just to sanity
 check: How many replicas are each physical box handling? If they are
 sharing resources, fewer replicas would probably be better.
 
> 3.  Our client timeout is set to 2 mins; can we increase it further?
> Would that help or create any other problems?
 
 It does not hurt the server to increase the client timeout as the
 initiated query will keep running until it is finished, independent of
 whether or not there is a client to receive the result.
 
 If you want a better max time for query processing, you should look at 
 
 https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
 but due to its inherent limitations it might not help in your
 situation.
 
> 4.  When we created an empty collection and loaded the same data file,
> it loaded fine without any issues; so would having more documents in a
> collection create such problems?
 
 Solr 7 does have a problem with sparse DocValues and many documents,
 leading to excessive IO-activity, which might be what you are seeing. I
 can see from an earlier post that you were using streaming expressions
 for another collection: This is one of the things that are affected by
 the Solr 7 DocValues issue.
 
 More info about DocValues and streaming:
 https://issues.apache.org/jira/browse/SOLR-13013
 
 Fairly in-depth info on the problem with Solr 7 docValues:
 https://issues.apache.org/jira/browse/LUCENE-8374
 
 If this is your problem, upgrading to Solr 8 and indexing the
 collection from scratch should fix it.

Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-03 Thread Erick Erickson
You haven’t said how many _shards_ are present. Nor how many replicas of the 
collection you’re hosting per physical machine. Nor how large the indexes are 
on disk. Those are the numbers that count. The latter is somewhat fuzzy, but if 
your aggregate index size on a machine with, say, 128G of memory is a terabyte, 
that’s a red flag.

Short form, though, is yes. Subject to the questions above, this is what I’d be 
looking at first.

And, as I said, if you’ve been steadily increasing the total number of 
documents, you’ll reach a tipping point sometime.

Best,
Erick

> On Jul 3, 2020, at 5:32 PM, Mad have  wrote:
> 
> Hi Eric,
> 
> The collection has almost 13 billion documents, each around 5kb in size,
> with around 150 columns, all of which are indexed. Do you think the number of
> documents in the collection is causing this issue? Appreciate your response.
> 
> Regards,
> Madhava 
> 
> Sent from my iPhone
> 
>> On 3 Jul 2020, at 12:42, Erick Erickson  wrote:
>> 
>> If you’re seeing low CPU utilization at the same time, you probably
>> just have too much data on too little hardware. Check your
>> swapping: how much of your I/O is just because Lucene can’t
>> hold all the parts of the index it needs in memory at once? Lucene
>> uses MMapDirectory to hold the index and you may well be
>> swapping, see:
>> 
>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>> 
>> But my guess is that you’ve just reached a tipping point. You say:
>> 
>> "From last 2-3 weeks we have been noticing either slow indexing or timeout 
>> errors while indexing”
>> 
>> So have you been continually adding more documents to your
>> collections for more than the 2-3 weeks? If so you may have just
>> put so much data on the same boxes that you’ve gone over
>> the capacity of your hardware. As Toke says, adding physical
>> memory for the OS to use to hold relevant parts of the index may
>> alleviate the problem (again, refer to Uwe’s article for why).
>> 
>> All that said, if you’re going to keep adding documents you need to
>> seriously think about adding new machines and moving some of
>> your replicas to them.
>> 
>> Best,
>> Erick
>> 
>>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen  wrote:
>>> 
 On Thu, 2020-07-02 at 11:16 +0000, Kommu, Vinodh K. wrote:
 We are performing QA performance testing on a couple of collections
 which hold 2 billion and 3.5 billion docs respectively.
>>> 
>>> How many shards?
>>> 
 1.  Our performance team noticed that read operations greatly outnumber
 write operations, around a 100:1 ratio; is this expected during
 indexing, or are the solr nodes doing any other operations like syncing?
>>> 
>>> Are you saying that there are 100 times more read operations when you
>>> are indexing? That does not sound too unrealistic as the disk cache
>>> might be filled with the data that the writers are flushing.
>>> 
>>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>>> but such massive difference in IO-utilization does indicate that you
>>> are starved for cache.
>>> 
>>> I noticed you have at least 18 replicas. That's a lot. Just to sanity
>>> check: How many replicas are each physical box handling? If they are
>>> sharing resources, fewer replicas would probably be better.
>>> 
 3.  Our client timeout is set to 2 mins; can we increase it further?
 Would that help or create any other problems?
>>> 
>>> It does not hurt the server to increase the client timeout as the
>>> initiated query will keep running until it is finished, independent of
>>> whether or not there is a client to receive the result.
>>> 
>>> If you want a better max time for query processing, you should look at 
>>> 
>>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
>>> but due to its inherent limitations it might not help in your
>>> situation.
>>> 
 4.  When we created an empty collection and loaded the same data file,
 it loaded fine without any issues; so would having more documents in a
 collection create such problems?
>>> 
>>> Solr 7 does have a problem with sparse DocValues and many documents,
>>> leading to excessive IO-activity, which might be what you are seeing. I
>>> can see from an earlier post that you were using streaming expressions
>>> for another collection: This is one of the things that are affected by
>>> the Solr 7 DocValues issue.
>>> 
>>> More info about DocValues and streaming:
>>> https://issues.apache.org/jira/browse/SOLR-13013
>>> 
>>> Fairly in-depth info on the problem with Solr 7 docValues:
>>> https://issues.apache.org/jira/browse/LUCENE-8374
>>> 
>>> If this is your problem, upgrading to Solr 8 and indexing the
>>> collection from scratch should fix it. 
>>> 
>>> Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
>>> or you can ensure that there are values defined for all DocValues-
>>> fields in all your documents.
>>> 
 java.net.SocketTimeoutException: Read timed out

Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-03 Thread Mad have
Hi Eric,

The collection has almost 13 billion documents, each around 5kb in size, with
around 150 columns, all of which are indexed. Do you think the number of
documents in the collection is causing this issue? Appreciate your response.

Regards,
Madhava 

Sent from my iPhone

> On 3 Jul 2020, at 12:42, Erick Erickson  wrote:
> 
> If you’re seeing low CPU utilization at the same time, you probably
> just have too much data on too little hardware. Check your
> swapping: how much of your I/O is just because Lucene can’t
> hold all the parts of the index it needs in memory at once? Lucene
> uses MMapDirectory to hold the index and you may well be
> swapping, see:
> 
> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> 
> But my guess is that you’ve just reached a tipping point. You say:
> 
> "From last 2-3 weeks we have been noticing either slow indexing or timeout 
> errors while indexing”
> 
> So have you been continually adding more documents to your
> collections for more than the 2-3 weeks? If so you may have just
> put so much data on the same boxes that you’ve gone over
> the capacity of your hardware. As Toke says, adding physical
> memory for the OS to use to hold relevant parts of the index may
> alleviate the problem (again, refer to Uwe’s article for why).
> 
> All that said, if you’re going to keep adding documents you need to
> seriously think about adding new machines and moving some of
> your replicas to them.
> 
> Best,
> Erick
> 
>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen  wrote:
>> 
>>> On Thu, 2020-07-02 at 11:16 +0000, Kommu, Vinodh K. wrote:
>>> We are performing QA performance testing on a couple of collections
>>> which hold 2 billion and 3.5 billion docs respectively.
>> 
>> How many shards?
>> 
>>> 1.  Our performance team noticed that read operations greatly outnumber
>>> write operations, around a 100:1 ratio; is this expected during
>>> indexing, or are the solr nodes doing any other operations like syncing?
>> 
>> Are you saying that there are 100 times more read operations when you
>> are indexing? That does not sound too unrealistic as the disk cache
>> might be filled with the data that the writers are flushing.
>> 
>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>> but such massive difference in IO-utilization does indicate that you
>> are starved for cache.
>> 
>> I noticed you have at least 18 replicas. That's a lot. Just to sanity
>> check: How many replicas are each physical box handling? If they are
>> sharing resources, fewer replicas would probably be better.
>> 
>>> 3.  Our client timeout is set to 2 mins; can we increase it further?
>>> Would that help or create any other problems?
>> 
>> It does not hurt the server to increase the client timeout as the
>> initiated query will keep running until it is finished, independent of
>> whether or not there is a client to receive the result.
>> 
>> If you want a better max time for query processing, you should look at 
>> 
>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
>> but due to its inherent limitations it might not help in your
>> situation.
>> 
>>> 4.  When we created an empty collection and loaded the same data file,
>>> it loaded fine without any issues; so would having more documents in a
>>> collection create such problems?
>> 
>> Solr 7 does have a problem with sparse DocValues and many documents,
>> leading to excessive IO-activity, which might be what you are seeing. I
>> can see from an earlier post that you were using streaming expressions
>> for another collection: This is one of the things that are affected by
>> the Solr 7 DocValues issue.
>> 
>> More info about DocValues and streaming:
>> https://issues.apache.org/jira/browse/SOLR-13013
>> 
>> Fairly in-depth info on the problem with Solr 7 docValues:
>> https://issues.apache.org/jira/browse/LUCENE-8374
>> 
>> If this is your problem, upgrading to Solr 8 and indexing the
>> collection from scratch should fix it. 
>> 
>> Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
>> or you can ensure that there are values defined for all DocValues-
>> fields in all your documents.
>> 
>>> java.net.SocketTimeoutException: Read timed out
>>>   at java.net.SocketInputStream.socketRead0(Native Method) 
>> ...
>>> Remote error message: java.util.concurrent.TimeoutException: Idle
>>> timeout expired: 600000/600000 ms
>> 
>> There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
>> should be able to change it in solr.xml.
>> https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
>> 
>> BUT if an update takes > 10 minutes to be processed, it indicates that
>> the cluster is overloaded.  Increasing the timeout is just a band-aid.
>> 
>> - Toke Eskildsen, Royal Danish Library
>> 
>> 
> 


Re: unified highlighter performance in solr 8.5.1

2020-07-03 Thread Nándor Mátravölgyi
Since the issue seems to be affecting the highlighter differently
based on which mode it is using, having different defaults for the
modes could be explored.

WORD may have the new defaults as it has little effect on performance
and it creates nicer highlights.
SENTENCE should have the defaults that produce reasonable performance.
The docs could document this while also mentioning that the UH's
performance is highly dependent on the underlying Java String/Text
Iterator.

One can argue that having different defaults based on mode is
confusing. In this case I think the defaults should be changed to have
the SENTENCE mode perform better. Maybe the options for nice
highlights with WORD mode could be put into the docs in this case as
some form of an example.

As long as I can use the UH with nicely aligned snippets in WORD mode
I'm fine with any defaults. I explicitly set them in the config and in
the queries most of the time anyways.


Re: unified highlighter performance in solr 8.5.1

2020-07-03 Thread David Smiley
I think we should flip the default of hl.fragsizeIsMinimum to be 'true',
thus have the behavior close to what preceded 8.5.
(a) it was the behavior until very recently (<= 8.4) and so may require
less tuning for users from 8.6 onward
(b) it's significantly faster for long text -- seems to be 2x to 5x for
long documents (assuming no change in hl.fragAlignRatio).  If the user
additionally configures hl.fragAlignRatio to 0 (also the previous behavior;
0.5 is the new default), I saw another 6x on top of that for "doc3" in the
test data Michal prepared.

Although I like that the sizing looks nicer, I think that is more from the
introduction and new default of hl.fragAlignRatio=0.5 than it is
hl.fragsizeIsMinimum=false.  We might even consider lowering
hl.fragAlignRatio to say 0.3 and retain pretty reasonable highlights
(avoids the extreme cases occurring with '0') and additional performance
benefit from that.
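
For illustration, a request exercising the settings being discussed might look
like this (query, field and fragsize here are made up; the highlighting
parameter names are the ones shipped in 8.5):

    q=text:solr&hl=true&hl.method=unified&hl.bs.type=SENTENCE
      &hl.fragsize=100&hl.fragsizeIsMinimum=true&hl.fragAlignRatio=0.3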

What do you think Nandor, Michal?

I'm hoping a change in settings (+ some better notes/docs on this) could
slip into an 8.6, all done by myself ASAP.

~ David


On Fri, Jun 19, 2020 at 2:32 PM Nándor Mátravölgyi wrote:

> Hi!
>
> With the provided test I've profiled the preceding() and following()
> calls on the base Java iterators in the different options.
>
> === default highlighter arguments ===
> Calling the test query with SENTENCE base iterator:
> - from LengthGoalBreakIterator.following(): 1130 calls of
> baseIter.preceding() took 1.039629 seconds in total
> - from LengthGoalBreakIterator.following(): 1140 calls of
> baseIter.following() took 0.340679 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1150 calls of
> baseIter.preceding() took 0.099344 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1100 calls of
> baseIter.following() took 0.015156 seconds in total
>
> Calling the test query with WORD base iterator:
> - from LengthGoalBreakIterator.following(): 1200 calls of
> baseIter.preceding() took 0.001006 seconds in total
> - from LengthGoalBreakIterator.following(): 1700 calls of
> baseIter.following() took 0.006278 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1710 calls of
> baseIter.preceding() took 0.016320 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1090 calls of
> baseIter.following() took 0.000527 seconds in total
>
> === hl.fragsizeIsMinimum=true&hl.fragAlignRatio=0 ===
> Calling the test query with SENTENCE base iterator:
> - from LengthGoalBreakIterator.following(): 860 calls of
> baseIter.following() took 0.012593 seconds in total
> - from LengthGoalBreakIterator.preceding(): 870 calls of
> baseIter.preceding() took 0.022208 seconds in total
>
> Calling the test query with WORD base iterator:
> - from LengthGoalBreakIterator.following(): 1360 calls of
> baseIter.following() took 0.004789 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1370 calls of
> baseIter.preceding() took 0.015983 seconds in total
>
> === hl.fragsizeIsMinimum=true ===
> Calling the test query with SENTENCE base iterator:
> - from LengthGoalBreakIterator.following(): 980 calls of
> baseIter.following() took 0.010253 seconds in total
> - from LengthGoalBreakIterator.preceding(): 980 calls of
> baseIter.preceding() took 0.341997 seconds in total
>
> Calling the test query with WORD base iterator:
> - from LengthGoalBreakIterator.following(): 1670 calls of
> baseIter.following() took 0.005150 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1680 calls of
> baseIter.preceding() took 0.013657 seconds in total
>
> === hl.fragAlignRatio=0 ===
> Calling the test query with SENTENCE base iterator:
> - from LengthGoalBreakIterator.following(): 1070 calls of
> baseIter.preceding() took 1.312056 seconds in total
> - from LengthGoalBreakIterator.following(): 1080 calls of
> baseIter.following() took 0.678575 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1080 calls of
> baseIter.preceding() took 0.020507 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1080 calls of
> baseIter.following() took 0.006977 seconds in total
>
> Calling the test query with WORD base iterator:
> - from LengthGoalBreakIterator.following(): 880 calls of
> baseIter.preceding() took 0.000706 seconds in total
> - from LengthGoalBreakIterator.following(): 1370 calls of
> baseIter.following() took 0.004110 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1380 calls of
> baseIter.preceding() took 0.014752 seconds in total
> - from LengthGoalBreakIterator.preceding(): 1380 calls of
> baseIter.following() took 0.000106 seconds in total
>
> There is definitely a big difference between SENTENCE and WORD. I'm
> not sure how we can improve the logic on our side while keeping the
> features as is. Since the number of calls is roughly the same for when
> the performance is good and bad, it seems to depend on what the text
> is that the iterator is traversing.
>
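
For reference, the baseIter being timed above is (as far as I can tell) a plain
java.text.BreakIterator, so the SENTENCE/WORD gap can be reproduced outside
Solr with a sketch like this (the text and offset are arbitrary placeholders):

    import java.text.BreakIterator;
    import java.util.Locale;

    public class BreakIterProbe {
        public static void main(String[] args) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10000; i++) {
                sb.append("Some document text. With many short sentences. ");
            }
            String text = sb.toString();
            // SENTENCE iterator: preceding() dominates the slow profiles above
            BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
            sentences.setText(text);
            System.out.println(sentences.preceding(text.length() / 2));
            // WORD iterator: the same call is far cheaper
            BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
            words.setText(text);
            System.out.println(words.preceding(text.length() / 2));
        }
    }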


Re: Solr Float/Double multivalues fields

2020-07-03 Thread Thomas Corthals
On Fri, Jul 3, 2020 at 14:11, Bram Van Dam wrote:

> On 03/07/2020 09:50, Thomas Corthals wrote:
> > I think this should go in the ref guide. If your product depends on this
> > behaviour, you want reassurance that it isn't going to change in the next
> > release. Not everyone will go looking through the javadoc to see if this
> is
> > implied.
>
> This is in the ref guide. Section DocValues. Here's the quote:
>
> DocValues are only available for specific field types. The types chosen
> determine the underlying Lucene
> docValue type that will be used. The available Solr field types are:
> • StrField, and UUIDField:
> ◦ If the field is single-valued (i.e., multi-valued is false), Lucene
> will use the SORTED type.
> ◦ If the field is multi-valued, Lucene will use the SORTED_SET type.
> Entries are kept in sorted order and
> duplicates are removed.
> • BoolField:
> ◦ If the field is single-valued (i.e., multi-valued is false), Lucene
> will use the SORTED type.
> ◦ If the field is multi-valued, Lucene will use the SORTED_SET type.
> Entries are kept in sorted order and
> duplicates are removed.
> • Any *PointField Numeric or Date fields, EnumFieldType, and
> CurrencyFieldType:
> ◦ If the field is single-valued (i.e., multi-valued is false), Lucene
> will use the NUMERIC type.
> ◦ If the field is multi-valued, Lucene will use the SORTED_NUMERIC type.
> Entries are kept in sorted order
> and duplicates are kept.
> • Any of the deprecated Trie* Numeric or Date fields, EnumField and
> CurrencyField:
> ◦ If the field is single-valued (i.e., multi-valued is false), Lucene
> will use the NUMERIC type.
> ◦ If the field is multi-valued, Lucene will use the SORTED_SET type.
> Entries are kept in sorted order and
> duplicates are removed.
> These Lucene types are related to how the values are sorted and stored.
>

Great for docValues. But I couldn't find anything similar for multiValued
in the field type pages of the ref guide (unless I totally missed it
of course). It doesn't have to be as elaborate, as long as it's clear and
doesn't leave users wondering or assuming.


Re: Solr Float/Double multivalues fields

2020-07-03 Thread Bram Van Dam
On 03/07/2020 09:50, Thomas Corthals wrote:
> I think this should go in the ref guide. If your product depends on this
> behaviour, you want reassurance that it isn't going to change in the next
> release. Not everyone will go looking through the javadoc to see if this is
> implied.

This is in the ref guide. Section DocValues. Here's the quote:

DocValues are only available for specific field types. The types chosen
determine the underlying Lucene
docValue type that will be used. The available Solr field types are:
• StrField, and UUIDField:
◦ If the field is single-valued (i.e., multi-valued is false), Lucene
will use the SORTED type.
◦ If the field is multi-valued, Lucene will use the SORTED_SET type.
Entries are kept in sorted order and
duplicates are removed.
• BoolField:
◦ If the field is single-valued (i.e., multi-valued is false), Lucene
will use the SORTED type.
◦ If the field is multi-valued, Lucene will use the SORTED_SET type.
Entries are kept in sorted order and
duplicates are removed.
• Any *PointField Numeric or Date fields, EnumFieldType, and
CurrencyFieldType:
◦ If the field is single-valued (i.e., multi-valued is false), Lucene
will use the NUMERIC type.
◦ If the field is multi-valued, Lucene will use the SORTED_NUMERIC type.
Entries are kept in sorted order
and duplicates are kept.
• Any of the deprecated Trie* Numeric or Date fields, EnumField and
CurrencyField:
◦ If the field is single-valued (i.e., multi-valued is false), Lucene
will use the NUMERIC type.
◦ If the field is multi-valued, Lucene will use the SORTED_SET type.
Entries are kept in sorted order and
duplicates are removed.
These Lucene types are related to how the values are sorted and stored.





Re: Adding solr-core via maven fails

2020-07-03 Thread Erick Erickson
If you feel strongly that Solr needs to keep the Maven bits
up to date, you can volunteer to help maintain them; Solr is
open source after all.
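
(For a Maven pom.xml, the equivalent of the SBT resolver Ali added below would
be roughly the following; the repository id is arbitrary:)

    <repositories>
      <repository>
        <id>spring-plugins</id>
        <url>https://repo.spring.io/plugins-release/</url>
      </repository>
    </repositories>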


> On Jul 3, 2020, at 12:08 AM, Ali Akhtar  wrote:
> 
> I had to add an additional repository to get the failing dependency to
> resolve:
> 
> resolvers += "Spring Plugins Repository" at "
> https://repo.spring.io/plugins-release/"
> 
>> However, we do not officially support Maven builds,
> 
> Um, why? This is a java based project, and maven is the de-facto standard
> for Java. What if someone wanted to make use of any of Solr's java
> libraries in their own JVM based project? There's no (clean) way to do it
> other than adding it as a maven dependency and importing the class into
> their code.
> 
> 
> On Thu, Jul 2, 2020 at 6:07 PM Mike Drob  wrote:
> 
>> Does it fail similarly on 8.5.0 and .1?
>> 
>> On Thu, Jul 2, 2020 at 6:38 AM Erick Erickson wrote:
>> 
>>> There have been some issues with Maven, see:
>>> https://issues.apache.org/jira/browse/LUCENE-9170
>>> 
>>> However, we do not officially support Maven builds, they’re there as a
>>> convenience, so there may still
>>> be issues in the future.
>>> 
 On Jul 2, 2020, at 1:27 AM, Ali Akhtar  wrote:
 
 If I try adding solr-core to an existing project, e.g (SBT):
 
 libraryDependencies += "org.apache.solr" % "solr-core" % "8.5.2"
 
 It fails due a 404 on the dependencies:
 
 Extracting structure failed
 stack trace is suppressed; run last update for the full output
 stack trace is suppressed; run last ssExtractDependencies for the full
 output
 (update) sbt.librarymanagement.ResolveException: Error downloading
 org.restlet.jee:org.restlet:2.4.0
 Not found
 Not found
 not found:
 /home/ali/.ivy2/local/org.restlet.jee/org.restlet/2.4.0/ivys/ivy.xml
 not found:
 
>>> 
>> https://repo1.maven.org/maven2/org/restlet/jee/org.restlet/2.4.0/org.restlet-2.4.0.pom
 Error downloading org.restlet.jee:org.restlet.ext.servlet:2.4.0
 Not found
 Not found
 not found:
 
>>> 
>> /home/ali/.ivy2/local/org.restlet.jee/org.restlet.ext.servlet/2.4.0/ivys/ivy.xml
 not found:
 
>>> 
>> https://repo1.maven.org/maven2/org/restlet/jee/org.restlet.ext.servlet/2.4.0/org.restlet.ext.servlet-2.4.0.pom
 (ssExtractDependencies) sbt.librarymanagement.ResolveException: Error
 downloading org.restlet.jee:org.restlet:2.4.0
 Not found
 Not found
 not found:
 /home/ali/.ivy2/local/org.restlet.jee/org.restlet/2.4.0/ivys/ivy.xml
 not found:
 
>>> 
>> https://repo1.maven.org/maven2/org/restlet/jee/org.restlet/2.4.0/org.restlet-2.4.0.pom
 Error downloading org.restlet.jee:org.restlet.ext.servlet:2.4.0
 Not found
 Not found
 not found:
 
>>> 
>> /home/ali/.ivy2/local/org.restlet.jee/org.restlet.ext.servlet/2.4.0/ivys/ivy.xml
 not found:
 
>>> 
>> https://repo1.maven.org/maven2/org/restlet/jee/org.restlet.ext.servlet/2.4.0/org.restlet.ext.servlet-2.4.0.pom
 
 
 
 Any ideas? Do I need to add a specific repository to get it to compile?
>>> 
>>> 
>> 



Out of memory errors with Spatial indexing

2020-07-03 Thread Sunil Varma
We are seeing OOM errors when trying to index some spatial data. I believe
the data itself might not be valid but it shouldn't cause the Server to
crash. We see this on both Solr 7.6 and Solr 8. Below is the input that is
causing the error.

{
"id": "bad_data_1",
"spatialwkt_srpt": "LINESTRING (-126.86037681029909 -90.0
1.000150474662E30, 73.58164711175415 -90.0 1.000150474662E30,
74.52836551959528 -90.0 1.000150474662E30, 74.97006811540834 -90.0
1.000150474662E30)"
}

The above dynamic field is mapped to field type "location_rpt" (
solr.SpatialRecursivePrefixTreeFieldType).

  Any pointers to get around this issue would be highly appreciated.

Thanks!


Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-03 Thread Erick Erickson
If you’re seeing low CPU utilization at the same time, you probably
just have too much data on too little hardware. Check your
swapping: how much of your I/O is just because Lucene can’t
hold all the parts of the index it needs in memory at once? Lucene
uses MMapDirectory to hold the index and you may well be
swapping, see:

https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

But my guess is that you’ve just reached a tipping point. You say:

"From last 2-3 weeks we have been noticing either slow indexing or timeout 
errors while indexing”

So have you been continually adding more documents to your
collections for more than the 2-3 weeks? If so you may have just
put so much data on the same boxes that you’ve gone over
the capacity of your hardware. As Toke says, adding physical
memory for the OS to use to hold relevant parts of the index may
alleviate the problem (again, refer to Uwe’s article for why).

All that said, if you’re going to keep adding documents you need to
seriously think about adding new machines and moving some of
your replicas to them.

Best,
Erick

> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen  wrote:
> 
> On Thu, 2020-07-02 at 11:16 +0000, Kommu, Vinodh K. wrote:
>> We are performing QA performance testing on a couple of collections
>> which hold 2 billion and 3.5 billion docs respectively.
> 
> How many shards?
> 
>>  1.  Our performance team noticed that read operations greatly outnumber
>> write operations, around a 100:1 ratio; is this expected during
>> indexing, or are the solr nodes doing any other operations like syncing?
> 
> Are you saying that there are 100 times more read operations when you
> are indexing? That does not sound too unrealistic as the disk cache
> might be filled with the data that the writers are flushing.
> 
> In that case, more RAM would help. Okay, more RAM nearly always helps,
> but such massive difference in IO-utilization does indicate that you
> are starved for cache.
> 
> I noticed you have at least 18 replicas. That's a lot. Just to sanity
> check: How many replicas are each physical box handling? If they are
> sharing resources, fewer replicas would probably be better.
> 
>>  3.  Our client timeout is set to 2 mins; can we increase it further?
>> Would that help or create any other problems?
> 
> It does not hurt the server to increase the client timeout as the
> initiated query will keep running until it is finished, independent of
> whether or not there is a client to receive the result.
> 
> If you want a better max time for query processing, you should look at 
> 
> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
> but due to its inherent limitations it might not help in your
> situation.
> 
>>  4.  When we created an empty collection and loaded the same data file,
>> it loaded fine without any issues; so would having more documents in a
>> collection create such problems?
> 
> Solr 7 does have a problem with sparse DocValues and many documents,
> leading to excessive IO-activity, which might be what you are seeing. I
> can see from an earlier post that you were using streaming expressions
> for another collection: This is one of the things that are affected by
> the Solr 7 DocValues issue.
> 
> More info about DocValues and streaming:
> https://issues.apache.org/jira/browse/SOLR-13013
> 
> Fairly in-depth info on the problem with Solr 7 docValues:
> https://issues.apache.org/jira/browse/LUCENE-8374
> 
> If this is your problem, upgrading to Solr 8 and indexing the
> collection from scratch should fix it. 
> 
> Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
> or you can ensure that there are values defined for all DocValues-
> fields in all your documents.
> 
>> java.net.SocketTimeoutException: Read timed out
>>at java.net.SocketInputStream.socketRead0(Native Method) 
> ...
>> Remote error message: java.util.concurrent.TimeoutException: Idle
>> timeout expired: 600000/600000 ms
> 
> There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
> should be able to change it in solr.xml.
> https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
> 
> BUT if an update takes > 10 minutes to be processed, it indicates that
> the cluster is overloaded.  Increasing the timeout is just a band-aid.
> 
> - Toke Eskildsen, Royal Danish Library
> 
> 



Re: Solr Float/Double multivalues fields

2020-07-03 Thread Toke Eskildsen
On Fri, 2020-07-03 at 10:00 +0200, Vincenzo D'Amore wrote:
> Hi Erick, I'm not sure I got it.
> Does this mean that the order of values within a multivalued field:
> - docValues=true: the result will be both re-ordered and deduplicated.
> - docValues=false: the result order is guaranteed to match the
> insertion order.
> 
> Is this correct?

Sorta, but it is not the complete picture. Things get complicated when
you mix it with stored, so that you have "stored=true docValues=true".
There's an article about that at

https://sease.io/2020/03/docvalues-vs-stored-fields-apache-solr-features-and-performance-smackdown.html

BTW: The documentation should definitely mention that stored preserves
order & duplicates. It is not obvious.
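
As a quick illustration (field name made up; "pfloat" assumed to map to
solr.FloatPointField as in the default configsets):

    <field name="prices" type="pfloat" indexed="true" stored="true"
           docValues="true" multiValued="true"/>

Index the values [3.2, 1.5, 3.2]: reading the field from stored values returns
[3.2, 1.5, 3.2] (insertion order, duplicates kept), while the docValues view is
SORTED_NUMERIC, i.e. [1.5, 3.2, 3.2].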

- Toke Eskildsen, Royal Danish Library




Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-03 Thread Toke Eskildsen
On Thu, 2020-07-02 at 11:16 +0000, Kommu, Vinodh K. wrote:
> We are performing QA performance testing on a couple of collections
> which hold 2 billion and 3.5 billion docs respectively.

How many shards?

>   1.  Our performance team noticed that read operations greatly outnumber
> write operations, around a 100:1 ratio; is this expected during
> indexing, or are the solr nodes doing any other operations like syncing?

Are you saying that there are 100 times more read operations when you
are indexing? That does not sound too unrealistic as the disk cache
might be filled with the data that the writers are flushing.

In that case, more RAM would help. Okay, more RAM nearly always helps,
but such massive difference in IO-utilization does indicate that you
are starved for cache.

I noticed you have at least 18 replicas. That's a lot. Just to sanity
check: How many replicas are each physical box handling? If they are
sharing resources, fewer replicas would probably be better.

>   3.  Our client timeout is set to 2 mins; can we increase it further?
> Would that help or create any other problems?

It does not hurt the server to increase the client timeout as the
initiated query will keep running until it is finished, independent of
whether or not there is a client to receive the result.

If you want a better max time for query processing, you should look at 

https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
 but due to its inherent limitations it might not help in your
situation.
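
For example (the value is in milliseconds, mirroring your 2-minute client
timeout; note that it can return partial results when the limit is reached):

    q=yourquery&timeAllowed=120000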

>   4.  When we created an empty collection and loaded the same data file,
> it loaded fine without any issues; so would having more documents in a
> collection create such problems?

Solr 7 does have a problem with sparse DocValues and many documents,
leading to excessive IO-activity, which might be what you are seeing. I
can see from an earlier post that you were using streaming expressions
for another collection: This is one of the things that are affected by
the Solr 7 DocValues issue.

More info about DocValues and streaming:
https://issues.apache.org/jira/browse/SOLR-13013

Fairly in-depth info on the problem with Solr 7 docValues:
https://issues.apache.org/jira/browse/LUCENE-8374

If this is your problem, upgrading to Solr 8 and indexing the
collection from scratch should fix it. 

Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
or you can ensure that there are values defined for all DocValues-
fields in all your documents.

> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method) 
...
> Remote error message: java.util.concurrent.TimeoutException: Idle
> timeout expired: 600000/600000 ms

There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
should be able to change it in solr.xml.
https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
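
If you do raise it, the setting goes in the <solrcloud> section of solr.xml,
along these lines (assuming distribUpdateSoTimeout is indeed the one in play;
the value is in milliseconds):

    <solr>
      <solrcloud>
        <int name="distribUpdateSoTimeout">1200000</int>
      </solrcloud>
    </solr>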

BUT if an update takes > 10 minutes to be processed, it indicates that
the cluster is overloaded.  Increasing the timeout is just a band-aid.

- Toke Eskildsen, Royal Danish Library




Re: ***URGENT***Re: Questions about Solr Search

2020-07-03 Thread Dave
Seriously. Doug answered all of your questions. 

> On Jul 3, 2020, at 6:12 AM, Atri Sharma  wrote:
> 
> Please do not cross post. I believe your questions were already answered?
> 
>> On Fri, Jul 3, 2020 at 3:08 PM Gautam K  wrote:
>> 
>> Since it's a bit of an urgent request, if you could please help me on this by
>> today it would be highly appreciated.
>> 
>> Thanks & Regards,
>> Gautam Kanaujia
>> 
>>> On Thu, Jul 2, 2020 at 7:49 PM Gautam K  wrote:
>>> 
>>> Dear Team,
>>> 
>>> Hope you all are doing well.
>>> 
>>> Can you please help with the following question? We are using Solr search 
>>> in our Organisation and now checking whether Solr provides search 
>>> capabilities like Google Enterprise search(Google Knowledge Graph Search).
>>> 
>>> 1. Does Solr Search provide Voice Search like Google?
>>> 2. Does Solr Search provide NLP Search (Natural Language Processing)?
>>> 3. Does Solr have all the capabilities which Google Knowledge Graph 
>>> provides like below?
>>> 
>>> Getting a ranked list of the most notable entities that match certain 
>>> criteria.
>>> Predictively completing entities in a search box.
>>> Annotating/organizing content using the Knowledge Graph entities.
>>> 
>>> 
>>> Your help will be appreciated highly.
>>> 
>>> Many thanks
>>> Gautam Kanaujia
>>> India
> 
> -- 
> Regards,
> 
> Atri
> Apache Concerted


Re: ***URGENT***Re: Questions about Solr Search

2020-07-03 Thread Atri Sharma
Please do not cross post. I believe your questions were already answered?

On Fri, Jul 3, 2020 at 3:08 PM Gautam K  wrote:
>
> Since it's a bit of an urgent request, if you could please help me on this by
> today it would be highly appreciated.
>
> Thanks & Regards,
> Gautam Kanaujia
>
> On Thu, Jul 2, 2020 at 7:49 PM Gautam K  wrote:
>>
>> Dear Team,
>>
>> Hope you all are doing well.
>>
>> Can you please help with the following question? We are using Solr search in 
>> our Organisation and now checking whether Solr provides search capabilities 
>> like Google Enterprise search(Google Knowledge Graph Search).
>>
>> 1. Does Solr Search provide Voice Search like Google?
>> 2. Does Solr Search provide NLP Search (Natural Language Processing)?
>> 3. Does Solr have all the capabilities which Google Knowledge Graph provides 
>> like below?
>>
>> Getting a ranked list of the most notable entities that match certain 
>> criteria.
>> Predictively completing entities in a search box.
>> Annotating/organizing content using the Knowledge Graph entities.
>>
>>
>> Your help will be appreciated highly.
>>
>> Many thanks
>> Gautam Kanaujia
>> India

-- 
Regards,

Atri
Apache Concerted


Re: How to use two search string in a single solr query

2020-07-03 Thread Tushar Arora
Hi,
Thanks Erick and Walter for your responses.

Solr Version Used : 6.5.0
Let me elaborate on the issue:

Case 1 : Search String : Industrial Electric Oven
  Results=945
Case 2 : Search String : Dell laptop bags
  Results=992

In both cases above, mm plays its role (match any 2 words out of 3).

Now I want to search with both strings, with mm still playing its role.

q=(Industrial Electric Oven) OR (Dell laptop bags)
I want mm to still play its role, matching two out of three words in both
cases.
Ex: Documents containing electric oven, industrial oven, dell bags, laptop
bags should be returned.
I don't want documents containing only dell, bags, etc. Also no documents
containing electric bags.
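
Would something like the following be the right direction? It nests edismax per
clause via local params so that mm applies to each string separately (untested
sketch; qf shortened to just title):

    q=_query_:"{!edismax qf='title' mm=2 v='Industrial Electric Oven'}" OR
      _query_:"{!edismax qf='title' mm=2 v='Dell laptop bags'}"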

Regards,
Tushar Arora


On Thu, 2 Jul 2020 at 22:37, Walter Underwood  wrote:

> First, remove the “mm” parameter from the request handler definition. That
> can
> be added back in and tweaked later, or just left out.
>
> Second, you don’t need any query syntax to search for two words. This
> query
> should work fine:
>
>   books bags
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jul 1, 2020, at 10:22 PM, Tushar Arora  wrote:
> >
> > Hi,
> > I have a scenario with the following entry in the request handler (handler1)
> > of solrconfig.xml (defType=edismax is used):
> > <str name="df">description category</str> <str name="qf">title^4 demand^0.3</str>
> > <str name="mm">2<-1 4<-30%</str>
> >
> > When I searched 'bags' as a search string, solr returned 15000 results.
> > Query Used :
> >
> http://localhost:8984/solr/core_name/select?fl=title&indent=on&q=bags&qt=handler1&rows=10&wt=json
> >
> > And when searched 'books' as a search string, solr returns say 3348
> results.
> > Query Used :
> >
> http://localhost:8984/solr/core_name/select?fl=title&indent=on&q=books&qt=handler1&rows=10&wt=json
> >
> > I want to use both 'bags' and 'books' as a search string in a single
> query.
> > I used the following query:
> >
> http://localhost:8984/solr/core_name/select?fl=title&indent=on&q=%22bags%22+OR+%22books%22&qt=handler1&rows=10&wt=json
> > But the OR operator is not working. It only gives 7 results.
> >
> >
> > I even tried this :
> >
> http://localhost:8984/solr/core_name/select?fl=title&indent=on&q=(bags)+OR+(books)&qt=handler1&rows=10&wt=json
> > But it also gives 7 results.
> >
> > But my concern is to include the results of both 'bags' OR 'books' in a
> > single query.
> > Is there any way to use two search strings in a single query?
>
>


Re: Solr Float/Double multivalues fields

2020-07-03 Thread Vincenzo D'Amore
Hi Erick, I'm not sure I got it.
Does this mean that the order of values within a multivalued field:
- docValues=true: the result will be both re-ordered and deduplicated.
- docValues=false: the result order is guaranteed to match the
insertion order.

Is this correct?

On Thu, Jul 2, 2020 at 8:37 PM Erick Erickson wrote:

> This is true _unless_ you fetch from docValues. docValues are SORTED_SETs,
> so the results will be both ordered and deduplicated if you return them
> as part of the field list.
>
> Don’t really think it needs to go into the ref guide, it’s just inherent
> in storing
> any kind of value. You wouldn’t expect multiple text entries in a
> multiValued
> field to be rearranged when returning the stored values either.
>
> Best,
> Erick
>
> > On Jul 2, 2020, at 2:21 PM, Vincenzo D'Amore  wrote:
> >
> > Thanks, and genuinely asking: is there written somewhere in the
> > documentation too? If no, could anyone suggest to me which doc page
> should
> > I try to update?
> >
> > On Thu, Jul 2, 2020 at 8:08 PM Colvin Cowie 
> > wrote:
> >
> >> The order of values within a multivalued field should match the
> insertion
> >> order. -- we certainly rely on that in our product.
> >>
> >> Order is guaranteed to be maintained for values in a multi-valued field.
> >>>
> >>
> >>
> https://lucene.472066.n3.nabble.com/order-question-on-solr-multi-value-field-tp4027695p4028057.html
> >>
> >> On Thu, 2 Jul 2020 at 18:52, Vincenzo D'Amore 
> wrote:
> >>
> >>> Hi all,
> >>>
> >>> simple question: Solr float/double multivalue fields preserve the order
> >> of
> >>> inserted values?
> >>>
> >>> Best regards,
> >>> Vincenzo
> >>>
> >>> --
> >>> Vincenzo D'Amore
> >>>
> >>
> >
> >
> > --
> > Vincenzo D'Amore
>
>

-- 
Vincenzo D'Amore


Changing Response for Group Query - Custom Request Handler

2020-07-03 Thread dnz
Dear Community, 

I am currently working on a Solr Custom Plugin, which - for a group query - 
adds both total matches and number of groups to the response and also keeps the 
response format as if it is not a group query. One additional requirement is 
that numFound should contain the number of groups instead of total matches. 
This whole thing is required since we have some limitations on the consumer 
side and want to keep the same response format.
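
(The requests in question are of this general form; the grouping field name is
a placeholder:)

    q=foo&group=true&group.field=group_id_s&group.ngroups=true&group.format=simple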

If I use group.main=true, I get the results in the "usual" format, but
group.ngroups=true no longer has any effect. So I decided to create a custom
request handler in order to get the number of groups and create the response
format which we need for the requester. This format is basically:

---
{
  // ...
  "response": {
    "numFound": 25,  // contains number of grouped results --> for pagination reasons
    "total": 401,    // contains number of total results
    "start": 0,
    "docs": [
      {
        "id": "26207825"
      },
      // ...
    ]
  }
}
---

My first question would be: is this a good approach with a custom request
handler, or is there a better/easier approach to achieve this?

I went ahead and tried to implement the custom request handler. From a group
request (group.format=simple&group.ngroups=true), I can get a "docList" (or
docSlice) from the "grouped" part, but simply putting this docList into the
response gives me the same result as with group.main=true. Therefore I thought I
would create a new docSlice with mostly the content from the old docSlice and
put this in the response. That works for start=0, but as soon as I start
paginating and set start to anything other than zero, I get an error. Because,
even though start is not zero, the docs[] array holds the skipped result
documents too, but with the document iterator I can only access a subset of
them, so my new docSlice would only contain a subset of the documents. Here is
the method for creating a new docSlice:

---
private DocSlice getGroupedDocList(SimpleOrderedMap grouped, int numGroups) {

    DocSlice docSlice = (DocSlice) grouped.get("doclist");
    int offset = docSlice.offset();
    int docSize = docSlice.size();
    int len = offset + docSize;

    boolean hasScore = docSlice.hasScores();

    int[] docs = new int[len];
    float[] scores = new float[len];

    DocIterator iterator = docSlice.iterator();
    int i = 0;
    while (i < len && iterator.hasNext()) {
    docs[i] = iterator.nextDoc();
    LOGGER.error("Doc added: " + docs[i]);
    if (hasScore) { scores[i] = iterator.score(); }
    i++;
    }
    long matches = numGroups;
    float maxScore = docSlice.maxScore();

    DocSlice newDocSlice = new DocSlice(offset, len, docs, scores, matches, 
maxScore);
    return newDocSlice;
}
---

I am kind of stuck at this point. Can someone maybe help me?

Best regards,
Deniz C.

Re: Solr Float/Double multivalues fields

2020-07-03 Thread Thomas Corthals
I think this should go in the ref guide. If your product depends on this
behaviour, you want reassurance that it isn't going to change in the next
release. Not everyone will go looking through the javadoc to see if this is
implied.

Typically it'll either be something like "are always returned in insertion
order" or "are currently returned in insertion order, but your code
shouldn't rely on this behaviour because it can change in future releases".
That's usually sufficient to make an informed decision on how to handle
returned values.

If it's different for docValues, that's even more reason to state it
clearly in the ref guide to avoid confusion.

Best,
Thomas

On Thu, Jul 2, 2020 at 20:37, Erick Erickson wrote:

> This is true _unless_ you fetch from docValues. docValues are SORTED_SETs,
> so the results will be both ordered and deduplicated if you return them
> as part of the field list.
>
> Don’t really think it needs to go into the ref guide, it’s just inherent
> in storing
> any kind of value. You wouldn’t expect multiple text entries in a
> multiValued
> field to be rearranged when returning the stored values either.
>
> Best,
> Erick
>
> > On Jul 2, 2020, at 2:21 PM, Vincenzo D'Amore  wrote:
> >
> > Thanks, and genuinely asking: is there written somewhere in the
> > documentation too? If no, could anyone suggest to me which doc page
> should
> > I try to update?
> >
> > On Thu, Jul 2, 2020 at 8:08 PM Colvin Cowie 
> > wrote:
> >
> >> The order of values within a multivalued field should match the
> insertion
> >> order. -- we certainly rely on that in our product.
> >>
> >> Order is guaranteed to be maintained for values in a multi-valued field.
> >>>
> >>
> >>
> https://lucene.472066.n3.nabble.com/order-question-on-solr-multi-value-field-tp4027695p4028057.html
> >>
> >> On Thu, 2 Jul 2020 at 18:52, Vincenzo D'Amore 
> wrote:
> >>
> >>> Hi all,
> >>>
> >>> simple question: Solr float/double multivalue fields preserve the order
> >> of
> >>> inserted values?
> >>>
> >>> Best regards,
> >>> Vincenzo
> >>>
> >>> --
> >>> Vincenzo D'Amore
> >>>
> >>
> >
> >
> > --
> > Vincenzo D'Amore
>
>


RE: Time-out errors while indexing (Solr 7.7.1)

2020-07-03 Thread Kommu, Vinodh K.
Anyone has any thoughts or suggestions on this issue?

Thanks & Regards,
Vinodh

From: Kommu, Vinodh K.
Sent: Thursday, July 2, 2020 4:46 PM
To: solr-user@lucene.apache.org
Subject: Time-out errors while indexing (Solr 7.7.1)

Hi,

We are performing QA performance testing on a couple of collections which hold
2 billion and 3.5 billion docs respectively. Indexing happens from a separate
client using SolrJ, with 10 threads and a batch size of 1000. For the last 2-3
weeks we have been noticing either slow indexing or timeout errors while
indexing. As part of troubleshooting, we noticed that when peak disk IO
utilization reaches the higher side, indexing happens slowly, and when disk IO
is constantly near 100%, timeout issues are observed.
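
(For reference, each indexing thread does roughly the equivalent of this
sketch; the ZK hosts, collection and field names are placeholders, not our
actual code:)

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexerSketch {
        public static void main(String[] args) throws Exception {
            CloudSolrClient client = new CloudSolrClient.Builder(
                    Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"),
                    Optional.empty()).build();
            client.setDefaultCollection("TestCollection");
            List<SolrInputDocument> batch = new ArrayList<>(1000);
            for (int i = 0; i < 1000; i++) {   // one 1000-doc batch per add()
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("field_s", "value-" + i);
                batch.add(doc);
            }
            client.add(batch);                 // commits left to autoCommit
            client.close();
        }
    }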

Few questions here:


  1.  Our performance team noticed that read operations greatly outnumber write
operations, around a 100:1 ratio; is this expected during indexing, or are the
solr nodes doing any other operations like syncing?
  2.  ZooKeeper has latency of around (min/avg/max: 0/0/2205); can this latency
create instability issues in the ZK or Solr clusters? Or impact indexing or
searching operations?
  3.  Our client timeout is set to 2 mins; can we increase it further? Would
that help or create any other problems?
  4.  When we created an empty collection and loaded the same data file, it
loaded fine without any issues; so would having more documents in a collection
create such problems?

Any suggestions or feedback would be really appreciated.

Solr version - 7.7.1

Time out error snippet:

ERROR 
(updateExecutor-3-thread-30055-processing-x:TestCollection_shard5_replica_n18 
https:localhost:1122//solr//TestCollection_shard6_replica_n22
 r:core_node21 n:localhost:1122_solr c:TestCollection s:shard5) 
[c:TestCollection s:shard5 r:core_node21 x:TestCollection_shard5_replica_n18] 
o.a.s.u.ErrorReportingConcurrentUpdateSolrClient error
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_212]
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) 
~[?:1.8.0_212]
at java.net.SocketInputStream.read(SocketInputStream.java:171) 
~[?:1.8.0_212]
at java.net.SocketInputStream.read(SocketInputStream.java:141) 
~[?:1.8.0_212]
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) 
~[?:1.8.0_212]
at sun.security.ssl.InputRecord.read(InputRecord.java:503) 
~[?:1.8.0_212]
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975) 
~[?:1.8.0_212]
at 
sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933) 
~[?:1.8.0_212]
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105) 
~[?:1.8.0_212]
at 
org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
 ~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
 ~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:120)
 ~[solr-core-7.7.1.jar:7.7.1 5bf96d32f88eb8a2f5e775339885cd6ba84a3b58 - ishan - 
2019-02-23 02:39:07]
at 
org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) 
~[httpclient-4.5.6.jar:4.5.6]
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
 ~[httpclient-4.5.6.jar:4.5.6]