Re: unified highlighter performance in solr 8.5.1
I doubt that WORD mode is impacted much by hl.fragsizeIsMinimum in terms of quality of the highlight since there are vastly more breaks to pick from. I think that setting is more useful in SENTENCE mode if you can stand the perf hit. If you agree, then why not just let this one default to "true"? We agree on better documenting the perf trade-off. Thanks again for working on these settings, BTW. ~ David On Fri, Jul 3, 2020 at 1:25 PM Nándor Mátravölgyi wrote: > Since the issue seems to be affecting the highlighter differently > based on which mode it is using, having different defaults for the > modes could be explored. > > WORD may have the new defaults as it has little effect on performance > and it creates nicer highlights. > SENTENCE should have the defaults that produce reasonable performance. > The docs could document this while also mentioning that the UH's > performance is highly dependent on the underlying Java String/Text? > Iterator. > > One can argue that having different defaults based on mode is > confusing. In this case I think the defaults should be changed to have > the SENTENCE mode perform better. Maybe the options for nice > highlights with WORD mode could be put into the docs in this case as > some form of an example. > > As long as I can use the UH with nicely aligned snippets in WORD mode > I'm fine with any defaults. I explicitly set them in the config and in > the queries most of the time anyways. >
Re: Out of memory errors with Spatial indexing
Hi Sunil, Your shape is at a pole, and I'm aware of a bug causing an exponential explosion of needed grid squares when you have polygons super-close to the pole. Might you try S2PrefixTree instead? I forget if this would fix it or not by itself. For indexing non-point data, I recommend class="solr.RptWithGeometrySpatialField", which internally is based on a combination of a coarse grid and storing the original vector geometry for accurate verification: the internally coarser grid will lessen the impact of that pole bug. ~ David Smiley Apache Lucene/Solr Search Developer http://www.linkedin.com/in/davidwsmiley On Fri, Jul 3, 2020 at 7:48 AM Sunil Varma wrote: > We are seeing OOM errors when trying to index some spatial data. I believe > the data itself might not be valid but it shouldn't cause the Server to > crash. We see this on both Solr 7.6 and Solr 8. Below is the input that is > causing the error. > > { > "id": "bad_data_1", > "spatialwkt_srpt": "LINESTRING (-126.86037681029909 -90.0 > 1.000150474662E30, 73.58164711175415 -90.0 1.000150474662E30, > 74.52836551959528 -90.0 1.000150474662E30, 74.97006811540834 -90.0 > 1.000150474662E30)" > } > > Above dynamic field is mapped to field type "location_rpt" ( > solr.SpatialRecursivePrefixTreeFieldType). > > Any pointers to get around this issue would be highly appreciated. > > Thanks! >
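For reference, a sketch of the schema change David suggests. The field-type and dynamic-field names are made up, and the attribute set should be verified against your Solr version's ref guide (my understanding is that prefixTree="s2" requires the Geo3D spatial context):

```xml
<!-- Illustrative sketch, not from any shipped schema: geometry-backed RPT
     field with an S2 prefix tree, as suggested in the reply above. -->
<fieldType name="location_rpt_geom"
           class="solr.RptWithGeometrySpatialField"
           spatialContextFactory="Geo3D"
           prefixTree="s2"
           distanceUnits="kilometers"/>
<dynamicField name="*_srpt_geom" type="location_rpt_geom"
              indexed="true" stored="true"/>
```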
Re: Time-out errors while indexing (Solr 7.7.1)
Oops, I transposed that. If your index is a terabyte and your RAM is 128M, _that’s_ a red flag. > On Jul 3, 2020, at 5:53 PM, Erick Erickson wrote: > > You haven’t said how many _shards_ are present. Nor how many replicas of the > collection you’re hosting per physical machine. Nor how large the indexes are > on disk. Those are the numbers that count. The latter is somewhat fuzzy, but > if your aggregate index size on a machine with, say, 128G of memory is a > terabyte, that’s a red flag. > > Short form, though is yes. Subject to the questions above, this is what I’d > be looking at first. > > And, as I said, if you’ve been steadily increasing the total number of > documents, you’ll reach a tipping point sometime. > > Best, > Erick > >> On Jul 3, 2020, at 5:32 PM, Mad have wrote: >> >> Hi Eric, >> >> The collection has almost 13billion documents with each document around 5kb >> size, all the columns around 150 are the indexed. Do you think that number >> of documents in the collection causing this issue. Appreciate your response. >> >> Regards, >> Madhava >> >> Sent from my iPhone >> >>> On 3 Jul 2020, at 12:42, Erick Erickson wrote: >>> >>> If you’re seeing low CPU utilization at the same time, you probably >>> just have too much data on too little hardware. Check your >>> swapping, how much of your I/O is just because Lucene can’t >>> hold all the parts of the index it needs in memory at once? Lucene >>> uses MMapDirectory to hold the index and you may well be >>> swapping, see: >>> >>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html >>> >>> But my guess is that you’ve just reached a tipping point. You say: >>> >>> "From last 2-3 weeks we have been noticing either slow indexing or timeout >>> errors while indexing” >>> >>> So have you been continually adding more documents to your >>> collections for more than the 2-3 weeks? 
If so you may have just >>> put so much data on the same boxes that you’ve gone over >>> the capacity of your hardware. As Toke says, adding physical >>> memory for the OS to use to hold relevant parts of the index may >>> alleviate the problem (again, refer to Uwe’s article for why). >>> >>> All that said, if you’re going to keep adding document you need to >>> seriously think about adding new machines and moving some of >>> your replicas to them. >>> >>> Best, >>> Erick >>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen wrote: > On Thu, 2020-07-02 at 11:16 +, Kommu, Vinodh K. wrote: > We are performing QA performance testing on couple of collections > which holds 2 billion and 3.5 billion docs respectively. How many shards? > 1. Our performance team noticed that read operations are pretty > more than write operations like 100:1 ratio, is this expected during > indexing or solr nodes are doing any other operations like syncing? Are you saying that there are 100 times more read operations when you are indexing? That does not sound too unrealistic as the disk cache might be filled with the data that the writers are flushing. In that case, more RAM would help. Okay, more RAM nearly always helps, but such massive difference in IO-utilization does indicate that you are starved for cache. I noticed you have at least 18 replicas. That's a lot. Just to sanity check: How many replicas are each physical box handling? If they are sharing resources, fewer replicas would probably be better. > 3. Our client timeout is set to 2mins, can they increase further > more? Would that help or create any other problems? It does not hurt the server to increase the client timeout as the initiated query will keep running until it is finished, independent of whether or not there is a client to receive the result. 
If you want a better max time for query processing, you should look at https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter but due to its inherent limitations it might not help in your situation. > 4. When we created an empty collection and loaded same data file, > it loaded fine without any issues so having more documents in a > collection would create such problems? Solr 7 does have a problem with sparse DocValues and many documents, leading to excessive IO-activity, which might be what you are seeing. I can see from an earlier post that you were using streaming expressions for another collection: This is one of the things that are affected by the Solr 7 DocValues issue. More info about DocValues and streaming: https://issues.apache.org/jira/browse/SOLR-13013 Fairly in-depth info on the problem with Solr 7 docValues: https://issues.apache.org/jira/browse/LUCENE-8374 If this is your problem, upgrading to Solr 8 and indexing the >
Re: Time-out errors while indexing (Solr 7.7.1)
You haven’t said how many _shards_ are present. Nor how many replicas of the collection you’re hosting per physical machine. Nor how large the indexes are on disk. Those are the numbers that count. The latter is somewhat fuzzy, but if your aggregate index size on a machine with, say, 128G of memory is a terabyte, that’s a red flag. Short form, though is yes. Subject to the questions above, this is what I’d be looking at first. And, as I said, if you’ve been steadily increasing the total number of documents, you’ll reach a tipping point sometime. Best, Erick > On Jul 3, 2020, at 5:32 PM, Mad have wrote: > > Hi Eric, > > The collection has almost 13billion documents with each document around 5kb > size, all the columns around 150 are the indexed. Do you think that number of > documents in the collection causing this issue. Appreciate your response. > > Regards, > Madhava > > Sent from my iPhone > >> On 3 Jul 2020, at 12:42, Erick Erickson wrote: >> >> If you’re seeing low CPU utilization at the same time, you probably >> just have too much data on too little hardware. Check your >> swapping, how much of your I/O is just because Lucene can’t >> hold all the parts of the index it needs in memory at once? Lucene >> uses MMapDirectory to hold the index and you may well be >> swapping, see: >> >> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html >> >> But my guess is that you’ve just reached a tipping point. You say: >> >> "From last 2-3 weeks we have been noticing either slow indexing or timeout >> errors while indexing” >> >> So have you been continually adding more documents to your >> collections for more than the 2-3 weeks? If so you may have just >> put so much data on the same boxes that you’ve gone over >> the capacity of your hardware. As Toke says, adding physical >> memory for the OS to use to hold relevant parts of the index may >> alleviate the problem (again, refer to Uwe’s article for why). 
>> >> All that said, if you’re going to keep adding document you need to >> seriously think about adding new machines and moving some of >> your replicas to them. >> >> Best, >> Erick >> >>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen wrote: >>> On Thu, 2020-07-02 at 11:16 +, Kommu, Vinodh K. wrote: We are performing QA performance testing on couple of collections which holds 2 billion and 3.5 billion docs respectively. >>> >>> How many shards? >>> 1. Our performance team noticed that read operations are pretty more than write operations like 100:1 ratio, is this expected during indexing or solr nodes are doing any other operations like syncing? >>> >>> Are you saying that there are 100 times more read operations when you >>> are indexing? That does not sound too unrealistic as the disk cache >>> might be filled with the data that the writers are flushing. >>> >>> In that case, more RAM would help. Okay, more RAM nearly always helps, >>> but such massive difference in IO-utilization does indicate that you >>> are starved for cache. >>> >>> I noticed you have at least 18 replicas. That's a lot. Just to sanity >>> check: How many replicas are each physical box handling? If they are >>> sharing resources, fewer replicas would probably be better. >>> 3. Our client timeout is set to 2mins, can they increase further more? Would that help or create any other problems? >>> >>> It does not hurt the server to increase the client timeout as the >>> initiated query will keep running until it is finished, independent of >>> whether or not there is a client to receive the result. >>> >>> If you want a better max time for query processing, you should look at >>> >>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter >>> but due to its inherent limitations it might not help in your >>> situation. >>> 4. 
When we created an empty collection and loaded same data file, it loaded fine without any issues so having more documents in a collection would create such problems? >>> >>> Solr 7 does have a problem with sparse DocValues and many documents, >>> leading to excessive IO-activity, which might be what you are seeing. I >>> can see from an earlier post that you were using streaming expressions >>> for another collection: This is one of the things that are affected by >>> the Solr 7 DocValues issue. >>> >>> More info about DocValues and streaming: >>> https://issues.apache.org/jira/browse/SOLR-13013 >>> >>> Fairly in-depth info on the problem with Solr 7 docValues: >>> https://issues.apache.org/jira/browse/LUCENE-8374 >>> >>> If this is your problem, upgrading to Solr 8 and indexing the >>> collection from scratch should fix it. >>> >>> Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7 >>> or you can ensure that there are values defined for all DocValues- >>> fields in all your documents. >>> java.net.SocketTimeoutException: Read timed
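Erick's sizing rule of thumb above can be turned into a rough back-of-envelope check. The directory-walking helper and the 4:1 threshold below are illustrative assumptions, not numbers from this thread:

```python
# Rough capacity sanity check: compare aggregate on-disk index size on one
# box to its physical RAM. The ratio is a hypothetical threshold -- the real
# tipping point depends on query patterns and which index files stay hot.
import os

def dir_size(path):
    """Total bytes of regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for f in files:
            fp = os.path.join(root, f)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total

def red_flag(index_bytes, ram_bytes, ratio=4):
    """Flag when the index dwarfs RAM (hypothetical 4:1 cutoff)."""
    return index_bytes > ratio * ram_bytes

# Erick's example figures: a 1 TB index on a 128 GB machine
print(red_flag(1 << 40, 128 * (1 << 30)))  # True
```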
Re: Time-out errors while indexing (Solr 7.7.1)
Hi Eric, The collection has almost 13billion documents with each document around 5kb size, all the columns around 150 are the indexed. Do you think that number of documents in the collection causing this issue. Appreciate your response. Regards, Madhava Sent from my iPhone > On 3 Jul 2020, at 12:42, Erick Erickson wrote: > > If you’re seeing low CPU utilization at the same time, you probably > just have too much data on too little hardware. Check your > swapping, how much of your I/O is just because Lucene can’t > hold all the parts of the index it needs in memory at once? Lucene > uses MMapDirectory to hold the index and you may well be > swapping, see: > > https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html > > But my guess is that you’ve just reached a tipping point. You say: > > "From last 2-3 weeks we have been noticing either slow indexing or timeout > errors while indexing” > > So have you been continually adding more documents to your > collections for more than the 2-3 weeks? If so you may have just > put so much data on the same boxes that you’ve gone over > the capacity of your hardware. As Toke says, adding physical > memory for the OS to use to hold relevant parts of the index may > alleviate the problem (again, refer to Uwe’s article for why). > > All that said, if you’re going to keep adding document you need to > seriously think about adding new machines and moving some of > your replicas to them. > > Best, > Erick > >> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen wrote: >> >>> On Thu, 2020-07-02 at 11:16 +, Kommu, Vinodh K. wrote: >>> We are performing QA performance testing on couple of collections >>> which holds 2 billion and 3.5 billion docs respectively. >> >> How many shards? >> >>> 1. Our performance team noticed that read operations are pretty >>> more than write operations like 100:1 ratio, is this expected during >>> indexing or solr nodes are doing any other operations like syncing? 
>> >> Are you saying that there are 100 times more read operations when you >> are indexing? That does not sound too unrealistic as the disk cache >> might be filled with the data that the writers are flushing. >> >> In that case, more RAM would help. Okay, more RAM nearly always helps, >> but such massive difference in IO-utilization does indicate that you >> are starved for cache. >> >> I noticed you have at least 18 replicas. That's a lot. Just to sanity >> check: How many replicas are each physical box handling? If they are >> sharing resources, fewer replicas would probably be better. >> >>> 3. Our client timeout is set to 2mins, can they increase further >>> more? Would that help or create any other problems? >> >> It does not hurt the server to increase the client timeout as the >> initiated query will keep running until it is finished, independent of >> whether or not there is a client to receive the result. >> >> If you want a better max time for query processing, you should look at >> >> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter >> but due to its inherent limitations it might not help in your >> situation. >> >>> 4. When we created an empty collection and loaded same data file, >>> it loaded fine without any issues so having more documents in a >>> collection would create such problems? >> >> Solr 7 does have a problem with sparse DocValues and many documents, >> leading to excessive IO-activity, which might be what you are seeing. I >> can see from an earlier post that you were using streaming expressions >> for another collection: This is one of the things that are affected by >> the Solr 7 DocValues issue. 
>> >> More info about DocValues and streaming: >> https://issues.apache.org/jira/browse/SOLR-13013 >> >> Fairly in-depth info on the problem with Solr 7 docValues: >> https://issues.apache.org/jira/browse/LUCENE-8374 >> >> If this is your problem, upgrading to Solr 8 and indexing the >> collection from scratch should fix it. >> >> Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7 >> or you can ensure that there are values defined for all DocValues- >> fields in all your documents. >> >>> java.net.SocketTimeoutException: Read timed out >>> at java.net.SocketInputStream.socketRead0(Native Method) >> ... >>> Remote error message: java.util.concurrent.TimeoutException: Idle >>> timeout expired: 60/60 ms >> >> There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You >> should be able to change it in solr.xml. >> https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html >> >> BUT if an update takes > 10 minutes to be processed, it indicates that >> the cluster is overloaded. Increasing the timeout is just a band-aid. >> >> - Toke Eskildsen, Royal Danish Library >> >> >
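As a concrete sketch of the solr.xml knob Toke mentions (he flags the name with a question mark himself, so verify it for your version), with 600000 ms matching the 10-minute default he describes:

```xml
<!-- Fragment of solr.xml; settings go in the <solrcloud> section.
     Values here are the documented defaults, shown explicitly. -->
<solr>
  <solrcloud>
    <int name="distribUpdateConnTimeout">60000</int>
    <int name="distribUpdateSoTimeout">600000</int>
  </solrcloud>
</solr>
```

As Toke says, raising this is a band-aid if updates genuinely take that long.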
Re: unified highlighter performance in solr 8.5.1
Since the issue seems to be affecting the highlighter differently based on which mode it is using, having different defaults for the modes could be explored.

WORD may have the new defaults as it has little effect on performance and it creates nicer highlights. SENTENCE should have the defaults that produce reasonable performance. The docs could document this while also mentioning that the UH's performance is highly dependent on the underlying Java String/Text Iterator.

One can argue that having different defaults based on mode is confusing. In this case I think the defaults should be changed to have the SENTENCE mode perform better. Maybe the options for nice highlights with WORD mode could be put into the docs in this case as some form of an example.

As long as I can use the UH with nicely aligned snippets in WORD mode I'm fine with any defaults. I explicitly set them in the config and in the queries most of the time anyways.
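The explicit per-request settings mentioned above look like this as Solr query parameters. Host, port, and collection name are placeholders; the hl.* parameter names are the ones discussed in this thread:

```python
# Build a Solr select URL exercising the unified-highlighter settings
# from the thread. The endpoint is hypothetical; adjust to your install.
from urllib.parse import urlencode

params = {
    "q": "text:example",
    "hl": "true",
    "hl.method": "unified",
    "hl.bs.type": "WORD",             # or SENTENCE; perf differs a lot
    "hl.fragsizeIsMinimum": "false",  # nicer snippets, slower on long text
    "hl.fragAlignRatio": "0.5",       # 0 restores the pre-8.5 alignment
}
url = "http://localhost:8983/solr/mycoll/select?" + urlencode(params)
print(url)
```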
Re: unified highlighter performance in solr 8.5.1
I think we should flip the default of hl.fragsizeIsMinimum to be 'true', thus have the behavior close to what preceded 8.5. (a) it was very recently (<= 8.4) the previous behavior and so may require less tuning for users in 8.6 henceforth (b) it's significantly faster for long text -- seems to be 2x to 5x for long documents (assuming no change in hl.fragAlignRatio). If the user additionally configures hl.fragAlignRatio to 0 (also the previous behavior; 0.5 is the new default), I saw another 6x on top of that for "doc3" in the test data Michal prepared. Although I like that the sizing looks nicer, I think that is more from the introduction and new default of hl.fragAlignRatio=0.5 than it is hl.fragsizeIsMinimum=false. We might even consider lowering hl.fragAlignRatio to say 0.3 and retain pretty reasonable highlights (avoids the extreme cases occurring with '0') and additional performance benefit from that. What do you think Nandor, Michal? I'm hoping a change in settings (+ some better notes/docs on this) could slip into an 8.6, all done by myself ASAP. ~ David On Fri, Jun 19, 2020 at 2:32 PM Nándor Mátravölgyi wrote: > Hi! > > With the provided test I've profiled the preceding() and following() > calls on the base Java iterators in the different options. 
> > === default highlighter arguments === > Calling the test query with SENTENCE base iterator: > - from LengthGoalBreakIterator.following(): 1130 calls of > baseIter.preceding() took 1.039629 seconds in total > - from LengthGoalBreakIterator.following(): 1140 calls of > baseIter.following() took 0.340679 seconds in total > - from LengthGoalBreakIterator.preceding(): 1150 calls of > baseIter.preceding() took 0.099344 seconds in total > - from LengthGoalBreakIterator.preceding(): 1100 calls of > baseIter.following() took 0.015156 seconds in total > > Calling the test query with WORD base iterator: > - from LengthGoalBreakIterator.following(): 1200 calls of > baseIter.preceding() took 0.001006 seconds in total > - from LengthGoalBreakIterator.following(): 1700 calls of > baseIter.following() took 0.006278 seconds in total > - from LengthGoalBreakIterator.preceding(): 1710 calls of > baseIter.preceding() took 0.016320 seconds in total > - from LengthGoalBreakIterator.preceding(): 1090 calls of > baseIter.following() took 0.000527 seconds in total > > === hl.fragsizeIsMinimum=true&hl.fragAlignRatio=0 === > Calling the test query with SENTENCE base iterator: > - from LengthGoalBreakIterator.following(): 860 calls of > baseIter.following() took 0.012593 seconds in total > - from LengthGoalBreakIterator.preceding(): 870 calls of > baseIter.preceding() took 0.022208 seconds in total > > Calling the test query with WORD base iterator: > - from LengthGoalBreakIterator.following(): 1360 calls of > baseIter.following() took 0.004789 seconds in total > - from LengthGoalBreakIterator.preceding(): 1370 calls of > baseIter.preceding() took 0.015983 seconds in total > > === hl.fragsizeIsMinimum=true === > Calling the test query with SENTENCE base iterator: > - from LengthGoalBreakIterator.following(): 980 calls of > baseIter.following() took 0.010253 seconds in total > - from LengthGoalBreakIterator.preceding(): 980 calls of > baseIter.preceding() took 0.341997 seconds in total > > 
Calling the test query with WORD base iterator: > - from LengthGoalBreakIterator.following(): 1670 calls of > baseIter.following() took 0.005150 seconds in total > - from LengthGoalBreakIterator.preceding(): 1680 calls of > baseIter.preceding() took 0.013657 seconds in total > > === hl.fragAlignRatio=0 === > Calling the test query with SENTENCE base iterator: > - from LengthGoalBreakIterator.following(): 1070 calls of > baseIter.preceding() took 1.312056 seconds in total > - from LengthGoalBreakIterator.following(): 1080 calls of > baseIter.following() took 0.678575 seconds in total > - from LengthGoalBreakIterator.preceding(): 1080 calls of > baseIter.preceding() took 0.020507 seconds in total > - from LengthGoalBreakIterator.preceding(): 1080 calls of > baseIter.following() took 0.006977 seconds in total > > Calling the test query with WORD base iterator: > - from LengthGoalBreakIterator.following(): 880 calls of > baseIter.preceding() took 0.000706 seconds in total > - from LengthGoalBreakIterator.following(): 1370 calls of > baseIter.following() took 0.004110 seconds in total > - from LengthGoalBreakIterator.preceding(): 1380 calls of > baseIter.preceding() took 0.014752 seconds in total > - from LengthGoalBreakIterator.preceding(): 1380 calls of > baseIter.following() took 0.000106 seconds in total > > There is definitely a big difference between SENTENCE and WORD. I'm > not sure how we can improve the logic on our side while keeping the > features as is. Since the number of calls is roughly the same for when > the performance is good and bad, it seems to depend on what the text > is that the iterator is traversing. >
Re: Solr Float/Double multivalues fields
On Fri, Jul 3, 2020 at 14:11, Bram Van Dam wrote:
> On 03/07/2020 09:50, Thomas Corthals wrote:
> > I think this should go in the ref guide. If your product depends on this
> > behaviour, you want reassurance that it isn't going to change in the next
> > release. Not everyone will go looking through the javadoc to see if this
> > is implied.
>
> This is in the ref guide. Section DocValues. Here's the quote:
>
> DocValues are only available for specific field types. The types chosen
> determine the underlying Lucene docValue type that will be used. The
> available Solr field types are:
> • StrField, and UUIDField:
> ◦ If the field is single-valued (i.e., multi-valued is false), Lucene
> will use the SORTED type.
> ◦ If the field is multi-valued, Lucene will use the SORTED_SET type.
> Entries are kept in sorted order and duplicates are removed.
> • BoolField:
> ◦ If the field is single-valued (i.e., multi-valued is false), Lucene
> will use the SORTED type.
> © 2019, Apache Software Foundation Guide Version 7.7 - Published:
> 2019-03-04 Page 212 of 1426 Apache Solr Reference Guide 7.7
> ◦ If the field is multi-valued, Lucene will use the SORTED_SET type.
> Entries are kept in sorted order and duplicates are removed.
> • Any *PointField Numeric or Date fields, EnumFieldType, and
> CurrencyFieldType:
> ◦ If the field is single-valued (i.e., multi-valued is false), Lucene
> will use the NUMERIC type.
> ◦ If the field is multi-valued, Lucene will use the SORTED_NUMERIC type.
> Entries are kept in sorted order and duplicates are kept.
> • Any of the deprecated Trie* Numeric or Date fields, EnumField and
> CurrencyField:
> ◦ If the field is single-valued (i.e., multi-valued is false), Lucene
> will use the NUMERIC type.
> ◦ If the field is multi-valued, Lucene will use the SORTED_SET type.
> Entries are kept in sorted order and duplicates are removed.
> These Lucene types are related to how the values are sorted and stored.

Great for docValues.
But I couldn't find anything similar for multiValued in the field type pages of the ref guide (unless I totally missed it of course). It doesn't have to be as elaborate, as long as it's clear and doesn't leave users wondering or assuming.
Re: Solr Float/Double multivalues fields
On 03/07/2020 09:50, Thomas Corthals wrote:
> I think this should go in the ref guide. If your product depends on this
> behaviour, you want reassurance that it isn't going to change in the next
> release. Not everyone will go looking through the javadoc to see if this is
> implied.

This is in the ref guide. Section DocValues. Here's the quote:

DocValues are only available for specific field types. The types chosen determine the underlying Lucene docValue type that will be used. The available Solr field types are:
• StrField, and UUIDField:
  ◦ If the field is single-valued (i.e., multi-valued is false), Lucene will use the SORTED type.
  ◦ If the field is multi-valued, Lucene will use the SORTED_SET type. Entries are kept in sorted order and duplicates are removed.
• BoolField:
  ◦ If the field is single-valued (i.e., multi-valued is false), Lucene will use the SORTED type.
  ◦ If the field is multi-valued, Lucene will use the SORTED_SET type. Entries are kept in sorted order and duplicates are removed.
• Any *PointField Numeric or Date fields, EnumFieldType, and CurrencyFieldType:
  ◦ If the field is single-valued (i.e., multi-valued is false), Lucene will use the NUMERIC type.
  ◦ If the field is multi-valued, Lucene will use the SORTED_NUMERIC type. Entries are kept in sorted order and duplicates are kept.
• Any of the deprecated Trie* Numeric or Date fields, EnumField and CurrencyField:
  ◦ If the field is single-valued (i.e., multi-valued is false), Lucene will use the NUMERIC type.
  ◦ If the field is multi-valued, Lucene will use the SORTED_SET type. Entries are kept in sorted order and duplicates are removed.
These Lucene types are related to how the values are sorted and stored.
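The practical difference between the two multiValued behaviors quoted above can be modeled in a few lines. This mimics the documented semantics only; it is not Lucene code:

```python
# Model of the two docValues behaviors for multiValued fields:
# SORTED_SET (strings, deprecated Trie numerics) vs SORTED_NUMERIC
# (*PointField numerics). Illustrative sketch of the documented semantics.

def sorted_set(values):
    """SORTED_SET: entries kept in sorted order, duplicates removed."""
    return sorted(set(values))

def sorted_numeric(values):
    """SORTED_NUMERIC: entries kept in sorted order, duplicates kept."""
    return sorted(values)

print(sorted_set(["b", "a", "b"]))      # ['a', 'b']
print(sorted_numeric([2.5, 1.0, 2.5]))  # [1.0, 2.5, 2.5]
```

So the same multi-valued float data round-trips with duplicates under a *PointField but silently loses them under a Trie field — which is exactly why this deserves a mention on the field type pages.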
Re: Adding solr-core via maven fails
If you feel strongly that Solr needs to keep the Maven bits up to date, you can volunteer to help maintain them; Solr is open source, after all. > On Jul 3, 2020, at 12:08 AM, Ali Akhtar wrote: > > I had to add an additional repository to get the failing dependency to > resolve: > > resolvers += "Spring Plugins Repository" at " > https://repo.spring.io/plugins-release/"; > >> However, we do not officially support Maven builds, > > Um, why? This is a java based project, and maven is the de-facto standard > for Java. What if someone wanted to make use of any of Solr's java > libraries in their own JVM based project? There's no (clean) way to do it > other than adding it as a maven dependency and importing the class into > their code. > > > On Thu, Jul 2, 2020 at 6:07 PM Mike Drob wrote: > >> Does it fail similarly on 8.5.0 and .1? >> >> On Thu, Jul 2, 2020 at 6:38 AM Erick Erickson >> wrote: >> >>> There have been some issues with Maven, see: >>> https://issues.apache.org/jira/browse/LUCENE-9170 >>> >>> However, we do not officially support Maven builds, they’re there as a >>> convenience, so there may still >>> be issues in future.
>>> On Jul 2, 2020, at 1:27 AM, Ali Akhtar wrote: If I try adding solr-core to an existing project, e.g (SBT): libraryDependencies += "org.apache.solr" % "solr-core" % "8.5.2" It fails due a 404 on the dependencies: Extracting structure failed stack trace is suppressed; run last update for the full output stack trace is suppressed; run last ssExtractDependencies for the full output (update) sbt.librarymanagement.ResolveException: Error downloading org.restlet.jee:org.restlet:2.4.0 Not found Not found not found: /home/ali/.ivy2/local/org.restlet.jee/org.restlet/2.4.0/ivys/ivy.xml not found: >>> >> https://repo1.maven.org/maven2/org/restlet/jee/org.restlet/2.4.0/org.restlet-2.4.0.pom Error downloading org.restlet.jee:org.restlet.ext.servlet:2.4.0 Not found Not found not found: >>> >> /home/ali/.ivy2/local/org.restlet.jee/org.restlet.ext.servlet/2.4.0/ivys/ivy.xml not found: >>> >> https://repo1.maven.org/maven2/org/restlet/jee/org.restlet.ext.servlet/2.4.0/org.restlet.ext.servlet-2.4.0.pom (ssExtractDependencies) sbt.librarymanagement.ResolveException: Error downloading org.restlet.jee:org.restlet:2.4.0 Not found Not found not found: /home/ali/.ivy2/local/org.restlet.jee/org.restlet/2.4.0/ivys/ivy.xml not found: >>> >> https://repo1.maven.org/maven2/org/restlet/jee/org.restlet/2.4.0/org.restlet-2.4.0.pom Error downloading org.restlet.jee:org.restlet.ext.servlet:2.4.0 Not found Not found not found: >>> >> /home/ali/.ivy2/local/org.restlet.jee/org.restlet.ext.servlet/2.4.0/ivys/ivy.xml not found: >>> >> https://repo1.maven.org/maven2/org/restlet/jee/org.restlet.ext.servlet/2.4.0/org.restlet.ext.servlet-2.4.0.pom Any ideas? Do I need to add a specific repository to get it to compile? >>> >>> >>
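For Maven users hitting the same restlet 404, the pom.xml equivalent of Ali's SBT resolver would look roughly like this (repository URL from his message; version matching the thread):

```xml
<!-- Sketch of a pom.xml fix mirroring the SBT workaround above:
     add the Spring plugins repository so org.restlet.jee resolves. -->
<repositories>
  <repository>
    <id>spring-plugins</id>
    <url>https://repo.spring.io/plugins-release/</url>
  </repository>
</repositories>
<dependencies>
  <dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-core</artifactId>
    <version>8.5.2</version>
  </dependency>
</dependencies>
```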
Out of memory errors with Spatial indexing
We are seeing OOM errors when trying to index some spatial data. I believe the data itself might not be valid, but it shouldn't cause the server to crash. We see this on both Solr 7.6 and Solr 8. Below is the input that is causing the error.

{
  "id": "bad_data_1",
  "spatialwkt_srpt": "LINESTRING (-126.86037681029909 -90.0 1.000150474662E30, 73.58164711175415 -90.0 1.000150474662E30, 74.52836551959528 -90.0 1.000150474662E30, 74.97006811540834 -90.0 1.000150474662E30)"
}

The above dynamic field is mapped to field type "location_rpt" (solr.SpatialRecursivePrefixTreeFieldType).

Any pointers to get around this issue would be highly appreciated.

Thanks!
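Until the server-side bug is addressed, one workaround is a client-side guard that rejects shapes touching a pole before they are sent for indexing. This is a hypothetical pre-index filter, not a Solr feature, and the parsing below naively assumes X Y Z coordinate triples like the data above:

```python
# Flag WKT whose vertices sit at a pole (lat = +/-90), the shape of the
# bad_data_1 input in this report. Naive regex parsing, illustration only.
import re

_NUM = re.compile(r'[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?')

def wkt_touches_pole(wkt, eps=1e-6):
    """True if any Y value in an X Y Z coordinate list is at +/-90."""
    coords = _NUM.findall(wkt)
    ys = [float(v) for v in coords[1::3]]  # every 2nd of each triple
    return any(abs(abs(y) - 90.0) < eps for y in ys)

bad = ("LINESTRING (-126.86037681029909 -90.0 1.000150474662E30, "
       "73.58164711175415 -90.0 1.000150474662E30)")
print(wkt_touches_pole(bad))  # True
```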
Re: Time-out errors while indexing (Solr 7.7.1)
If you’re seeing low CPU utilization at the same time, you probably just have too much data on too little hardware. Check your swapping, how much of your I/O is just because Lucene can’t hold all the parts of the index it needs in memory at once? Lucene uses MMapDirectory to hold the index and you may well be swapping, see: https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html But my guess is that you’ve just reached a tipping point. You say: "From last 2-3 weeks we have been noticing either slow indexing or timeout errors while indexing” So have you been continually adding more documents to your collections for more than the 2-3 weeks? If so you may have just put so much data on the same boxes that you’ve gone over the capacity of your hardware. As Toke says, adding physical memory for the OS to use to hold relevant parts of the index may alleviate the problem (again, refer to Uwe’s article for why). All that said, if you’re going to keep adding document you need to seriously think about adding new machines and moving some of your replicas to them. Best, Erick > On Jul 3, 2020, at 7:14 AM, Toke Eskildsen wrote: > > On Thu, 2020-07-02 at 11:16 +, Kommu, Vinodh K. wrote: >> We are performing QA performance testing on couple of collections >> which holds 2 billion and 3.5 billion docs respectively. > > How many shards? > >> 1. Our performance team noticed that read operations are pretty >> more than write operations like 100:1 ratio, is this expected during >> indexing or solr nodes are doing any other operations like syncing? > > Are you saying that there are 100 times more read operations when you > are indexing? That does not sound too unrealistic as the disk cache > might be filled with the data that the writers are flushing. > > In that case, more RAM would help. Okay, more RAM nearly always helps, > but such massive difference in IO-utilization does indicate that you > are starved for cache. > > I noticed you have at least 18 replicas. 
That's a lot. Just to sanity > check: How many replicas are each physical box handling? If they are > sharing resources, fewer replicas would probably be better. > >> 3. Our client timeout is set to 2mins, can they increase further >> more? Would that help or create any other problems? > > It does not hurt the server to increase the client timeout as the > initiated query will keep running until it is finished, independent of > whether or not there is a client to receive the result. > > If you want a better max time for query processing, you should look at > > https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter > but due to its inherent limitations it might not help in your > situation. > >> 4. When we created an empty collection and loaded same data file, >> it loaded fine without any issues so having more documents in a >> collection would create such problems? > > Solr 7 does have a problem with sparse DocValues and many documents, > leading to excessive IO-activity, which might be what you are seeing. I > can see from an earlier post that you were using streaming expressions > for another collection: This is one of the things that are affected by > the Solr 7 DocValues issue. > > More info about DocValues and streaming: > https://issues.apache.org/jira/browse/SOLR-13013 > > Fairly in-depth info on the problem with Solr 7 docValues: > https://issues.apache.org/jira/browse/LUCENE-8374 > > If this is your problem, upgrading to Solr 8 and indexing the > collection from scratch should fix it. > > Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7 > or you can ensure that there are values defined for all DocValues- > fields in all your documents. > >> java.net.SocketTimeoutException: Read timed out >>at java.net.SocketInputStream.socketRead0(Native Method) > ... 
>> Remote error message: java.util.concurrent.TimeoutException: Idle >> timeout expired: 60/60 ms > > There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You > should be able to change it in solr.xml. > https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html > > BUT if an update takes > 10 minutes to be processed, it indicates that > the cluster is overloaded. Increasing the timeout is just a band-aid. > > - Toke Eskildsen, Royal Danish Library > >
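To make Toke's pointer concrete: the idle timeout for distributed updates is configured in the <solrcloud> section of solr.xml. A minimal sketch (600000 ms is the 10-minute default he mentions; raise it only as a stop-gap):

```xml
<solr>
  <solrcloud>
    <!-- socket (read) timeout for distributed updates between nodes, in ms -->
    <int name="distribUpdateSoTimeout">600000</int>
    <!-- connection timeout for distributed updates, in ms -->
    <int name="distribUpdateConnTimeout">60000</int>
  </solrcloud>
</solr>
```

As noted above, if updates actually take this long the real fix is reducing cluster load, not a larger timeout.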
Re: Solr Float/Double multivalues fields
On Fri, 2020-07-03 at 10:00 +0200, Vincenzo D'Amore wrote: > Hi Erick, not sure I got. > Does this mean that the order of values within a multivalued field: > - docValues=true the result will be both re-ordered and deduplicated. > - docValues=false the result order is guaranteed to be maintained for > values in the insertion-order. > > Is this correct? Sorta, but it is not the complete picture. Things get complicated when you mix it with stored, so that you have "stored=true docValues=true". There's an article about that at https://sease.io/2020/03/docvalues-vs-stored-fields-apache-solr-features-and-performance-smackdown.html BTW: The documentation should definitely mention that stored preserves order & duplicates. It is not obvious. - Toke Eskildsen, Royal Danish Library
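To make the difference concrete, a schema sketch (the field names here are made up for illustration). With stored retrieval the values come back exactly as sent; with docValues retrieval (SORTED_SET) they come back sorted and deduplicated:

```xml
<!-- stored: indexing [3.2, 1.5, 3.2] returns [3.2, 1.5, 3.2] -->
<field name="prices_stored" type="pfloat" multiValued="true"
       stored="true" docValues="false"/>

<!-- docValues as stored: the same input comes back as [1.5, 3.2] -->
<field name="prices_dv" type="pfloat" multiValued="true"
       stored="false" docValues="true" useDocValuesAsStored="true"/>
```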
Re: Time-out errors while indexing (Solr 7.7.1)
On Thu, 2020-07-02 at 11:16 +, Kommu, Vinodh K. wrote: > We are performing QA performance testing on couple of collections > which holds 2 billion and 3.5 billion docs respectively. How many shards? > 1. Our performance team noticed that read operations are pretty > more than write operations like 100:1 ratio, is this expected during > indexing or solr nodes are doing any other operations like syncing? Are you saying that there are 100 times more read operations when you are indexing? That does not sound too unrealistic as the disk cache might be filled with the data that the writers are flushing. In that case, more RAM would help. Okay, more RAM nearly always helps, but such massive difference in IO-utilization does indicate that you are starved for cache. I noticed you have at least 18 replicas. That's a lot. Just to sanity check: How many replicas are each physical box handling? If they are sharing resources, fewer replicas would probably be better. > 3. Our client timeout is set to 2mins, can they increase further > more? Would that help or create any other problems? It does not hurt the server to increase the client timeout as the initiated query will keep running until it is finished, independent of whether or not there is a client to receive the result. If you want a better max time for query processing, you should look at https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter but due to its inherent limitations it might not help in your situation. > 4. When we created an empty collection and loaded same data file, > it loaded fine without any issues so having more documents in a > collection would create such problems? Solr 7 does have a problem with sparse DocValues and many documents, leading to excessive IO-activity, which might be what you are seeing. 
I can see from an earlier post that you were using streaming expressions for another collection: This is one of the things that are affected by the Solr 7 DocValues issue. More info about DocValues and streaming: https://issues.apache.org/jira/browse/SOLR-13013 Fairly in-depth info on the problem with Solr 7 docValues: https://issues.apache.org/jira/browse/LUCENE-8374 If this is your problem, upgrading to Solr 8 and indexing the collection from scratch should fix it. Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7 or you can ensure that there are values defined for all DocValues- fields in all your documents. > java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) ... > Remote error message: java.util.concurrent.TimeoutException: Idle > timeout expired: 60/60 ms There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You should be able to change it in solr.xml. https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html BUT if an update takes > 10 minutes to be processed, it indicates that the cluster is overloaded. Increasing the timeout is just a band-aid. - Toke Eskildsen, Royal Danish Library
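For reference, the timeAllowed parameter linked above is just a per-request value in milliseconds; a sketch (collection name is a placeholder):

```text
/solr/mycollection/select?q=*:*&timeAllowed=2000
```

If the time limit is reached, Solr sets partialResults=true in the response header and the result set may be incomplete — one of the inherent limitations mentioned above.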
Re: ***URGENT***Re: Questions about Solr Search
Seriously. Doug answered all of your questions. > On Jul 3, 2020, at 6:12 AM, Atri Sharma wrote: > > Please do not cross post. I believe your questions were already answered? > >> On Fri, Jul 3, 2020 at 3:08 PM Gautam K wrote: >> >> Since it's a bit of an urgent request so if could please help me on this by >> today it will be highly appreciated. >> >> Thanks & Regards, >> Gautam Kanaujia >> >>> On Thu, Jul 2, 2020 at 7:49 PM Gautam K wrote: >>> >>> Dear Team, >>> >>> Hope you all are doing well. >>> >>> Can you please help with the following question? We are using Solr search >>> in our Organisation and now checking whether Solr provides search >>> capabilities like Google Enterprise search(Google Knowledge Graph Search). >>> >>> 1, Does Solr Search provide Voice Search like Google? >>> 2. Does Solar Search provide NLP Search(Natural Language Processing)? >>> 3. Does Solr have all the capabilities which Google Knowledge Graph >>> provides like below? >>> >>> Getting a ranked list of the most notable entities that match certain >>> criteria. >>> Predictively completing entities in a search box. >>> Annotating/organizing content using the Knowledge Graph entities. >>> >>> >>> Your help will be appreciated highly. >>> >>> Many thanks >>> Gautam Kanaujia >>> India > > -- > Regards, > > Atri > Apache Concerted
Re: ***URGENT***Re: Questions about Solr Search
Please do not cross post. I believe your questions were already answered? On Fri, Jul 3, 2020 at 3:08 PM Gautam K wrote: > > Since it's a bit of an urgent request so if could please help me on this by > today it will be highly appreciated. > > Thanks & Regards, > Gautam Kanaujia > > On Thu, Jul 2, 2020 at 7:49 PM Gautam K wrote: >> >> Dear Team, >> >> Hope you all are doing well. >> >> Can you please help with the following question? We are using Solr search in >> our Organisation and now checking whether Solr provides search capabilities >> like Google Enterprise search(Google Knowledge Graph Search). >> >> 1, Does Solr Search provide Voice Search like Google? >> 2. Does Solar Search provide NLP Search(Natural Language Processing)? >> 3. Does Solr have all the capabilities which Google Knowledge Graph provides >> like below? >> >> Getting a ranked list of the most notable entities that match certain >> criteria. >> Predictively completing entities in a search box. >> Annotating/organizing content using the Knowledge Graph entities. >> >> >> Your help will be appreciated highly. >> >> Many thanks >> Gautam Kanaujia >> India -- Regards, Atri Apache Concerted
Re: How to use two search string in a single solr query
Hi Thanks Erick and Walter for your response. Solr Version Used : 6.5.0 I tried to elaborate the issue: Case 1 : Search String : Industrial Electric Oven Results=945 Case 2 : Search String : Dell laptop bags Results=992 In both cases above, mm plays its role (match any 2 words out of 3). Now, I want to search with both strings and mm still playing its role. q=(Industrial Electric Oven) OR (Dell laptop bags) I want mm to still play its role in matching two out of three words in both cases. Ex: Documents containing electric oven, industrial oven, dell bags, laptop bags should be returned. I don't want the documents containing only dell, bags, etc. Also no document containing electric bags. Regards, Tushar Arora On Thu, 2 Jul 2020 at 22:37, Walter Underwood wrote: > First, remove the “mm” parameter from the request handler definition. That > can > be added back in and tweaked later, or just left out. > > Second, you don’t need any query syntax to search for two words. This > query > should work fine: > > books bags > > wunder > Walter Underwood > wun...@wunderwood.org > http://observer.wunderwood.org/ (my blog) > > > On Jul 1, 2020, at 10:22 PM, Tushar Arora wrote: > > Hi, > > I have a scenario with following entry in the request handler (handler1) > of > > solrconfig.xml (defType=edismax is used): > > <str name="qf">description category title^4 demand^0.3</str> > > <str name="mm">2<-1 4<-30%</str> > > When I searched 'bags' as a search string, solr returned 15000 results. > > Query Used : > > > http://localhost:8984/solr/core_name/select?fl=title&indent=on&q=bags&qt=handler1&rows=10&wt=json > > > > And when searched 'books' as a search string, solr returns say 3348 > results. > > Query Used : > > > http://localhost:8984/solr/core_name/select?fl=title&indent=on&q=books&qt=handler1&rows=10&wt=json > > > > I want to use both 'bags' and 'books' as a search string in a single > query. 
> > I used the following query: > > > http://localhost:8984/solr/core_name/select?fl=title&indent=on&q=%22bags%22+OR+%22books%22&qt=handler1&rows=10&wt=json > > But OR operator not working. It will only give 7 results. > > > > > > I even tried this : > > > http://localhost:8984/solr/core_name/select?fl=title&indent=on&q=(bags)+OR+(books)&qt=handler1&rows=10&wt=json > > But it also gives 7 results. > > > > But my concern is to include the result of both 'bags' OR 'books' in a > > single query. > > Is there any way to use two search strings in a single query? > >
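One approach that may work here (a sketch, not tested against 6.5.0): wrap each phrase in its own edismax subquery with the nested-query syntax, so mm applies per clause instead of across the whole q string. The qf/mm values shown are only examples standing in for the handler's own settings:

```text
q=_query_:"{!edismax qf='title^4 demand^0.3' mm=2 v='Industrial Electric Oven'}"
  OR _query_:"{!edismax qf='title^4 demand^0.3' mm=2 v='Dell laptop bags'}"
```

The outer query is parsed by the default lucene parser, which handles the OR, while each {!edismax ...} clause enforces its own minimum-should-match of two terms.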
Re: Solr Float/Double multivalues fields
Hi Erick, not sure I got. Does this mean that the order of values within a multivalued field: - docValues=true the result will be both re-ordered and deduplicated. - docValues=false the result order is guaranteed to be maintained for values in the insertion-order. Is this correct? On Thu, Jul 2, 2020 at 8:37 PM Erick Erickson wrote: > This is true _unless_ you fetch from docValues. docValues are SORTED_SETs, > so the results will be both ordered and deduplicated if you return them > as part of the field list. > > Don’t really think it needs to go into the ref guide, it’s just inherent > in storing > any kind of value. You wouldn’t expect multiple text entries in a > multiValued > field to be rearranged when returning the stored values either. > > Best, > Erick > > > On Jul 2, 2020, at 2:21 PM, Vincenzo D'Amore wrote: > > > > Thanks, and genuinely asking: is there written somewhere in the > > documentation too? If no, could anyone suggest to me which doc page > should > > I try to update? > > > > On Thu, Jul 2, 2020 at 8:08 PM Colvin Cowie > > wrote: > > > >> The order of values within a multivalued field should match the > insertion > >> order. -- we certainly rely on that in our product. > >> > >> Order is guaranteed to be maintained for values in a multi-valued field. > >>> > >> > >> > https://lucene.472066.n3.nabble.com/order-question-on-solr-multi-value-field-tp4027695p4028057.html > >> > >> On Thu, 2 Jul 2020 at 18:52, Vincenzo D'Amore > wrote: > >> > >>> Hi all, > >>> > >>> simple question: Solr float/double multivalue fields preserve the order > >> of > >>> inserted values? > >>> > >>> Best regards, > >>> Vincenzo > >>> > >>> -- > >>> Vincenzo D'Amore > >>> > >> > > > > > > -- > > Vincenzo D'Amore > > -- Vincenzo D'Amore
Changing Response for Group Query - Custom Request Handler
Dear Community, I am currently working on a Solr Custom Plugin, which - for a group query - adds both total matches and number of groups to the response and also keeps the response format as if it is not a group query. One additional requirement is that numFound should contain the number of groups instead of total matches. This whole thing is required since we have some limitations on the consumer side and want to keep the same response format. If I use group.main=true, I get the results in "usual" format, but then group.ngroups=true no longer takes effect. So, I decided to create a custom request handler in order to get the number of groups and create the response format which we need for the requester. This format is basically:

---
{
  // ...
  "response": {
    "numFound": 25, // contains number of grouped results --> for pagination reasons
    "total": 401,   // contains number of total results
    "start": 0,
    "docs": [
      { "id": "26207825" },
      // ...
}
---

My first question would be: is this a good approach with a custom request handler, or is there a better/easier approach to achieve this? I went ahead and (tried) to implement the custom request handler. From a group request (group.format=simple&group.ngroups=true), I can get a "docList" (or docSlice) from the "grouped" part, but basically putting this docList into the response gives me the same result as group.main=true. Therefore I thought I would create a new docSlice with mostly the content from the old docSlice and put this in the response. That works for start=0, but as soon as I start pagination and set start to anything other than zero, I get an error. Because, even though the start is not zero, the docs[] is expected to hold the skipped documents too, but with the document iterator I can only access a subset of it, so my new docSlice would only contain a subset of documents. 
Here is the method for creating a new docSlice:

---
private DocSlice getGroupedDocList(SimpleOrderedMap grouped, int numGroups) {
    DocSlice docSlice = (DocSlice) grouped.get("doclist");
    int offset = docSlice.offset();
    int docSize = docSlice.size();
    int len = offset + docSize;
    boolean hasScore = docSlice.hasScores();
    int[] docs = new int[len];
    float[] scores = new float[len];
    DocIterator iterator = docSlice.iterator();
    int i = 0;
    while (i < len && iterator.hasNext()) {
        docs[i] = iterator.nextDoc();
        LOGGER.error("Doc added: " + docs[i]);
        if (hasScore) {
            scores[i] = iterator.score();
        }
        i++;
    }
    long matches = numGroups;
    float maxScore = docSlice.maxScore();
    DocSlice newDocSlice = new DocSlice(offset, len, docs, scores, matches, maxScore);
    return newDocSlice;
}
---

I am kind of stuck at this point. Can someone maybe help me? Best regards, Deniz C.
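A guess at the cause (hedged, from reading the code in the mail, not verified against Solr internals): DocSlice's backing arrays are addressed starting at `offset`, but the iterator only yields the page's `size()` docs, so when start != 0 the copy loop should begin writing at index `offset` rather than 0. A minimal, self-contained sketch of that indexing with plain arrays, since DocSlice/DocIterator aren't available here — `fillWindow` is a hypothetical stand-in, not a Solr API:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class DocWindowDemo {

    // ids: the page of doc ids the iterator yields (offset already applied upstream).
    // Mirrors DocSlice's convention: the backing array is read from index `offset`.
    static int[] fillWindow(List<Integer> ids, int offset) {
        int len = offset + ids.size();
        int[] docs = new int[len];       // indices [0, offset) stay unused, as in DocSlice
        Iterator<Integer> it = ids.iterator();
        int i = offset;                  // key change vs. the mail's code: start at offset, not 0
        while (i < len && it.hasNext()) {
            docs[i++] = it.next();
        }
        return docs;
    }

    public static void main(String[] args) {
        // page with start=2, rows=3: the three page docs land at indices 2..4
        int[] docs = fillWindow(List.of(10, 11, 12), 2);
        System.out.println(Arrays.toString(docs)); // [0, 0, 10, 11, 12]
    }
}
```

If this matches DocSlice's behavior, the fix in the method above would be initializing `i = offset` before the while loop (and filling `scores` the same way).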
Re: Solr Float/Double multivalues fields
I think this should go in the ref guide. If your product depends on this behaviour, you want reassurance that it isn't going to change in the next release. Not everyone will go looking through the javadoc to see if this is implied. Typically it'll either be something like "are always returned in insertion order" or "are currently returned in insertion order, but your code shouldn't rely on this behaviour because it can change in future releases". That's usually sufficient to make an informed decision on how to handle returned values. If it's different for docValues, that's even more reason to state it clearly in the ref guide to avoid confusion. Best, Thomas Op do 2 jul. 2020 om 20:37 schreef Erick Erickson : > This is true _unless_ you fetch from docValues. docValues are SORTED_SETs, > so the results will be both ordered and deduplicated if you return them > as part of the field list. > > Don’t really think it needs to go into the ref guide, it’s just inherent > in storing > any kind of value. You wouldn’t expect multiple text entries in a > multiValued > field to be rearranged when returning the stored values either. > > Best, > Erick > > > On Jul 2, 2020, at 2:21 PM, Vincenzo D'Amore wrote: > > > > Thanks, and genuinely asking: is there written somewhere in the > > documentation too? If no, could anyone suggest to me which doc page > should > > I try to update? > > > > On Thu, Jul 2, 2020 at 8:08 PM Colvin Cowie > > wrote: > > > >> The order of values within a multivalued field should match the > insertion > >> order. -- we certainly rely on that in our product. > >> > >> Order is guaranteed to be maintained for values in a multi-valued field. > >>> > >> > >> > https://lucene.472066.n3.nabble.com/order-question-on-solr-multi-value-field-tp4027695p4028057.html > >> > >> On Thu, 2 Jul 2020 at 18:52, Vincenzo D'Amore > wrote: > >> > >>> Hi all, > >>> > >>> simple question: Solr float/double multivalue fields preserve the order > >> of > >>> inserted values? 
> >>> > >>> Best regards, > >>> Vincenzo > >>> > >>> -- > >>> Vincenzo D'Amore > >>> > >> > > > > > > -- > > Vincenzo D'Amore > >