Re: maxChars no longer working on CopyField as of 7.7
Thanks Erick. I created SOLR-13699. I agree wrt adding a unit test; that was my thinking as well. I am currently working on a test, and then I will submit my patch.

Thanks,
Chris

On Thu, Aug 15, 2019 at 1:06 PM Erick Erickson wrote:
> Chris:
>
> I certainly don’t see anything in JIRA about this, so please do raise a JIRA,
> especially as you already have a patch!
>
> It’d be great if you added a test case demonstrating this that fails without
> your patch and succeeds after. I’d just add a method to one of the existing
> tests, maybe in solr/core/src/test/org/apache/solr/schema/CopyFieldTest.java?
>
> No need to make this a SolrCloud test.
>
> Best,
> Erick
>
> > On Aug 15, 2019, at 11:31 AM, Chris Troullis wrote:
> >
> > Hi all,
> >
> > We recently upgraded from Solr 7.3 to 8.1, and noticed that the maxChars
> > property on a copy field is no longer functioning as designed. Per the most
> > recent documentation it looks like there have been no intentional changes
> > as to the functionality of this property, so I assume this is a bug.
> >
> > In debugging the issue, it looks like the bug was caused by SOLR-12992. In
> > DocumentBuilder, where the maxChars limit is applied, it first checks if the
> > value is instanceof String. As of SOLR-12992, string values are now coming
> > in as ByteArrayUtf8CharSequence (unless they are above a certain size as
> > defined by JavaBinCodec.MAX_UTF8_SZ), so they are failing the instanceof
> > String check, and the maxChars truncation is not being applied.
> >
> > I went to log a bug but figured I would double check on here first just to
> > confirm that people think that this is actually a bug and I'm not going
> > crazy. Let me know what you think, and I will log the bug.
> >
> > I have implemented a fix which I am currently testing and will be happy to
> > submit a patch, assuming it's agreed that this is not intended behavior.
> >
> > Thanks,
> > Chris
maxChars no longer working on CopyField as of 7.7
Hi all,

We recently upgraded from Solr 7.3 to 8.1, and noticed that the maxChars property on a copy field is no longer functioning as designed. Per the most recent documentation it looks like there have been no intentional changes to the functionality of this property, so I assume this is a bug.

In debugging the issue, it looks like the bug was caused by SOLR-12992. In DocumentBuilder, where the maxChars limit is applied, it first checks if the value is instanceof String. As of SOLR-12992, string values are now coming in as ByteArrayUtf8CharSequence (unless they are above a certain size as defined by JavaBinCodec.MAX_UTF8_SZ), so they are failing the instanceof String check, and the maxChars truncation is not being applied.

I went to log a bug but figured I would double check on here first just to confirm that people think this is actually a bug and I'm not going crazy. Let me know what you think, and I will log the bug.

I have implemented a fix which I am currently testing and will be happy to submit a patch, assuming it's agreed that this is not intended behavior.

Thanks,
Chris
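For reference, the shape of the fix being tested looks roughly like the sketch below. This is an illustrative reduction, not the actual DocumentBuilder patch: the class and method names here are made up, but the core change is the same — widening the instanceof String check to CharSequence so that ByteArrayUtf8CharSequence values get truncated too.

```java
// Illustrative sketch (not the actual Solr patch). Before SOLR-12992, copy
// field values arrived as String; afterwards, small strings arrive as
// ByteArrayUtf8CharSequence, which is a CharSequence but NOT a String, so an
// `instanceof String` check silently skips the maxChars truncation.
public class MaxCharsFix {
    // Hypothetical helper mirroring the maxChars logic in DocumentBuilder.
    static Object applyMaxChars(Object v, int maxChars) {
        // Widening String -> CharSequence covers ByteArrayUtf8CharSequence.
        if (v instanceof CharSequence) {
            CharSequence cs = (CharSequence) v;
            if (cs.length() > maxChars) {
                // Truncate to maxChars characters.
                return cs.subSequence(0, maxChars).toString();
            }
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(applyMaxChars("abcdefghij", 4)); // prints "abcd"
    }
}
```

A unit test along these lines in CopyFieldTest, as Erick suggests, would fail on the `instanceof String` version and pass on the `instanceof CharSequence` version.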
Re: Suggestions for debugging performance issue
FYI to all, just as an update: we rebuilt the index in question from scratch for a second time this weekend and the problem went away on one node, but we were still seeing it on the other node. After restarting the problematic node, the problem went away. Still makes me a little uneasy as we weren't able to determine the cause, but at least we are back to normal query times now.

Chris

On Fri, Jun 15, 2018 at 8:06 AM, Chris Troullis wrote:
> Thanks Shawn,
>
> As mentioned previously, we are hard committing every 60 seconds, which we
> have been doing for years, and have had no issues until enabling CDCR. We
> have never seen large tlog sizes before, and even manually issuing a hard
> commit to the collection does not reduce the size of the tlogs. I believe
> this is because when using the CDCRUpdateLog the tlogs are not purged until
> the docs have been replicated over. Anyway, since we manually purged the
> tlogs they seem to now be staying at an acceptable size, so I don't think
> that is the cause. The documents are not abnormally large, maybe ~20
> string/numeric fields with simple whitespace tokenization.
>
> To answer your questions:
>
> - Solr version: 7.2.1
> - OS vendor and version Solr is running on: CentOS 6
> - Total document count on the server (counting all index cores): 13 collections totaling ~60 million docs
> - Total index size on the server (counting all cores): ~60GB
> - Total of all Solr heaps on the server: 16GB heap (we had to increase for CDCR because it was using a lot more heap)
> - Software other than Solr on the server: no
> - Total memory the server has installed: 64 GB
>
> All of this has been consistent for multiple years across multiple Solr
> versions, and we have only started seeing this issue once we started using
> the CDCRUpdateLog and CDCR, hence why that is the only real thing we can
> point to.
And again, the issue is only affecting 1 of the 13 collections on > the server, so if it was hardware/heap/GC related then I would think we > would be seeing it for every collection, not just one, as they all share > the same resources. > > I will take a look at the GC logs, but I don't think that is the cause. > The consistent nature of the slow performance doesn't really point to GC > issues, and we have profiling set up in New Relic and it does not show any > long/frequent GC pauses. > > We are going to try and rebuild the collection from scratch again this > weekend as that has solved the issue in some lower environments, although > it's not really consistent. At this point it's all we can think of to do. > > Thanks, > > Chris > > > On Thu, Jun 14, 2018 at 6:23 PM, Shawn Heisey wrote: > >> On 6/12/2018 12:06 PM, Chris Troullis wrote: >> > The issue we are seeing is with 1 collection in particular, after we >> set up >> > CDCR, we are getting extremely slow response times when retrieving >> > documents. Debugging the query shows QTime is almost nothing, but the >> > overall responseTime is like 5x what it should be. The problem is >> > exacerbated by larger result sizes. IE retrieving 25 results is almost >> > normal, but 200 results is way slower than normal. I can run the exact >> same >> > query multiple times in a row (so everything should be cached), and I >> still >> > see response times way higher than another environment that is not using >> > CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just >> that >> > we are using the CDCRUpdateLog. The problem started happening even >> before >> > we enabled CDCR. >> > >> > In a lower environment we noticed that the transaction logs were huge >> > (multiple gigs), so we tried stopping solr and deleting the tlogs then >> > restarting, and that seemed to fix the performance issue. 
We tried the same
>> > thing in production the other day but it had no effect, so now I don't
>> > know if it was a coincidence or not.
>>
>> There is one other cause besides CDCR buffering that I know of for huge
>> transaction logs, and it has nothing to do with CDCR: a lack of hard
>> commits. It is strongly recommended to have autoCommit set to a reasonably
>> short interval (about a minute in my opinion, but 15 seconds is VERY
>> common). Most of the time openSearcher should be set to false in the
>> autoCommit config, and other mechanisms (which might include
>> autoSoftCommit) should be used for change visibility. The example
>> autoCommit settings might seem superfluous because they don't affect
>> what's searchable, but it is actually a very important configuration to keep.
Re: Suggestions for debugging performance issue
Thanks Shawn,

As mentioned previously, we are hard committing every 60 seconds, which we have been doing for years, and have had no issues until enabling CDCR. We have never seen large tlog sizes before, and even manually issuing a hard commit to the collection does not reduce the size of the tlogs. I believe this is because when using the CDCRUpdateLog the tlogs are not purged until the docs have been replicated over. Anyway, since we manually purged the tlogs they seem to now be staying at an acceptable size, so I don't think that is the cause. The documents are not abnormally large, maybe ~20 string/numeric fields with simple whitespace tokenization.

To answer your questions:

- Solr version: 7.2.1
- OS vendor and version Solr is running on: CentOS 6
- Total document count on the server (counting all index cores): 13 collections totaling ~60 million docs
- Total index size on the server (counting all cores): ~60GB
- Total of all Solr heaps on the server: 16GB heap (we had to increase for CDCR because it was using a lot more heap)
- Software other than Solr on the server: no
- Total memory the server has installed: 64 GB

All of this has been consistent for multiple years across multiple Solr versions, and we have only started seeing this issue once we started using the CDCRUpdateLog and CDCR, hence why that is the only real thing we can point to. And again, the issue is only affecting 1 of the 13 collections on the server, so if it was hardware/heap/GC related then I would think we would be seeing it for every collection, not just one, as they all share the same resources.

I will take a look at the GC logs, but I don't think that is the cause. The consistent nature of the slow performance doesn't really point to GC issues, and we have profiling set up in New Relic and it does not show any long/frequent GC pauses.
We are going to try and rebuild the collection from scratch again this weekend as that has solved the issue in some lower environments, although it's not really consistent. At this point it's all we can think of to do. Thanks, Chris On Thu, Jun 14, 2018 at 6:23 PM, Shawn Heisey wrote: > On 6/12/2018 12:06 PM, Chris Troullis wrote: > > The issue we are seeing is with 1 collection in particular, after we set > up > > CDCR, we are getting extremely slow response times when retrieving > > documents. Debugging the query shows QTime is almost nothing, but the > > overall responseTime is like 5x what it should be. The problem is > > exacerbated by larger result sizes. IE retrieving 25 results is almost > > normal, but 200 results is way slower than normal. I can run the exact > same > > query multiple times in a row (so everything should be cached), and I > still > > see response times way higher than another environment that is not using > > CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just that > > we are using the CDCRUpdateLog. The problem started happening even before > > we enabled CDCR. > > > > In a lower environment we noticed that the transaction logs were huge > > (multiple gigs), so we tried stopping solr and deleting the tlogs then > > restarting, and that seemed to fix the performance issue. We tried the > same > > thing in production the other day but it had no effect, so now I don't > know > > if it was a coincidence or not. > > There is one other cause besides CDCR buffering that I know of for huge > transaction logs, and it has nothing to do with CDCR: A lack of hard > commits. It is strongly recommended to have autoCommit set to a > reasonably short interval (about a minute in my opinion, but 15 seconds > is VERY common). Most of the time openSearcher should be set to false > in the autoCommit config, and other mechanisms (which might include > autoSoftCommit) should be used for change visibility. 
The example
> autoCommit settings might seem superfluous because they don't affect
> what's searchable, but it is actually a very important configuration to
> keep.
>
> Are the docs in this collection really big, by chance?
>
> As I went through previous threads you've started on the mailing list, I
> have noticed that none of your messages provided some details that would
> be useful for looking into performance problems:
>
> * What OS vendor and version Solr is running on.
> * Total document count on the server (counting all index cores).
> * Total index size on the server (counting all cores).
> * What the total of all Solr heaps on the server is.
> * Whether there is software other than Solr on the server.
> * How much total memory the server has installed.
>
> If you name the OS, I can use that information to help you gather some
> additional info which will actually show me most of that list. Total docum
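Shawn's recommendation corresponds to an updateHandler block along these lines in solrconfig.xml. The intervals are illustrative (maxTime is in milliseconds); the key point is the hard commit with openSearcher=false, with visibility handled by soft commits:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flushes to disk and rolls transaction logs,
       but does not open a new searcher -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: controls when changes become visible to searches -->
  <autoSoftCommit>
    <maxTime>15000</maxTime>
  </autoSoftCommit>
</updateHandler>
```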
Re: Suggestions for debugging performance issue
Hi Susheel,

It's not drastically different, no. There are other collections with more fields and more documents that don't have this issue. And the collection is not sharded: just 1 shard with 2 replicas. Both replicas are similar in response time.

Thanks,
Chris

On Wed, Jun 13, 2018 at 2:37 PM, Susheel Kumar wrote:
> Is this collection in any way drastically different than others in terms of
> schema / # of fields / total documents etc.? Is it sharded, and if so, can
> you look at which shard is taking more time with shards.info=true?
>
> Thnx
> Susheel
>
> On Wed, Jun 13, 2018 at 2:29 PM, Chris Troullis wrote:
> > Thanks Erick,
> >
> > Seems to be a mixed bag in terms of tlog size across all of our indexes,
> > but currently the index with the performance issues has 4 tlog files
> > totaling ~200 MB. This still seems high to me since the collections are
> > in sync, and we hard commit every minute, but it's less than the ~8GB it
> > was before we cleaned them up. Spot checking some other indexes shows
> > some have tlogs >3GB, but none of those indexes are having performance
> > issues (on the same Solr node), so I'm not sure it's related. We have 13
> > collections of various sizes running on our SolrCloud cluster, and none
> > of them seem to have this issue except for this one index, which is not
> > our largest index in terms of size on disk or number of documents.
> >
> > As far as the response intervals, just running a default search (*:*)
> > sorting on our id field so that we get consistent results across
> > environments, and returning 200 results (our max page size in app) with
> > ~20 fields, we see times of ~3.5 seconds in production, compared to ~1
> > second on one of our lower environments with an exact copy of the index.
> > Both have CDCR enabled and have identical clusters.
> >
> > Unfortunately, currently the only instance we are seeing the issue on is
> > production, so we are limited in the tests that we can run.
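For anyone trying Susheel's suggestion, note the parameter is spelled shards.info (plural). A request against a sharded collection would look something like this (host and collection name are placeholders):

```
http://localhost:8983/solr/<collection>/select?q=*:*&rows=0&shards.info=true
```

The response then includes a shards.info section with per-shard timing, so a slow shard stands out at a glance.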
I did confirm
> > in the lower environment that the doc cache is large enough to hold all
> > of the results, and that both the doc and query caches should be serving
> > the results. Obviously in production we have much more indexing going on,
> > but we do utilize autowarming for our caches so our response times are
> > still stable across new searchers.
> >
> > We did move the lower environment to the same ESX host as our production
> > cluster, so that it is getting resources from the same pool (CPU, RAM,
> > etc). The only thing that is different is the disks, but the lower
> > environment is running on slower disks than production. And if it was a
> > disk issue you would think it would be affecting all of the collections,
> > not just this one.
> >
> > It's a mystery!
> >
> > Chris
> >
> > On Wed, Jun 13, 2018 at 10:38 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> > > First, nice job of eliminating all the standard stuff!
> > >
> > > About tlogs: sanity check: they aren't growing again, right? They
> > > should hit a relatively steady state. The tlogs are used as a queueing
> > > mechanism for CDCR to durably store updates until they can
> > > successfully be transmitted to the target. So I'd expect them to hit a
> > > fairly steady number.
> > >
> > > Your lack of CPU/IO spikes is also indicative of something weird,
> > > somehow Solr just sitting around doing nothing. What intervals are we
> > > talking about here for response? 100ms? 5000ms?
> > >
> > > When you hammer the same query over and over, you should see your
> > > queryResultCache hits increase. If that's the case, Solr is doing no
> > > work at all for the search, just assembling the response packet which,
> > > as you say, should be in the documentCache. This assumes it's big
> > > enough to hold all of the docs that are requested by all the
> > > simultaneous requests. The queryResultCache will be flushed every time
> > > a new searcher is opened.
So if you still get your poor
> > > response times, and your queryResultCache hits are increasing, then
> > > Solr is doing pretty much nothing.
> > >
> > > So does this behavior still occur if you aren't adding docs to the
> > > index? If you turn indexing off as a test, that'd be another data
> > > point.
> > >
> > > And, of course, if it's at all possible to just take the CDCR
> > > configuration out of your solrconfig file temporarily, that'd nail
> > > whether CDCR is the culprit or whether it's coincidental.
Re: Suggestions for debugging performance issue
Thanks Erick,

Seems to be a mixed bag in terms of tlog size across all of our indexes, but currently the index with the performance issues has 4 tlog files totaling ~200 MB. This still seems high to me since the collections are in sync, and we hard commit every minute, but it's less than the ~8GB it was before we cleaned them up. Spot checking some other indexes shows some have tlogs >3GB, but none of those indexes are having performance issues (on the same Solr node), so I'm not sure it's related. We have 13 collections of various sizes running on our SolrCloud cluster, and none of them seem to have this issue except for this one index, which is not our largest index in terms of size on disk or number of documents.

As far as the response intervals, just running a default search (*:*) sorting on our id field so that we get consistent results across environments, and returning 200 results (our max page size in app) with ~20 fields, we see times of ~3.5 seconds in production, compared to ~1 second on one of our lower environments with an exact copy of the index. Both have CDCR enabled and have identical clusters.

Unfortunately, currently the only instance we are seeing the issue on is production, so we are limited in the tests that we can run. I did confirm in the lower environment that the doc cache is large enough to hold all of the results, and that both the doc and query caches should be serving the results. Obviously in production we have much more indexing going on, but we do utilize autowarming for our caches so our response times are still stable across new searchers.

We did move the lower environment to the same ESX host as our production cluster, so that it is getting resources from the same pool (CPU, RAM, etc). The only thing that is different is the disks, but the lower environment is running on slower disks than production. And if it was a disk issue you would think it would be affecting all of the collections, not just this one.

It's a mystery!
Chris

On Wed, Jun 13, 2018 at 10:38 AM, Erick Erickson wrote:
> First, nice job of eliminating all the standard stuff!
>
> About tlogs: sanity check: they aren't growing again, right? They should
> hit a relatively steady state. The tlogs are used as a queueing mechanism
> for CDCR to durably store updates until they can successfully be
> transmitted to the target. So I'd expect them to hit a fairly steady
> number.
>
> Your lack of CPU/IO spikes is also indicative of something weird, somehow
> Solr just sitting around doing nothing. What intervals are we talking
> about here for response? 100ms? 5000ms?
>
> When you hammer the same query over and over, you should see your
> queryResultCache hits increase. If that's the case, Solr is doing no work
> at all for the search, just assembling the response packet which, as you
> say, should be in the documentCache. This assumes it's big enough to hold
> all of the docs that are requested by all the simultaneous requests. The
> queryResultCache will be flushed every time a new searcher is opened. So
> if you still get your poor response times, and your queryResultCache hits
> are increasing, then Solr is doing pretty much nothing.
>
> So does this behavior still occur if you aren't adding docs to the index?
> If you turn indexing off as a test, that'd be another data point.
>
> And, of course, if it's at all possible to just take the CDCR
> configuration out of your solrconfig file temporarily, that'd nail
> whether CDCR is the culprit or whether it's coincidental. You say that
> CDCR is the only difference between the environments, but I've certainly
> seen situations where it turns out to be a bad disk controller or
> something that's _also_ different.
>
> Now, assuming all that's inconclusive, I'm afraid the next step would be
> to throw a profiler at it. Maybe pull some stack traces.
>
> Best,
> Erick
>
> On Wed, Jun 13, 2018 at 6:15 AM, Chris Troullis wrote:
> > Thanks Erick.
A little more info:
> >
> > - We do have buffering disabled everywhere, as I had read multiple posts
> > on the mailing list regarding the issue you described.
> > - We soft commit (with openSearcher=true) pretty frequently (15 seconds)
> > as we have some NRT requirements. We hard commit every 60 seconds. We
> > never commit manually, only via the autocommit timers. We have been using
> > these settings for a long time and have never had any issues until
> > recently. And all of our other indexes are fine (some larger than this
> > one).
> > - We do have documentResultCache enabled, although it's not very big. But
> > I can literally spam the same query over and over again with no other
> > queries hitting the box, so all the results should be cached.
> > - We don't see any CPU/IO spikes when running these queries.
Re: Suggestions for debugging performance issue
Thanks Erick. A little more info:

- We do have buffering disabled everywhere, as I had read multiple posts on the mailing list regarding the issue you described.
- We soft commit (with openSearcher=true) pretty frequently (15 seconds) as we have some NRT requirements. We hard commit every 60 seconds. We never commit manually, only via the autocommit timers. We have been using these settings for a long time and have never had any issues until recently. And all of our other indexes are fine (some larger than this one).
- We do have documentResultCache enabled, although it's not very big. But I can literally spam the same query over and over again with no other queries hitting the box, so all the results should be cached.
- We don't see any CPU/IO spikes when running these queries; our load is pretty much flat on all accounts.

I know it seems odd that CDCR would be the culprit, but it's really the only thing we've changed, and we have other environments running the exact same setup with no issues, so it is really making us tear our hair out. And when we cleaned up the huge tlogs it didn't seem to make any difference in the query time (I was originally thinking it was somehow searching through the tlogs for documents, and that's why it was taking so long to retrieve the results, but I don't know if that is actually how it works).

Are you aware of any logger settings we could increase to potentially get a better idea of where the time is being spent? I took the eventual query response and just hosted it as a static file on the same machine via nginx and it downloaded lightning fast (I was trying to rule out the network as the culprit), so it seems like the time is being spent somewhere in Solr.

Thanks,
Chris

On Tue, Jun 12, 2018 at 2:45 PM, Erick Erickson wrote:
> Having the tlogs be huge is a red flag. Do you have buffering enabled
> in CDCR?
This was something of a legacy option that's going to be > removed, it's been made obsolete by the ability of CDCR to bootstrap > the entire index. Buffering should be disabled always. > > Another reason tlogs can grow is if you have very long times between > hard commits. I doubt that's your issue, but just in case. > > And the final reason tlogs can grow is that the connection between > source and target clusters is broken, but that doesn't sound like what > you're seeing either since you say the target cluster is keeping up. > > The process of assembling the response can be long. If you have any > stored fields (and not docValues-enabled), Solr will > 1> seek the stored data on disk > 2> decompress (min 16K blocks) > 3> transmit the thing back to your client > > The decompressed version of the doc will be held in the > documentResultCache configured in solrconfig.xml, so it may or may not > be cached in memory. That said, this stuff is all MemMapped and the > decompression isn't usually an issue, I'd expect you to see very large > CPU spikes and/or I/O contention if that was the case. > > CDCR shouldn't really be that much of a hit, mostly I/O. Solr will > have to look in the tlogs to get you the very most recent copy, so the > first place I'd look is keeping the tlogs under control first. > > The other possibility (again unrelated to CDCR) is if your spikes are > coincident with soft commits or hard-commits-with-opensearcher-true. > > In all, though, none of the usual suspects seems to make sense here > since you say that absent configuring CDCR things seem to run fine. So > I'd look at the tlogs and my commit intervals. Once the tlogs are > under control then move on to other possibilities if the problem > persists... > > Best, > Erick > > > On Tue, Jun 12, 2018 at 11:06 AM, Chris Troullis > wrote: > > Hi all, > > > > Recently we have gone live using CDCR on our 2 node solr cloud cluster > > (7.2.1). 
From a CDCR perspective, everything seems to be working
> > fine: collections are staying in sync across the cluster, everything
> > looks good.
> >
> > The issue we are seeing is with 1 collection in particular: after we set
> > up CDCR, we are getting extremely slow response times when retrieving
> > documents. Debugging the query shows QTime is almost nothing, but the
> > overall response time is like 5x what it should be. The problem is
> > exacerbated by larger result sizes, i.e. retrieving 25 results is almost
> > normal, but 200 results is way slower than normal. I can run the exact
> > same query multiple times in a row (so everything should be cached), and
> > I still see response times way higher than another environment that is
> > not using CDCR. It doesn't seem to matter if CDCR is enabled or
> > disabled, just that we are using the CDCRUpdateLog. The problem started
> > happening even before we enabled CDCR.
Suggestions for debugging performance issue
Hi all,

Recently we have gone live using CDCR on our 2-node SolrCloud cluster (7.2.1). From a CDCR perspective, everything seems to be working fine: collections are staying in sync across the cluster, everything looks good.

The issue we are seeing is with 1 collection in particular: after we set up CDCR, we are getting extremely slow response times when retrieving documents. Debugging the query shows QTime is almost nothing, but the overall response time is like 5x what it should be. The problem is exacerbated by larger result sizes, i.e. retrieving 25 results is almost normal, but 200 results is way slower than normal. I can run the exact same query multiple times in a row (so everything should be cached), and I still see response times way higher than another environment that is not using CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just that we are using the CDCRUpdateLog. The problem started happening even before we enabled CDCR.

In a lower environment we noticed that the transaction logs were huge (multiple gigs), so we tried stopping Solr and deleting the tlogs then restarting, and that seemed to fix the performance issue. We tried the same thing in production the other day but it had no effect, so now I don't know if it was a coincidence or not.

Things that we have tried:

- Completely deleting the collection and rebuilding from scratch
- Running the query directly from the Solr admin UI to eliminate other causes
- Doing a tcpdump on the Solr node to eliminate a network issue

None of these things have yielded any results. It seems very inconsistent: some environments we can reproduce it in, others we can't. Hardware/configuration/network is exactly the same between all environments. The only thing that we have narrowed it down to is we are pretty sure it has something to do with CDCR, as the issue only started when we started using it.
I'm wondering if any of this sparks any ideas from anyone, or if people have suggestions as to how I can figure out what is causing this long query response time. The debug flag on the query seems more geared towards seeing where time is spent in the actual query, which is nothing in my case. The time is spent retrieving the results, which I don't have much information on. I have tried increasing the log level but nothing jumps out at me in the Solr logs. Is there something I can look for specifically to help debug this?

Thanks,
Chris
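One way to separate query time from response assembly and transfer is to compare Solr's reported QTime against the wall-clock time of the whole request. A rough sketch with curl (host, collection, and query are placeholders for your environment):

```
# Wall-clock time for the full round trip, including stored-field
# retrieval, decompression, serialization, and transfer:
curl -s -w 'time_total: %{time_total}s\n' \
  'http://localhost:8983/solr/<collection>/select?q=*:*&sort=id+asc&rows=200' \
  -o response.json

# QTime inside response.json covers only the search itself:
grep -o '"QTime":[0-9]*' response.json
```

A large gap between the two numbers, as described above, points at result retrieval/serialization rather than the query.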
Re: Weird transaction log behavior with CDCR
Hi Amrit, thanks for the reply. I shut down all of the nodes on the source cluster after the buffer was disabled, and there was no change to the tlogs.

On Tue, Apr 17, 2018 at 12:20 PM, Amrit Sarkar <sarkaramr...@gmail.com> wrote:
> Chris,
>
> After disabling the buffer on the source, kindly shut down all the nodes
> of the source cluster first and then start them again. The tlogs will be
> removed accordingly. BTW, CDCR doesn't abide by the 100 numRecordsToKeep
> or 10 numTlogs limits.
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
>
> On Tue, Apr 17, 2018 at 8:58 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
> > DISABLEBUFFER on the source cluster would solve this problem.
> >
> > On Tue, Apr 17, 2018 at 9:29 AM, Chris Troullis <cptroul...@gmail.com> wrote:
> > > Hi,
> > >
> > > We are attempting to use CDCR with Solr 7.2.1 and are experiencing odd
> > > behavior with transaction logs. My understanding is that by default,
> > > Solr will keep a maximum of 10 tlog files or 100 records in the tlogs.
> > > I assume that with CDCR, the records will not be removed from the
> > > tlogs until it has been confirmed that they have been replicated to
> > > the other cluster. However, even when replication has finished and the
> > > CDCR queue sizes are 0, we are still seeing large numbers (50+) and
> > > large sizes (over a GB) of tlogs sitting on the nodes.
> > >
> > > We are hard committing once per minute.
> > >
> > > Doing a lot of reading on the mailing list, I see that a lot of people
> > > were pointing to buffering being enabled as the cause for some of
> > > these transaction log issues. However, we have disabled buffering on
> > > both the source and target clusters, and are still seeing the issues.
> > > > > > Also, while some of our indexes replicate very rapidly (millions of > > > documents in minutes), other smaller indexes are crawling. If we > restart > > > CDCR on the nodes then it finishes almost instantly. > > > > > > Any thoughts on these behaviors? > > > > > > Thanks, > > > > > > Chris > > > > > >
Weird transaction log behavior with CDCR
Hi,

We are attempting to use CDCR with Solr 7.2.1 and are experiencing odd behavior with transaction logs. My understanding is that by default, Solr will keep a maximum of 10 tlog files or 100 records in the tlogs. I assume that with CDCR, the records will not be removed from the tlogs until it has been confirmed that they have been replicated to the other cluster. However, even when replication has finished and the CDCR queue sizes are 0, we are still seeing large numbers (50+) and large sizes (over a GB) of tlogs sitting on the nodes.

We are hard committing once per minute.

Doing a lot of reading on the mailing list, I see that a lot of people were pointing to buffering being enabled as the cause for some of these transaction log issues. However, we have disabled buffering on both the source and target clusters, and are still seeing the issues.

Also, while some of our indexes replicate very rapidly (millions of documents in minutes), other smaller indexes are crawling. If we restart CDCR on the nodes then it finishes almost instantly.

Any thoughts on these behaviors?

Thanks,
Chris
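For anyone following along, the queue sizes and buffer state discussed in this thread can be inspected and changed through the CDCR API in Solr 7.x. Host and collection below are placeholders:

```
# Check replication queue sizes and buffer state on the source cluster:
http://source-host:8983/solr/<collection>/cdcr?action=QUEUES

# Disable buffering so tlogs can be purged once updates are forwarded:
http://source-host:8983/solr/<collection>/cdcr?action=DISABLEBUFFER
```

As noted later in the thread, the source nodes may also need a restart after DISABLEBUFFER before the accumulated tlogs are actually removed.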
Re: CDCR Invalid Number on deletes
Never mind, I found it. The link you posted points me to SOLR-12036 instead
of SOLR-12063 for some reason.

On Tue, Mar 20, 2018 at 1:51 PM, Chris Troullis <cptroul...@gmail.com> wrote:

> Hey Amrit,
>
> Did you happen to see my last reply? Is SOLR-12036 the correct JIRA?
>
> Thanks,
>
> Chris
>
> On Wed, Mar 7, 2018 at 1:52 PM, Chris Troullis <cptroul...@gmail.com> wrote:
>
>> Hey Amrit, thanks for the reply!
>>
>> I checked out SOLR-12036, but it doesn't look like it has to do with
>> CDCR, and the patch that is attached doesn't look CDCR related. Are you
>> sure that's the correct JIRA number?
>>
>> Thanks,
>>
>> Chris
>>
>> On Wed, Mar 7, 2018 at 11:21 AM, Amrit Sarkar <sarkaramr...@gmail.com> wrote:
>>
>>> Hey Chris,
>>>
>>> I figured out a separate issue while working on CDCR which may relate to your
>>> problem. Please see jira: *SOLR-12063*
>>> <https://issues.apache.org/jira/projects/SOLR/issues/SOLR-12063>. This is a
>>> bug that got introduced when we supported the bidirectional approach, where an
>>> extra flag is added to the tlog entry for cdcr.
>>>
>>> This part of the code is where things go wrong,
>>> *UpdateLog.java RecentUpdates::update()*:
>>>
>>> switch (oper) {
>>>   case UpdateLog.ADD:
>>>   case UpdateLog.UPDATE_INPLACE:
>>>   case UpdateLog.DELETE:
>>>   case UpdateLog.DELETE_BY_QUERY:
>>>     Update update = new Update();
>>>     update.log = oldLog;
>>>     update.pointer = reader.position();
>>>     update.version = version;
>>>
>>>     if (oper == UpdateLog.UPDATE_INPLACE && entry.size() == 5) {
>>>       update.previousVersion = (Long) entry.get(UpdateLog.PREV_VERSION_IDX);
>>>     }
>>>     updatesForLog.add(update);
>>>     updates.put(version, update);
>>>
>>>     if (oper == UpdateLog.DELETE_BY_QUERY) {
>>>       deleteByQueryList.add(update);
>>>     } else if (oper == UpdateLog.DELETE) {
>>>       deleteList.add(new DeleteUpdate(version, (byte[])entry.get(entry.size()-1)));
>>>     }
>>>
>>>     break;
>>>
>>>   case UpdateLog.COMMIT:
>>>     break;
>>>   default:
>>>     throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
>>>         "Unknown Operation! " + oper);
>>> }
>>>
>>> The line
>>>
>>> deleteList.add(new DeleteUpdate(version, (byte[])entry.get(entry.size()-1)));
>>>
>>> is expecting the last entry to be the payload, but everywhere else in the
>>> project *pos:[2]* is the index for the payload, while in / after Solr 7.2
>>> the last entry is a *boolean* denoting whether the update is cdcr-forwarded
>>> or typical. UpdateLog.java RecentUpdates is used in cdcr sync and
>>> checkpoint operations, hence it is a legit bug that slipped past the tests
>>> I wrote.
>>>
>>> The immediate fix patch is uploaded and I am awaiting feedback on it.
>>> Meanwhile, if it is possible for you to apply the patch, build the jar and
>>> try it out, please do and let us know.
>>>
>>> For *SOLR-9394* <https://issues.apache.org/jira/browse/SOLR-9394>, if you
>>> can comment on the JIRA and post the sample docs, solr logs, and relevant
>>> information, I can give it a thorough look.
>>>
>>> Amrit Sarkar
>>> Search Engineer
>>> Lucidworks, Inc.
>>> 415-589-9269 >>> www.lucidworks.com >>> Twitter http://twitter.com/lucidworks >>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2 >>> Medium: https://medium.com/@sarkaramrit2 >>> >>> On Wed, Mar 7, 2018 at 1:35 AM, Chris Troullis <cptroul...@gmail.com> >>> wrote: >>> >>> > Hi all, >>> > >>> > We recently upgraded to Solr 7.2.0 as we saw that there were some CDCR >>> bug >>> > fixes and features added that would finally let us be able to make use >>> of >>> > it (bi-directional syncing was the big one). The first time we tried to >>> > implement we ran into all kinds of errors, but this time we were able >>> to >>> > get it mostly working. >>> > >>> > The issue we seem to be having now is that any time a document is >>> deleted >>> > via deleteById from a collection on the primary node, we are flooded >>> with >>
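[Editor's note: the entry-layout mismatch Amrit describes can be made concrete with a small self-contained sketch. This is not Solr's actual code; the entry layout and the operation/version values are simplified from the description above. A pre-7.2 delete entry ends with the id payload, while a 7.2+ CDCR-aware entry appends a boolean flag, so code that blindly reads the last element gets the flag instead of the id bytes.]

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class TlogEntrySketch {
    // Payload index referred to as pos:[2] in the discussion above.
    static final int PAYLOAD_IDX = 2;

    // What the buggy code effectively does: read the last element.
    static Object lastElement(List<Object> entry) {
        return entry.get(entry.size() - 1);
    }

    public static void main(String[] args) {
        byte[] id = "12345".getBytes(StandardCharsets.UTF_8);

        // Pre-7.2 style delete entry: [oper, version, payload]
        List<Object> oldEntry = Arrays.asList(3 /* DELETE */, 1601L, id);

        // 7.2+ style entry: [oper, version, payload, cdcrForwardedFlag]
        List<Object> newEntry = Arrays.asList(3 /* DELETE */, 1601L, id, Boolean.FALSE);

        // On the old layout, "last element" and "payload index" agree.
        System.out.println(lastElement(oldEntry) == oldEntry.get(PAYLOAD_IDX)); // true

        // On the new layout they do not: the last element is the boolean flag,
        // which is why treating it as the id bytes produces garbage deletes.
        System.out.println(lastElement(newEntry)); // false (the flag, not the id)
        System.out.println(newEntry.get(PAYLOAD_IDX) == id); // true
    }
}
```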
Re: CDCR Invalid Number on deletes
Hey Amrit, Did you happen to see my last reply? Is SOLR-12036 the correct JIRA? Thanks, Chris On Wed, Mar 7, 2018 at 1:52 PM, Chris Troullis <cptroul...@gmail.com> wrote: > Hey Amrit, thanks for the reply! > > I checked out SOLR-12036, but it doesn't look like it has to do with CDCR, > and the patch that is attached doesn't look CDCR related. Are you sure > that's the correct JIRA number? > > Thanks, > > Chris > > On Wed, Mar 7, 2018 at 11:21 AM, Amrit Sarkar <sarkaramr...@gmail.com> > wrote: > >> Hey Chris, >> >> I figured a separate issue while working on CDCR which may relate to your >> problem. Please see jira: *SOLR-12063* >> <https://issues.apache.org/jira/projects/SOLR/issues/SOLR-12063>. This >> is a >> bug got introduced when we supported the bidirectional approach where an >> extra flag in tlog entry for cdcr is added. >> >> This part of the code is messing up: >> *UpdateLog.java.RecentUpdates::update()::* >> >> switch (oper) { >> case UpdateLog.ADD: >> case UpdateLog.UPDATE_INPLACE: >> case UpdateLog.DELETE: >> case UpdateLog.DELETE_BY_QUERY: >> Update update = new Update(); >> update.log = oldLog; >> update.pointer = reader.position(); >> update.version = version; >> >> if (oper == UpdateLog.UPDATE_INPLACE && entry.size() == 5) { >> update.previousVersion = (Long) entry.get(UpdateLog.PREV_VERSI >> ON_IDX); >> } >> updatesForLog.add(update); >> updates.put(version, update); >> >> if (oper == UpdateLog.DELETE_BY_QUERY) { >> deleteByQueryList.add(update); >> } else if (oper == UpdateLog.DELETE) { >> deleteList.add(new DeleteUpdate(version, >> (byte[])entry.get(entry.size()-1))); >> } >> >> break; >> >> case UpdateLog.COMMIT: >> break; >> default: >> throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, >> "Unknown Operation! 
" + oper); >> } >> >> deleteList.add(new DeleteUpdate(version, (byte[])entry.get(entry.size() >> -1))); >> >> is expecting the last entry to be the payload, but everywhere in the >> project, *pos:[2] *is the index for the payload, while the last entry in >> source code is *boolean* in / after Solr 7.2, denoting update is cdcr >> forwarded or typical. UpdateLog.java.RecentUpdates is used to in cdcr >> sync, >> checkpoint operations and hence it is a legit bug, slipped the tests I >> wrote. >> >> The immediate fix patch is uploaded and I am awaiting feedback on that. >> Meanwhile if it is possible for you to apply the patch, build the jar and >> try it out, please do and let us know. >> >> For, *SOLR-9394* <https://issues.apache.org/jira/browse/SOLR-9394>, if >> you >> can comment on the JIRA and post the sample docs, solr logs, relevant >> information, I can give it a thorough look. >> >> Amrit Sarkar >> Search Engineer >> Lucidworks, Inc. >> 415-589-9269 >> www.lucidworks.com >> Twitter http://twitter.com/lucidworks >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2 >> Medium: https://medium.com/@sarkaramrit2 >> >> On Wed, Mar 7, 2018 at 1:35 AM, Chris Troullis <cptroul...@gmail.com> >> wrote: >> >> > Hi all, >> > >> > We recently upgraded to Solr 7.2.0 as we saw that there were some CDCR >> bug >> > fixes and features added that would finally let us be able to make use >> of >> > it (bi-directional syncing was the big one). The first time we tried to >> > implement we ran into all kinds of errors, but this time we were able to >> > get it mostly working. >> > >> > The issue we seem to be having now is that any time a document is >> deleted >> > via deleteById from a collection on the primary node, we are flooded >> with >> > "Invalid Number" errors followed by a random sequence of characters when >> > CDCR tries to sync the update to the backup site. 
This happens on all of >> > our collections where our id fields are defined as longs (some of them >> the >> > ids are compound keys and are strings). >> > >> > Here's a sample exception: >> > >> > org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error >> > from server at http://ip/solr/collection_shard1_replica_n1: Invalid >> > Number: ] >> > -s >> > at >> > org.apache.solr.client.solrj.impl.CloudSolrClien
Re: CDCR Invalid Number on deletes
Hey Amrit, thanks for the reply! I checked out SOLR-12036, but it doesn't look like it has to do with CDCR, and the patch that is attached doesn't look CDCR related. Are you sure that's the correct JIRA number? Thanks, Chris On Wed, Mar 7, 2018 at 11:21 AM, Amrit Sarkar <sarkaramr...@gmail.com> wrote: > Hey Chris, > > I figured a separate issue while working on CDCR which may relate to your > problem. Please see jira: *SOLR-12063* > <https://issues.apache.org/jira/projects/SOLR/issues/SOLR-12063>. This is > a > bug got introduced when we supported the bidirectional approach where an > extra flag in tlog entry for cdcr is added. > > This part of the code is messing up: > *UpdateLog.java.RecentUpdates::update()::* > > switch (oper) { > case UpdateLog.ADD: > case UpdateLog.UPDATE_INPLACE: > case UpdateLog.DELETE: > case UpdateLog.DELETE_BY_QUERY: > Update update = new Update(); > update.log = oldLog; > update.pointer = reader.position(); > update.version = version; > > if (oper == UpdateLog.UPDATE_INPLACE && entry.size() == 5) { > update.previousVersion = (Long) entry.get(UpdateLog.PREV_ > VERSION_IDX); > } > updatesForLog.add(update); > updates.put(version, update); > > if (oper == UpdateLog.DELETE_BY_QUERY) { > deleteByQueryList.add(update); > } else if (oper == UpdateLog.DELETE) { > deleteList.add(new DeleteUpdate(version, > (byte[])entry.get(entry.size()-1))); > } > > break; > > case UpdateLog.COMMIT: > break; > default: > throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, > "Unknown Operation! " + oper); > } > > deleteList.add(new DeleteUpdate(version, (byte[])entry.get(entry.size() > -1))); > > is expecting the last entry to be the payload, but everywhere in the > project, *pos:[2] *is the index for the payload, while the last entry in > source code is *boolean* in / after Solr 7.2, denoting update is cdcr > forwarded or typical. 
UpdateLog.java.RecentUpdates is used to in cdcr sync, > checkpoint operations and hence it is a legit bug, slipped the tests I > wrote. > > The immediate fix patch is uploaded and I am awaiting feedback on that. > Meanwhile if it is possible for you to apply the patch, build the jar and > try it out, please do and let us know. > > For, *SOLR-9394* <https://issues.apache.org/jira/browse/SOLR-9394>, if you > can comment on the JIRA and post the sample docs, solr logs, relevant > information, I can give it a thorough look. > > Amrit Sarkar > Search Engineer > Lucidworks, Inc. > 415-589-9269 > www.lucidworks.com > Twitter http://twitter.com/lucidworks > LinkedIn: https://www.linkedin.com/in/sarkaramrit2 > Medium: https://medium.com/@sarkaramrit2 > > On Wed, Mar 7, 2018 at 1:35 AM, Chris Troullis <cptroul...@gmail.com> > wrote: > > > Hi all, > > > > We recently upgraded to Solr 7.2.0 as we saw that there were some CDCR > bug > > fixes and features added that would finally let us be able to make use of > > it (bi-directional syncing was the big one). The first time we tried to > > implement we ran into all kinds of errors, but this time we were able to > > get it mostly working. > > > > The issue we seem to be having now is that any time a document is deleted > > via deleteById from a collection on the primary node, we are flooded with > > "Invalid Number" errors followed by a random sequence of characters when > > CDCR tries to sync the update to the backup site. This happens on all of > > our collections where our id fields are defined as longs (some of them > the > > ids are compound keys and are strings). > > > > Here's a sample exception: > > > > org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error > > from server at http://ip/solr/collection_shard1_replica_n1: Invalid > > Number: ] > > -s > > at > > org.apache.solr.client.solrj.impl.CloudSolrClient. 
> > directUpdate(CloudSolrClient.java:549) > > at > > org.apache.solr.client.solrj.impl.CloudSolrClient. > > sendRequest(CloudSolrClient.java:1012) > > at > > org.apache.solr.client.solrj.impl.CloudSolrClient. > > requestWithRetryOnStaleState(CloudSolrClient.java:883) > > at > > org.apache.solr.client.solrj.impl.CloudSolrClient. > > requestWithRetryOnStaleState(CloudSolrClient.java:945) > > at > > org.apache.solr.client.solrj.impl.CloudSolrClient. > > requestWithRetryOnStaleState(CloudSolrClient.java:945) > > at > > org.apache.solr.client.solrj.imp
CDCR Invalid Number on deletes
Hi all,

We recently upgraded to Solr 7.2.0 as we saw that there were some CDCR bug
fixes and features added that would finally let us be able to make use of it
(bi-directional syncing was the big one). The first time we tried to
implement we ran into all kinds of errors, but this time we were able to get
it mostly working.

The issue we seem to be having now is that any time a document is deleted
via deleteById from a collection on the primary node, we are flooded with
"Invalid Number" errors, followed by a random sequence of characters, when
CDCR tries to sync the update to the backup site. This happens on all of our
collections where our id fields are defined as longs (for some of them the
ids are compound keys and are strings).

Here's a sample exception:

org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error from server at http://ip/solr/collection_shard1_replica_n1: Invalid Number: ]
-s
    at org.apache.solr.client.solrj.impl.CloudSolrClient.directUpdate(CloudSolrClient.java:549)
    at org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1012)
    at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:883)
    at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:945)
    at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:945)
    at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:945)
    at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:945)
    at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:945)
    at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:816)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
    at org.apache.solr.handler.CdcrReplicator.sendRequest(CdcrReplicator.java:140)
    at org.apache.solr.handler.CdcrReplicator.run(CdcrReplicator.java:104)
    at org.apache.solr.handler.CdcrReplicatorScheduler.lambda$null$0(CdcrReplicatorScheduler.java:81)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I'm scratching my head as to the cause of this. It's like it is trying to
deleteById for the value "]", even though that is not the ID for the
document that was deleted from the primary. So I don't know if it is pulling
this from the wrong field somehow, or where that value is coming from. I
found this issue: https://issues.apache.org/jira/browse/SOLR-9394 which
looks related, but it doesn't look like it has any traction.

Has anyone else experienced this issue with CDCR, or have any ideas as to
what could be causing this issue?

Thanks,

Chris
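[Editor's note: a hedged guess at how the misread tlog bytes surface as this particular error. This toy illustration is not Solr's parser; it only shows that when garbage like "]" is handed to a long-typed id field as if it were the document id, numeric parsing fails, which is roughly the "Invalid Number" the replicator reports.]

```java
public class InvalidNumberSketch {
    public static void main(String[] args) {
        // Garbage pulled from the wrong position in the tlog entry.
        String bogusId = "]";
        try {
            // What a long-typed uniqueKey field must ultimately do with the id.
            Long.parseLong(bogusId);
        } catch (NumberFormatException e) {
            // Roughly the error seen in the CDCR replicator logs.
            System.out.println("Invalid Number: " + bogusId);
        }
    }
}
```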
Re: Long blocking during indexing + deleteByQuery
I've noticed something weird since implementing the change Shawn suggested,
and I wonder if someone can shed some light on it. Since changing from a
delete by query on _root_:.. to querying for ids with _root_: and then
deleteById(ids from the root query), we have started to notice that some
facet counts for child document facets do not match the actual query
results.

For example, a facet shows a count of 10; clicking on the facet applies an
FQ with a block join to return parent docs, and the number of results is
less than the facet count, when they should match (the facet count is doing
a unique(_root_) so it is only counting parents).

I suspect that this may somehow be caused by orphaned child documents since
the delete process changed. Does anyone know if changing from a DBQ on
_root_ to the aforementioned query-for-ids plus delete-by-id would cause any
issues with deleting child documents? Trying it manually, it seems to work
fine, but something is going on in some of our test environments.

Thanks,

Chris

On Thu, Nov 9, 2017 at 2:52 PM, Chris Troullis <cptroul...@gmail.com> wrote:

> Thanks Mike, I will experiment with that and see if it does anything for
> this particular issue.
>
> I implemented Shawn's workaround and the problem has gone away, so that is
> good at least for the time being.
>
> Do we think that this is something that should be tracked in JIRA for 6.X?
> Or should I confirm if it is still happening in 7.X before logging anything?
> > On Wed, Nov 8, 2017 at 6:23 AM, Michael McCandless < > luc...@mikemccandless.com> wrote: > >> I'm not sure this is what's affecting you, but you might try upgrading to >> Lucene/Solr 7.1; in 7.0 there were big improvements in using multiple >> threads to resolve deletions: >> http://blog.mikemccandless.com/2017/07/lucene-gets-concurren >> t-deletes-and.html >> >> Mike McCandless >> >> http://blog.mikemccandless.com >> >> On Tue, Nov 7, 2017 at 2:26 PM, Chris Troullis <cptroul...@gmail.com> >> wrote: >> >> > @Erick, I see, thanks for the clarification. >> > >> > @Shawn, Good idea for the workaround! I will try that and see if it >> > resolves the issue. >> > >> > Thanks, >> > >> > Chris >> > >> > On Tue, Nov 7, 2017 at 1:09 PM, Erick Erickson <erickerick...@gmail.com >> > >> > wrote: >> > >> > > bq: you think it is caused by the DBQ deleting a document while a >> > > document with that same ID >> > > >> > > No. I'm saying that DBQ has no idea _if_ that would be the case so >> > > can't carry out the operations in parallel because it _might_ be the >> > > case. >> > > >> > > Shawn: >> > > >> > > IIUC, here's the problem. For deleteById, I can guarantee the >> > > sequencing through the same optimistic locking that regular updates >> > > use (i.e. the _version_ field). But I'm kind of guessing here. >> > > >> > > Best, >> > > Erick >> > > >> > > On Tue, Nov 7, 2017 at 8:51 AM, Shawn Heisey <apa...@elyograg.org> >> > wrote: >> > > > On 11/5/2017 12:20 PM, Chris Troullis wrote: >> > > >> The issue I am seeing is when some >> > > >> threads are adding/updating documents while other threads are >> issuing >> > > >> deletes (using deleteByQuery), solr seems to get into a state of >> > extreme >> > > >> blocking on the replica >> > > > >> > > > The deleteByQuery operation cannot coexist very well with other >> > indexing >> > > > operations. Let me tell you about something I discovered. I think >> > your >> > > > problem is very similar. 
>> > > > >> > > > Solr 4.0 and later is supposed to be able to handle indexing >> operations >> > > > at the same time that the index is being optimized (in Lucene, >> > > > forceMerge). I have some indexes that take about two hours to >> > optimize, >> > > > so having indexing stop while that happens is a less than ideal >> > > > situation. Ongoing indexing is similar in many ways to a merge, >> enough >> > > > that it is handled by the same Merge Scheduler that handles an >> > optimize. >> > > > >> > > > I could indeed add documents to the index without issues at the same >> > > > time as an optimize, but when I would try my full indexing cycle >> while >> > > > an optimize was underway, I found that all operations stopped until >> the >> > > > optimize finished. >> > > > >> > > > Ultimately what was determined (I think it was Yonik that figured it >> > > > out) was that *most* indexing operations can happen during the >> > optimize, >> > > > *except* for deleteByQuery. The deleteById operation works just >> fine. >> > > > >> > > > I do not understand the low-level reasons for this, but apparently >> it's >> > > > not something that can be easily fixed. >> > > > >> > > > A workaround is to send the query you plan to use with >> deleteByQuery as >> > > > a standard query with a limited fl parameter, to retrieve matching >> > > > uniqueKey values from the index, then do a deleteById with that >> list of >> > > > ID values instead. >> > > > >> > > > Thanks, >> > > > Shawn >> > > > >> > > >> > >> > >
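[Editor's note: a possible explanation for the mismatched facet counts, sketched with a toy in-memory model. This is not Solr internals, just the id/_root_ bookkeeping: a DBQ on _root_ removes the parent and all of its children, while deleteById only matches the parent's own id, leaving the children (and their facet contributions) behind unless the child ids are deleted too.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

public class OrphanSketch {
    // A doc is just (id, root); for a parent doc, id equals root.
    record Doc(String id, String root) {}

    // Analogue of deleteByQuery("_root_:<root>"): removes the whole block.
    static List<Doc> deleteByRootQuery(List<Doc> index, String root) {
        List<Doc> out = new ArrayList<>();
        for (Doc d : index) if (!Objects.equals(d.root(), root)) out.add(d);
        return out;
    }

    // Analogue of deleteById(id): removes only the doc with that id.
    static List<Doc> deleteById(List<Doc> index, String id) {
        List<Doc> out = new ArrayList<>();
        for (Doc d : index) if (!Objects.equals(d.id(), id)) out.add(d);
        return out;
    }

    public static void main(String[] args) {
        List<Doc> index = List.of(
            new Doc("p1", "p1"),   // parent
            new Doc("c1", "p1"),   // child of p1
            new Doc("c2", "p1"),   // child of p1
            new Doc("p2", "p2"));  // unrelated parent

        // DBQ on _root_ removes the whole block: only p2 is left.
        System.out.println(deleteByRootQuery(index, "p1").size()); // 1

        // deleteById("p1") removes only the parent: c1 and c2 are orphaned.
        System.out.println(deleteById(index, "p1").size()); // 3
    }
}
```

If this is what is happening, resolving the child ids as well (the _root_ query returns them too) and including them in the deleteById batch would avoid the orphans.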
Re: Long blocking during indexing + deleteByQuery
Thanks Mike, I will experiment with that and see if it does anything for this particular issue. I implemented Shawn's workaround and the problem has gone away, so that is good at least for the time being. Do we think that this is something that should be tracked in JIRA for 6.X? Or should I confirm if it is still happening in 7.X before logging anything? On Wed, Nov 8, 2017 at 6:23 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > I'm not sure this is what's affecting you, but you might try upgrading to > Lucene/Solr 7.1; in 7.0 there were big improvements in using multiple > threads to resolve deletions: > http://blog.mikemccandless.com/2017/07/lucene-gets- > concurrent-deletes-and.html > > Mike McCandless > > http://blog.mikemccandless.com > > On Tue, Nov 7, 2017 at 2:26 PM, Chris Troullis <cptroul...@gmail.com> > wrote: > > > @Erick, I see, thanks for the clarification. > > > > @Shawn, Good idea for the workaround! I will try that and see if it > > resolves the issue. > > > > Thanks, > > > > Chris > > > > On Tue, Nov 7, 2017 at 1:09 PM, Erick Erickson <erickerick...@gmail.com> > > wrote: > > > > > bq: you think it is caused by the DBQ deleting a document while a > > > document with that same ID > > > > > > No. I'm saying that DBQ has no idea _if_ that would be the case so > > > can't carry out the operations in parallel because it _might_ be the > > > case. > > > > > > Shawn: > > > > > > IIUC, here's the problem. For deleteById, I can guarantee the > > > sequencing through the same optimistic locking that regular updates > > > use (i.e. the _version_ field). But I'm kind of guessing here. 
> > > > > > Best, > > > Erick > > > > > > On Tue, Nov 7, 2017 at 8:51 AM, Shawn Heisey <apa...@elyograg.org> > > wrote: > > > > On 11/5/2017 12:20 PM, Chris Troullis wrote: > > > >> The issue I am seeing is when some > > > >> threads are adding/updating documents while other threads are > issuing > > > >> deletes (using deleteByQuery), solr seems to get into a state of > > extreme > > > >> blocking on the replica > > > > > > > > The deleteByQuery operation cannot coexist very well with other > > indexing > > > > operations. Let me tell you about something I discovered. I think > > your > > > > problem is very similar. > > > > > > > > Solr 4.0 and later is supposed to be able to handle indexing > operations > > > > at the same time that the index is being optimized (in Lucene, > > > > forceMerge). I have some indexes that take about two hours to > > optimize, > > > > so having indexing stop while that happens is a less than ideal > > > > situation. Ongoing indexing is similar in many ways to a merge, > enough > > > > that it is handled by the same Merge Scheduler that handles an > > optimize. > > > > > > > > I could indeed add documents to the index without issues at the same > > > > time as an optimize, but when I would try my full indexing cycle > while > > > > an optimize was underway, I found that all operations stopped until > the > > > > optimize finished. > > > > > > > > Ultimately what was determined (I think it was Yonik that figured it > > > > out) was that *most* indexing operations can happen during the > > optimize, > > > > *except* for deleteByQuery. The deleteById operation works just > fine. > > > > > > > > I do not understand the low-level reasons for this, but apparently > it's > > > > not something that can be easily fixed. 
> > > > > > > > A workaround is to send the query you plan to use with deleteByQuery > as > > > > a standard query with a limited fl parameter, to retrieve matching > > > > uniqueKey values from the index, then do a deleteById with that list > of > > > > ID values instead. > > > > > > > > Thanks, > > > > Shawn > > > > > > > > > >
Re: Long blocking during indexing + deleteByQuery
@Erick, I see, thanks for the clarification. @Shawn, Good idea for the workaround! I will try that and see if it resolves the issue. Thanks, Chris On Tue, Nov 7, 2017 at 1:09 PM, Erick Erickson <erickerick...@gmail.com> wrote: > bq: you think it is caused by the DBQ deleting a document while a > document with that same ID > > No. I'm saying that DBQ has no idea _if_ that would be the case so > can't carry out the operations in parallel because it _might_ be the > case. > > Shawn: > > IIUC, here's the problem. For deleteById, I can guarantee the > sequencing through the same optimistic locking that regular updates > use (i.e. the _version_ field). But I'm kind of guessing here. > > Best, > Erick > > On Tue, Nov 7, 2017 at 8:51 AM, Shawn Heisey <apa...@elyograg.org> wrote: > > On 11/5/2017 12:20 PM, Chris Troullis wrote: > >> The issue I am seeing is when some > >> threads are adding/updating documents while other threads are issuing > >> deletes (using deleteByQuery), solr seems to get into a state of extreme > >> blocking on the replica > > > > The deleteByQuery operation cannot coexist very well with other indexing > > operations. Let me tell you about something I discovered. I think your > > problem is very similar. > > > > Solr 4.0 and later is supposed to be able to handle indexing operations > > at the same time that the index is being optimized (in Lucene, > > forceMerge). I have some indexes that take about two hours to optimize, > > so having indexing stop while that happens is a less than ideal > > situation. Ongoing indexing is similar in many ways to a merge, enough > > that it is handled by the same Merge Scheduler that handles an optimize. > > > > I could indeed add documents to the index without issues at the same > > time as an optimize, but when I would try my full indexing cycle while > > an optimize was underway, I found that all operations stopped until the > > optimize finished. 
> > > > Ultimately what was determined (I think it was Yonik that figured it > > out) was that *most* indexing operations can happen during the optimize, > > *except* for deleteByQuery. The deleteById operation works just fine. > > > > I do not understand the low-level reasons for this, but apparently it's > > not something that can be easily fixed. > > > > A workaround is to send the query you plan to use with deleteByQuery as > > a standard query with a limited fl parameter, to retrieve matching > > uniqueKey values from the index, then do a deleteById with that list of > > ID values instead. > > > > Thanks, > > Shawn > > >
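[Editor's note: the workaround Shawn describes can be sketched as a two-step pattern. These are toy in-memory stand-ins for the query and delete calls, not SolrJ; with real Solr you would run the would-be delete query with a limited fl parameter to fetch uniqueKey values, then batch the deleteById calls.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class DbqWorkaroundSketch {
    record Doc(String id, long modified) {}

    // Step 1: run the would-be delete query as a normal query, keeping only ids.
    static List<String> queryIds(List<Doc> index, Predicate<Doc> query) {
        List<String> ids = new ArrayList<>();
        for (Doc d : index) if (query.test(d)) ids.add(d.id());
        return ids;
    }

    // Step 2: delete by those ids, which sequences cleanly with concurrent
    // adds via the same optimistic locking that regular updates use.
    static List<Doc> deleteByIds(List<Doc> index, List<String> ids) {
        List<Doc> out = new ArrayList<>();
        for (Doc d : index) if (!ids.contains(d.id())) out.add(d);
        return out;
    }

    public static void main(String[] args) {
        List<Doc> index = List.of(new Doc("a", 1), new Doc("b", 5), new Doc("c", 9));

        // Instead of a deleteByQuery on "modified:[* TO 4]", resolve ids first...
        List<String> stale = queryIds(index, d -> d.modified() <= 4);
        System.out.println(stale); // [a]

        // ...then deleteById with the resolved list.
        System.out.println(deleteByIds(index, stale).size()); // 2
    }
}
```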
Re: Long blocking during indexing + deleteByQuery
If I am understanding you correctly, you think it is caused by the DBQ
deleting a document while a document with that same ID is being updated by
another thread? I'm not sure that is what is happening here, as we only
delete docs if they no longer exist in the DB, so nothing should be
adding/updating a doc with that ID if it is marked for deletion, as we don't
reuse IDs. I will double check though to confirm.

Also, not sure if relevant, but the DBQ itself returns very quickly, in a
matter of ms; it's the updates that block for a huge amount of time.

On Tue, Nov 7, 2017 at 11:08 AM, Amrit Sarkar <sarkaramr...@gmail.com> wrote:

> Maybe not a relevant fact on this, but: "addAndDelete" is triggered by
> *reordering of DBQs*; that means there are non-executed DBQs present in the
> updateLog and an add operation is also received. Solr makes sure DBQs are
> executed first and then the add operation is executed.
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
>
> On Tue, Nov 7, 2017 at 9:19 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > Well, consider what happens here.
> >
> > Solr gets a DBQ that includes document 132 and 10,000,000 other docs.
> > Solr gets an add for document 132.
> >
> > The DBQ takes time to execute. If it was processing the requests in
> > parallel, would 132 be in the index after the delete was over? It would
> > depend on when the DBQ found the doc relative to the add. With this
> > sequence one would expect 132 to be in the index at the end.
> >
> > And it's worse when it comes to distributed indexes. If the updates
> > were sent out in parallel you could end up in situations where one
> > replica contained 132 and another didn't, depending on the vagaries of
> > thread execution.
> >
> > Now I didn't write the DBQ code, but that's what I think is happening.
> > > > Best, > > Erick > > > > On Tue, Nov 7, 2017 at 7:40 AM, Chris Troullis <cptroul...@gmail.com> > > wrote: > > > As an update, I have confirmed that it doesn't seem to have anything to > > do > > > with child documents, or standard deletes, just deleteByQuery. If I do > a > > > deleteByQuery on any collection while also adding/updating in separate > > > threads I am experiencing this blocking behavior on the non-leader > > replica. > > > > > > Has anyone else experienced this/have any thoughts on what to try? > > > > > > On Sun, Nov 5, 2017 at 2:20 PM, Chris Troullis <cptroul...@gmail.com> > > wrote: > > > > > >> Hi, > > >> > > >> I am experiencing an issue where threads are blocking for an extremely > > >> long time when I am indexing while deleteByQuery is also running. > > >> > > >> Setup info: > > >> -Solr Cloud 6.6.0 > > >> -Simple 2 Node, 1 Shard, 2 replica setup > > >> -~12 million docs in the collection in question > > >> -Nodes have 64 GB RAM, 8 CPUs, spinning disks > > >> -Soft commit interval 10 seconds, Hard commit (open searcher false) 60 > > >> seconds > > >> -Default merge policy settings (Which I think is 10/10). > > >> > > >> We have a query heavy index heavyish use case. Indexing is constantly > > >> running throughout the day and can be bursty. The indexing process > > handles > > >> both updates and deletes, can spin up to 15 simultaneous threads, and > > sends > > >> to solr in batches of 3000 (seems to be the optimal number per trial > and > > >> error). > > >> > > >> I can build the entire collection from scratch using this method in < > 40 > > >> mins and indexing is in general super fast (averages about 3 seconds > to > > >> send a batch of 3000 docs to solr). 
The issue I am seeing is when some > > >> threads are adding/updating documents while other threads are issuing > > >> deletes (using deleteByQuery), solr seems to get into a state of > extreme > > >> blocking on the replica, which results in some threads taking 30+ > > minutes > > >> just to send 1 batch of 3000 docs. This collection does use child > > documents > > >> (hence the delete by query _root_), not sure if that makes a > > difference, I > > >> am trying to duplicate on a non-child doc collection. CPU/IO wait > seems &g
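[Editor's note: Erick's point about why the DBQ and the add cannot run in parallel can be illustrated with a toy reorder. This is not Solr code; the doc ids are the hypothetical ones from his example. The final state depends entirely on whether the add of doc 132 lands before or after the delete query is evaluated, so the two must be serialized to keep replicas consistent.]

```java
import java.util.HashSet;
import java.util.Set;

public class DbqOrderingSketch {
    // A "delete by query" that matches every doc currently in the index.
    static void deleteAll(Set<Integer> index) {
        index.clear();
    }

    public static void main(String[] args) {
        // Serialized order A: DBQ first, then add 132 -> 132 survives.
        Set<Integer> a = new HashSet<>(Set.of(1, 2, 132));
        deleteAll(a);
        a.add(132);
        System.out.println(a.contains(132)); // true

        // Serialized order B: add first, then DBQ -> 132 is gone.
        Set<Integer> b = new HashSet<>(Set.of(1, 2));
        b.add(132);
        deleteAll(b);
        System.out.println(b.contains(132)); // false

        // Running the two "in parallel" could yield either outcome on each
        // replica, which is why Solr applies reordered DBQs before subsequent
        // adds instead of letting them race.
    }
}
```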
Re: Long blocking during indexing + deleteByQuery
As an update, I have confirmed that it doesn't seem to have anything to do
with child documents, or standard deletes, just deleteByQuery. If I do a
deleteByQuery on any collection while also adding/updating in separate
threads, I experience this blocking behavior on the non-leader replica.

Has anyone else experienced this, or have any thoughts on what to try?

On Sun, Nov 5, 2017 at 2:20 PM, Chris Troullis <cptroul...@gmail.com> wrote:

> Hi,
>
> I am experiencing an issue where threads are blocking for an extremely
> long time when I am indexing while deleteByQuery is also running.
>
> Setup info:
> - Solr Cloud 6.6.0
> - Simple 2 node, 1 shard, 2 replica setup
> - ~12 million docs in the collection in question
> - Nodes have 64 GB RAM, 8 CPUs, spinning disks
> - Soft commit interval 10 seconds, hard commit (openSearcher false) 60 seconds
> - Default merge policy settings (which I think is 10/10)
>
> We have a query-heavy, somewhat index-heavy use case. Indexing is
> constantly running throughout the day and can be bursty. The indexing
> process handles both updates and deletes, can spin up to 15 simultaneous
> threads, and sends to solr in batches of 3000 (which seems to be the
> optimal number per trial and error).
>
> I can build the entire collection from scratch using this method in < 40
> mins, and indexing is in general super fast (averages about 3 seconds to
> send a batch of 3000 docs to solr). The issue I am seeing is when some
> threads are adding/updating documents while other threads are issuing
> deletes (using deleteByQuery), solr seems to get into a state of extreme
> blocking on the replica, which results in some threads taking 30+ minutes
> just to send 1 batch of 3000 docs. This collection does use child documents
> (hence the delete by query on _root_); not sure if that makes a difference,
> I am trying to duplicate it on a non-child-doc collection. CPU/IO wait
> seems minimal on both nodes, so not sure what is causing the blocking.
> > Here is part of the stack trace on one of the blocked threads on the > replica: > > qtp592179046-576 (576) > java.lang.Object@608fe9b5 > org.apache.solr.update.DirectUpdateHandler2.addAndDelete( > DirectUpdateHandler2.java:354) > org.apache.solr.update.DirectUpdateHandler2.addDoc0( > DirectUpdateHandler2.java:237) > org.apache.solr.update.DirectUpdateHandler2.addDoc( > DirectUpdateHandler2.java:194) > org.apache.solr.update.processor.RunUpdateProcessor.processAdd( > RunUpdateProcessorFactory.java:67) > org.apache.solr.update.processor.UpdateRequestProcessor.processAdd( > UpdateRequestProcessor.java:55) > org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd( > DistributedUpdateProcessor.java:979) > org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd( > DistributedUpdateProcessor.java:1192) > org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd( > DistributedUpdateProcessor.java:748) > org.apache.solr.handler.loader.JavabinLoader$1.update > (JavabinLoader.java:98) > org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1. > readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:180) > org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1. > readIterator(JavaBinUpdateRequestCodec.java:136) > org.apache.solr.common.util.JavaBinCodec.readObject( > JavaBinCodec.java:306) > org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:251) > org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1. 
> readNamedList(JavaBinUpdateRequestCodec.java:122) > org.apache.solr.common.util.JavaBinCodec.readObject( > JavaBinCodec.java:271) > org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:251) > org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:173) > org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal( > JavaBinUpdateRequestCodec.java:187) > org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs( > JavabinLoader.java:108) > org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:55) > org.apache.solr.handler.UpdateRequestHandler$1.load( > UpdateRequestHandler.java:97) > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody( > ContentStreamHandlerBase.java:68) > org.apache.solr.handler.RequestHandlerBase.handleRequest( > RequestHandlerBase.java:173) > org.apache.solr.core.SolrCore.execute(SolrCore.java:2477) > org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723) > org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529) > > A cursory search led me to this JIRA: https://issues.apache.org/jira/browse/SOLR-7836, not sure if related though. > > Can anyone shed some light on this issue? We don't do deletes very > frequently, but it is bringing Solr to its knees when we do, which is > causing some big problems. > > Thanks, > > Chris >
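A workaround often suggested for deleteByQuery contention (it is not proposed in this thread, so treat it as an assumption) is to first resolve the delete query to document ids and then issue delete-by-id requests, which do not serialize against concurrent adds the way deleteByQuery does in `DirectUpdateHandler2.addAndDelete`. A minimal sketch of the batching half of that idea, reusing the thread's batch size of 3000; the actual SolrJ call is left as a comment:

```java
import java.util.ArrayList;
import java.util.List;

public class DeleteByIdBatcher {
    /** Split a list of matched document ids into fixed-size batches. */
    static List<List<String>> partition(List<String> ids, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            batches.add(ids.subList(i, Math.min(i + batchSize, ids.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Hypothetical ids, as if resolved by running the delete query with fl=id.
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 7000; i++) ids.add("doc" + i);

        List<List<String>> batches = partition(ids, 3000);
        System.out.println(batches.size());        // 3 delete-by-id requests
        System.out.println(batches.get(2).size()); // final partial batch of 1000
        // For each batch one would then issue something like:
        // solrClient.deleteById(collection, batch);  // SolrJ call, not shown here
    }
}
```

This trades one deleteByQuery for a query plus several delete-by-id requests, so it is only worth it when the blocking cost dominates.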
Long blocking during indexing + deleteByQuery
Hi, I am experiencing an issue where threads block for an extremely long time when I am indexing while deleteByQuery is also running.

Setup info:
- Solr Cloud 6.6.0
- Simple 2-node, 1-shard, 2-replica setup
- ~12 million docs in the collection in question
- Nodes have 64 GB RAM, 8 CPUs, spinning disks
- Soft commit interval 10 seconds, hard commit (openSearcher=false) 60 seconds
- Default merge policy settings (which I think is 10/10)

We have a query-heavy, somewhat index-heavy use case. Indexing runs constantly throughout the day and can be bursty. The indexing process handles both updates and deletes, can spin up to 15 simultaneous threads, and sends to Solr in batches of 3000 (which seems to be the optimal number per trial and error). I can build the entire collection from scratch using this method in under 40 minutes, and indexing is in general super fast (it averages about 3 seconds to send a batch of 3000 docs to Solr).

The issue I am seeing is that when some threads are adding/updating documents while other threads are issuing deletes (using deleteByQuery), Solr seems to get into a state of extreme blocking on the replica, which results in some threads taking 30+ minutes just to send one batch of 3000 docs. This collection does use child documents (hence the deleteByQuery on _root_); I'm not sure if that makes a difference, and I am trying to duplicate the problem on a non-child-doc collection. CPU/IO wait seems minimal on both nodes, so I'm not sure what is causing the blocking.
Here is part of the stack trace on one of the blocked threads on the replica:

qtp592179046-576 (576)
java.lang.Object@608fe9b5
org.apache.solr.update.DirectUpdateHandler2.addAndDelete(DirectUpdateHandler2.java:354)
org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:237)
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:194)
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:979)
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1192)
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:748)
org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:98)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:180)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:136)
org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:306)
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:251)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:122)
org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:271)
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:251)
org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:173)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:187)
org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:108)
org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:55)
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)

A cursory search led me to this JIRA: https://issues.apache.org/jira/browse/SOLR-7836, not sure if related though. Can anyone shed some light on this issue? We don't do deletes very frequently, but it is bringing Solr to its knees when we do, which is causing some big problems. Thanks, Chris
Re: Inconsistency in results between replicas using CloudSolrClient
Thanks for the reply Erick, I feared that would be the case. Interesting idea with using the fq but not sure I like the performance implications. I will see how big of a deal it will be in practice, I was just thinking about this as a hypothetical scenario today, and as you said, we have a lot of automated tests so I anticipate this likely causing issues. I'll give it some more thought and see if I can come up with any other workarounds. -Chris On Tue, Aug 1, 2017 at 5:38 PM, Erick Erickson <erickerick...@gmail.com> wrote: > Your understanding is correct. > > As for how people cope? Mostly they ignore it. The actual number of > times people notice this is usually quite small, mostly it surfaces > when automated test suites are run. > > If you must lock this up, and you can stand the latency you could add > a timestamp for each document and auto-add an FQ clause like: > fq=timestamp:[* TO NOW-soft_commit_interval_plus_some_windage] > > Note, though, that this is not an fq clause that can be re-used, see: > https://lucidworks.com/2012/02/23/date-math-now-and-filter-queries/ so > either it'd be something like: > fq=timestamp:[* TO NOW/MINUTE-soft_commit_interval_plus_some_windage] > or > fq={!cache=false}timestamp:[* TO NOW-soft_commit_interval_plus_some_windage] > > and would inevitably make the latency between when something was > indexed and available for search longer. > > You can also reduce your soft commit interval to something short, but > that has other problems. > > see: SOLR-6606, but it looks like other priorities have gotten in the > way of it being committed. > > Best, > Erick > > On Tue, Aug 1, 2017 at 1:50 PM, Chris Troullis <cptroul...@gmail.com> > wrote: > > Hi, > > > > I think I know the answer to this question, but just wanted to verify/see > > what other people do to address this concern. > > > > I have a Solr Cloud setup (6.6.0) with 2 nodes, 1 collection with 1 shard > > and 2 replicas (1 replica per node).
The nature of my use case requires > > frequent updates to Solr, and documents are being added constantly > > throughout the day. I am using CloudSolrClient via SolrJ to query my > > collection and load balance across my 2 replicas. > > > > Here's my question: > > > > As I understand it, because of the nature of Solr Cloud (eventual > > consistency), and the fact that the soft commit timings on the 2 replicas > > will not necessarily be in sync, would it not be possible to run into a > > scenario where, say a document gets indexed on replica 1 right before a > > soft commit, but indexed on replica 2 right after a soft commit? In this > > scenario, using the load balanced CloudSolrClient, wouldn't it be > possible > > for a user to do a search, see the newly added document because they got > > sent to replica 1, and then search again, and the newly added document > > would disappear from their results since they got sent to replica 2 and > the > > soft commit hasn't happened yet? > > > > If so, how do people typically handle this scenario in NRT search cases? > It > > seems like a poor user experience if things keep disappearing and > > reappearing from their search results randomly. Currently the only > thought > > I have to prevent this is to write (or extend) my own solr client to > stick > > a user's session to a specific replica (unless it goes down), but still > > load balance users between the replicas. But of course then I have to > > manage all of the things CloudSolrClient manages manually re: cluster > > state, etc. > > > > Can anyone confirm/deny my understanding of how this works/offer any > > suggestions to eliminate the scenario in question from occurring? > > > > Thanks, > > > > Chris >
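Erick's timestamp-filter workaround can be wrapped in a small helper. This is a sketch under assumptions from the thread: every document has a `timestamp` field set at index time, the soft commit interval is 60 seconds, and 10 seconds of extra "windage" covers replica skew. The `{!cache=false}` local param keeps Solr from caching a filter whose value changes on every request:

```java
public class ConsistencyFilter {
    /**
     * Build a non-cached filter query that hides documents indexed within the
     * last (softCommitSecs + windageSecs) seconds, so that every replica has
     * had a chance to open a searcher containing anything the query can match.
     */
    static String buildFq(int softCommitSecs, int windageSecs) {
        int lag = softCommitSecs + windageSecs;
        return "{!cache=false}timestamp:[* TO NOW-" + lag + "SECONDS]";
    }

    public static void main(String[] args) {
        // Thread's numbers: 60s soft commit plus 10s of windage.
        System.out.println(buildFq(60, 10));
        // prints: {!cache=false}timestamp:[* TO NOW-70SECONDS]
    }
}
```

The string would then be passed as an `fq` parameter (e.g. `solrQuery.addFilterQuery(...)` in SolrJ). As Erick notes, this deliberately adds that lag to document visibility.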
Inconsistency in results between replicas using CloudSolrClient
Hi, I think I know the answer to this question, but just wanted to verify/see what other people do to address this concern. I have a Solr Cloud setup (6.6.0) with 2 nodes, 1 collection with 1 shard and 2 replicas (1 replica per node). The nature of my use case requires frequent updates to Solr, and documents are being added constantly throughout the day. I am using CloudSolrClient via SolrJ to query my collection and load balance across my 2 replicas. Here's my question: As I understand it, because of the nature of Solr Cloud (eventual consistency), and the fact that the soft commit timings on the 2 replicas will not necessarily be in sync, would it not be possible to run into a scenario where, say a document gets indexed on replica 1 right before a soft commit, but indexed on replica 2 right after a soft commit? In this scenario, using the load balanced CloudSolrClient, wouldn't it be possible for a user to do a search, see the newly added document because they got sent to replica 1, and then search again, and the newly added document would disappear from their results since they got sent to replica 2 and the soft commit hasn't happened yet? If so, how do people typically handle this scenario in NRT search cases? It seems like a poor user experience if things keep disappearing and reappearing from their search results randomly. Currently the only thought I have to prevent this is to write (or extend) my own solr client to stick a user's session to a specific replica (unless it goes down), but still load balance users between the replicas. But of course then I have to manage all of the things CloudSolrClient manages manually re: cluster state, etc. Can anyone confirm/deny my understanding of how this works/offer any suggestions to eliminate the scenario in question from occurring? Thanks, Chris
Re: Seeing odd behavior with implicit routing
Shalin, Thanks for the response and explanation! I logged a JIRA per your request here: https://issues.apache.org/jira/browse/SOLR-10695 Chris On Mon, May 15, 2017 at 3:40 AM, Shalin Shekhar Mangar < shalinman...@gmail.com> wrote: > On Sun, May 14, 2017 at 7:40 PM, Chris Troullis <cptroul...@gmail.com> > wrote: > > Hi, > > > > I've been experimenting with various sharding strategies with Solr cloud > > (6.5.1), and am seeing some odd behavior when using the implicit router. > I > > am probably either doing something wrong or misinterpreting what I am > > seeing in the logs, but if someone could help clarify that would be > awesome. > > > > I created a collection using the implicit router, created 10 shards, > named > > shard1, shard2, etc. I indexed 3000 documents to each shard, routed by > > setting the _route_ field on the documents in my schema. All works fine, > I > > verified there are 3000 documents in each shard. > > > > The odd behavior I am seeing is when I try to route a query to a specific > > shard. I submitted a simple query to shard1 using the request parameter > > _route_=shard1. The query comes back fine, but when I looked in the logs, > > it looked like it was issuing 3 separate requests: > > > > 1. The original query to shard1 > > 2. A 2nd query to shard1 with the parameter ids=a bunch of document ids > > 3. The original query to a random shard (changes every time I run the > query) > > > > It looks like the first query is getting back a list of ids, and the 2nd > > query is retrieving the documents for those ids? I assume this is some > solr > > cloud implementation detail. > > > > What I don't understand is the 3rd query. Why is it issuing the original > > query to a random shard every time, when I am specifying the _route_? The > > _route_ parameter is definitely doing something, because if I remove it, > it > > is querying all shards (which I would expect). > > > > Any ideas? I can provide the actual queries from the logs if required. 
> > How many nodes is this collection distributed across? I suspect that > you are using a single node for experimentation? > > What happens with the _route_=shard1 parameter and implicit routing is > that the _route_ parameter is resolved to a list of replicas of > shard1. But SolrJ uses only the node name of the replica along with > the collection name to make the request (this is important, we'll come > back to this later). So, ordinarily, that node hosts a single shard > (shard1) and when it receives the request, it will optimize the search > to go down the non-distributed code path (since the replica has all the > data needed to satisfy the search). > > But interesting things happen when the node hosts more than one shard > (say shard1 and shard3 both). When we query such a node using just the > collection name, the collection name can be resolved to either shard1 > or shard3 -- this is picked randomly without looking at the _route_ > parameter at all. If shard3 is picked, it looks at the request, sees > that it doesn't have all the necessary data and decides to follow the > two-phase distributed search path where phase 1 is to get the ids and > scores of the documents matching the query from all participating > shards (the list of such shards is limited by the _route_ parameter, which > in our case will be only shard1) and a second phase where we get the > actual stored fields to be returned to the user. So you get three > queries in the log: 1) phase 1 of distributed search hitting shard1, > 2) phase 2 of distributed search hitting shard1 and 3) the > distributed scatter-gather search run by shard3. > > So to recap, this is happening because you have more than one shard > hosted on a node. An easy workaround is to have each shard hosted on a > unique node. But we can improve things on the Solr side as well by 1) > having SolrJ resolve requests down to node name and core name, 2) > having the collection-name-to-core-name resolution take the _route_ param > into account.
Both 1 and 2 can solve the problem. Can you please open > a Jira issue? > > > > > Thanks, > > > > Chris > > > > -- > Regards, > Shalin Shekhar Mangar. >
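Shalin's explanation can be modeled with a toy resolver: a node hosting both shard1 and shard3 picks one of its local shards at random when given only the collection name, and only the proposed fix (consulting _route_ during resolution) makes the choice deterministic. The shard names and hosting layout below are illustrative, not Solr's actual internals:

```java
import java.util.List;

public class RouteResolution {
    /**
     * Toy model: resolve a request arriving at a node to one of the shards
     * that node hosts. With honorRoute=false (current behavior in the thread),
     * the _route_ param is ignored and any local shard may be picked, which is
     * what triggers the extra scatter-gather query on shard3.
     */
    static String resolveShard(List<String> localShards, String route, boolean honorRoute) {
        if (honorRoute && localShards.contains(route)) {
            return route; // proposed fix: resolution takes _route_ into account
        }
        // current behavior: random pick among locally hosted shards
        return localShards.get((int) (Math.random() * localShards.size()));
    }

    public static void main(String[] args) {
        List<String> nodeShards = List.of("shard1", "shard3"); // node hosting two shards
        System.out.println(resolveShard(nodeShards, "shard1", true));  // always shard1
        System.out.println(resolveShard(nodeShards, "shard1", false)); // shard1 OR shard3
    }
}
```

The easy workaround from the reply, one shard per node, corresponds to `localShards` always being a single-element list, where both code paths agree.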
Seeing odd behavior with implicit routing
Hi, I've been experimenting with various sharding strategies with Solr cloud (6.5.1), and am seeing some odd behavior when using the implicit router. I am probably either doing something wrong or misinterpreting what I am seeing in the logs, but if someone could help clarify that would be awesome. I created a collection using the implicit router, created 10 shards, named shard1, shard2, etc. I indexed 3000 documents to each shard, routed by setting the _route_ field on the documents in my schema. All works fine, I verified there are 3000 documents in each shard. The odd behavior I am seeing is when I try to route a query to a specific shard. I submitted a simple query to shard1 using the request parameter _route_=shard1. The query comes back fine, but when I looked in the logs, it looked like it was issuing 3 separate requests: 1. The original query to shard1 2. A 2nd query to shard1 with the parameter ids=a bunch of document ids 3. The original query to a random shard (changes every time I run the query) It looks like the first query is getting back a list of ids, and the 2nd query is retrieving the documents for those ids? I assume this is some solr cloud implementation detail. What I don't understand is the 3rd query. Why is it issuing the original query to a random shard every time, when I am specifying the _route_? The _route_ parameter is definitely doing something, because if I remove it, it is querying all shards (which I would expect). Any ideas? I can provide the actual queries from the logs if required. Thanks, Chris
Re: Multiple collections vs multiple shards for multitenancy
Thanks for the great advice Erick. I will experiment with your suggestions and see how it goes! Chris On Sun, May 7, 2017 at 12:34 AM, Erick Erickson <erickerick...@gmail.com> wrote: > Well, you've been doing your homework ;). > > bq: I am a little confused on this statement you made: > > > Plus you can't commit > > individually, a commit on one will _still_ commit on all so you're > > right back where you started. > > Never mind. autocommit kicks off on a per replica basis. IOW, when a > new doc is indexed to a shard (really, any replica) the timer is > started. So if replica 1_1 gets a doc and replica 2_1 doesn't, there > is no commit on replica 2_1. My comment was mainly directed at the > idea that you might issue commits from the client, which are > distributed to all replicas. However, even in that case a replica > that has received no updates won't do anything. > > About the hybrid approach. I've seen situations where essentially you > partition clients along "size" lines. So something like "put clients > on a shared single-shard collection as long as the aggregate number of > records is < X". The theory is that the update frequency is roughly > the same if you have 10 clients with 100K docs each vs. one client > with 1M docs. So the pain of opening a new searcher is roughly the > same. "X" here is experimentally determined. > > Do note that moving from master/slave to SolrCloud will reduce > latency. In M/S, the time it takes to search is autocommit + polling > interval + autowarm time. Going to SolrCloud will remove the "polling > interval" from the equation. Not sure how much that helps. > > There should be an autowarm statistic in the Solr logs BTW. Or some > messages about "opening searcher (some hex stuff)" and another message > about when it's registered as active, along with timestamps. That'll > tell you how long it takes to autowarm. > > OK. "straw man" strategy for your case. Create a collection per > tenant.
What you want to balance is where the collections are hosted. > Host a number of small tenants on the same Solr instance and fewer > larger tenants on other hardware. FWIW, I expect at least 25M docs per > Solr JVM (very hardware dependent of course), although testing is > important. > > Under the covers, each Solr instance establishes "watchers" on the > collections it hosts. So if a particular Solr hosts replicas for, say, > 10 collections, it establishes 10 watchers on the state.json zNode in > Zookeeper. 300 collections isn't all that much in recent Solr > installations. All that filtered through how beefy your hardware is of > course. > > Startup is an interesting case, but I've put 1,600 replicas on 4 Solr > instances on a Mac Pro (400 each). You can configure the number of > startup threads if starting up is too painful. > > So a cluster with 300 collections isn't really straining things. Some > of the literature is talking about thousands of collections. > > Good luck! > Erick > > On Sat, May 6, 2017 at 4:26 PM, Chris Troullis <cptroul...@gmail.com> > wrote: > > Hi Erick, > > > > Thanks for the reply, I really appreciate it. > > > > To answer your questions, we have a little over 300 tenants, and a couple > > of different collections, the largest of which has ~11 million documents > > (so not terribly large). We are currently running standard Solr with > simple > > master/slave replication, so all of the documents are in a single solr > > core. We are planning to move to Solr cloud for various reasons, and as > > discussed previously, I am trying to find the best way to distribute the > > documents to serve a more NRT focused search case. > > > > I totally get your point on pushing back on NRT requirements, and I have > > done so for as long as I can. Currently our auto softcommit is set to 1 > > minute and we are able to achieve great query times with autowarming.
> > Unfortunately, due to the nature of our application, our customers expect > > any changes they make to be visible almost immediately in search, and we > > have recently been getting a lot of complaints in this area, leading to > an > > initiative to drive down the time it takes for documents to become > visible > > in search. Which leaves me where I am now, trying to find the right > balance > > between document visibility and reasonable, stable, query times. > > > > Regarding autowarming, our autowarming times aren't too crazy. We are > > warming a max of 100 entries from the filter cache and it takes around > 5-10 > > seconds to complete on average. I suspect our biggest
Re: Multiple collections vs multiple shards for multitenancy
> feeding... > > Sharding a single large collection and using custom routing to push > tenants to a single shard will be an administrative problem for you. > I'm assuming you have the typical multi-tenant problems, a bunch of > tenants have around N docs, some smaller percentage have 3N and a few > have 100N. Now you're having to keep track of how many docs are on > each shard, do the routing yourself, etc. Plus you can't commit > individually, a commit on one will _still_ commit on all so you're > right back where you started. > > I've seen people use a hybrid approach: experiment with how many > _documents_ you can have in a collection (however you partition that > up) and use the multi-tenant approach. So you have N collections and > each collection has a (varying) number of tenants. This also tends to > flatten out the update process on the assumption that your smaller > tenants also don't update their data as often. > > However, I really have to question one of your basic statements: > > "This works fine with aggressive autowarming, but I have a need to reduce > my NRT > search latency to seconds as opposed to the minutes it is at now,"... > > The implication here is that your autowarming takes minutes. Very > often people severely overdo the warmup by setting their autowarm > counts to 100s or 1000s. This is rarely necessary, especially if you > use docValues fields appropriately. Very often much of autowarming is > "uninverting" fields (look in your Solr log). Essentially, for any > field where you see this, use docValues and loading will be much faster. > > You also haven't said how many documents you have in a shard at > present. This is actually the metric I use most often to size > hardware. I claim you can find a sweet spot where minimal autowarming > will give you good enough performance, and that number is what you > should design to. Of course YMMV. > > Finally: push back really hard on how aggressive NRT support needs to > be.
Often "requirements" like this are made without much thought as > "faster is better, let's make it 1 second!". There are situations > where that's true, but it comes at a cost. Users may be better served > by a predictable but slower system than one that's fast but > unpredictable. "Documents may take up to 5 minutes to appear and > searches will usually take less than a second" is nice and concise. I > have my expectations. "Documents are searchable in 1 second, but the > results may not come back for between 1 and 10 seconds" is much more > frustrating. > > FWIW, > Erick > > On Sat, May 6, 2017 at 5:12 AM, Chris Troullis <cptroul...@gmail.com> > wrote: > > Hi, > > > > I use Solr to serve multiple tenants and currently all tenants' data > > resides in one large collection, and queries have a tenant identifier. > This > > works fine with aggressive autowarming, but I have a need to reduce my > NRT > > search latency to seconds as opposed to the minutes it is at now, > > which will mean drastically reducing if not eliminating my autowarming. > As > > such I am considering splitting my index out by tenant so that when one > > tenant modifies their data it doesn't blow away all of the searcher-based > > caches for all tenants on soft commit. > > > > I have done a lot of research on the subject and it seems like Solr Cloud > > can have problems handling large numbers of collections. I'm obviously > > going to have to run some tests to see how it performs, but my main > > question is this: are there pros and cons to splitting the index into > > multiple collections vs having 1 collection but splitting into multiple > > shards? In my case I would have a shard per tenant and use implicit > routing > > to route to that specific shard. As I understand it a shard is basically > > its own Lucene index, so I would still be eating that overhead with > either > > approach.
What I don't know is if there are any other overheads involved > > WRT collections vs shards, routing, zookeeper, etc. > > > > Thanks, > > > > Chris >
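Erick's hybrid approach from the quoted reply ("put clients on a shared single-shard collection as long as the aggregate number of records is < X") can be sketched as a simple first-fit grouping. The document budget and tenant sizes below are invented purely for illustration; "X" is, as he says, experimentally determined:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HybridGrouping {
    /** First-fit: add each tenant to the first collection with room, else open a new one. */
    static List<Map<String, Long>> group(Map<String, Long> tenantDocs, long maxDocsPerCollection) {
        List<Map<String, Long>> collections = new ArrayList<>();
        List<Long> totals = new ArrayList<>();
        for (Map.Entry<String, Long> t : tenantDocs.entrySet()) {
            boolean placed = false;
            for (int i = 0; i < collections.size(); i++) {
                if (totals.get(i) + t.getValue() <= maxDocsPerCollection) {
                    collections.get(i).put(t.getKey(), t.getValue());
                    totals.set(i, totals.get(i) + t.getValue());
                    placed = true;
                    break;
                }
            }
            if (!placed) { // tenant doesn't fit anywhere: start a new shared collection
                Map<String, Long> fresh = new LinkedHashMap<>();
                fresh.put(t.getKey(), t.getValue());
                collections.add(fresh);
                totals.add(t.getValue());
            }
        }
        return collections;
    }

    public static void main(String[] args) {
        Map<String, Long> tenants = new LinkedHashMap<>();
        tenants.put("tenantA", 100_000L); // small tenants share a collection...
        tenants.put("tenantB", 300_000L);
        tenants.put("tenantC", 900_000L); // ...a large tenant forces a new one
        System.out.println(group(tenants, 1_000_000L).size()); // 2 collections
    }
}
```

Capping each collection keeps searcher-reopen cost roughly uniform, which is the point of Erick's "the pain of opening a new searcher is roughly the same" observation.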
Multiple collections vs multiple shards for multitenancy
Hi, I use Solr to serve multiple tenants and currently all tenants' data resides in one large collection, and queries have a tenant identifier. This works fine with aggressive autowarming, but I have a need to reduce my NRT search latency to seconds as opposed to the minutes it is at now, which will mean drastically reducing if not eliminating my autowarming. As such I am considering splitting my index out by tenant so that when one tenant modifies their data it doesn't blow away all of the searcher-based caches for all tenants on soft commit. I have done a lot of research on the subject and it seems like Solr Cloud can have problems handling large numbers of collections. I'm obviously going to have to run some tests to see how it performs, but my main question is this: are there pros and cons to splitting the index into multiple collections vs having 1 collection but splitting into multiple shards? In my case I would have a shard per tenant and use implicit routing to route to that specific shard. As I understand it a shard is basically its own Lucene index, so I would still be eating that overhead with either approach. What I don't know is if there are any other overheads involved WRT collections vs shards, routing, zookeeper, etc. Thanks, Chris
Sharding strategy for optimal NRT performance
Hi! I am looking for some advice on a sharding strategy that will produce optimal performance in the NRT search case for my setup. I have come up with a strategy that I think will work based on my experience, testing, and reading of similar questions on the mailing list, but I was hoping to run my idea by some experts to see if I am on the right track or am completely off base.

*Let's start off with some background info on my use case*: We are currently using Solr (5.5.2) with the classic Master/Slave setup. Because of our NRT requirements, the slave is pretty much only used for failover; all writes/reads go to the master (which I know is not ideal, but that's what we're working with!). We have 6 different indexes with completely different schemas for various searches in our application. We have just over 300 tenants, which currently all reside within the same index for each of our indexes. We separate our tenants at query time via a filter query with a tenant identifier (which works fine). Each index is not tremendously large; they range from 1M documents to the largest being around 12M documents. Our load is not huge, as search is not the core functionality of our application, but merely a tool to get to what users are looking for in the app. I believe our peak load barely goes over 1 QPS. Even though our number of documents isn't super high, we do some pretty complex faceting, and block joins in some cases, which along with crappy hardware in our data center (no SSDs) initially led to some pretty poor query times for our customers. This was due to the fact that we are constantly indexing throughout the day (a job that runs once per minute), and we auto soft commit (openSearcher=true) every 1 minute. Because of the nature of our application, NRT updates are necessary. As we all know, opening searchers this frequently has the drawback of invalidating all of our searcher-based caches, causing query times to be erratic, and slower on average.
With our current setup, we have solved our query performance problems by setting up autowarming, both on the filter cache and via static warming queries.

*The problem:* While we are now running great from a performance perspective, we are receiving complaints from customers saying that the changes they are making are slow to be reflected in search. Because of the nature of our application, this has significant impact on their user experience, and is an issue we need to solve. Overall, we would like to be able to reduce our NRT visibility from the minutes we have now down to seconds. The problem is doing this in a way that won't significantly affect our query performance. We are already seeing maxWarmingSearchers warnings in our logs occasionally with our current setup, so just indexing more frequently is not a viable solution. In addition to this, autowarming in itself is problematic for the NRT use case, as the new searcher won't start serving requests until it is fully warmed anyway, which is counter to the goal of decreasing the time it takes for new documents to be visible in search. And so this is the predicament we find ourselves in: we can index more frequently (and soft commit more frequently), but we will have to remove (or greatly decrease) our autowarming, which will destroy our search performance. Obviously there is some give and take here, we can't have true NRT search with optimal query performance, but I am hoping to find a solution that will provide acceptable results for both.

*Proposed solution:* I have done a lot of research and experimentation on this issue and have started coming up with what I believe will be a decent solution to the aforementioned problem. First off, I would like to make the move over to Solr Cloud.
We had been contemplating this for a while anyway, as we currently have no load balancing at all (since our slave is just used for failover), but I am also thinking that by using the right sharding strategy we can improve our NRT experience as well. I first started looking at the standard composite id routing, and while we can ensure that all of a single tenant's data is located on the same shard, because there is a large discrepancy between the amounts of data our tenants have, our shards would be very unevenly distributed in terms of number of documents. Ideally, we would like all of our tenants to be isolated from a performance perspective (from a security perspective we are not really concerned, as all of our queries have a tenant identifier filter query already). Basically, we don't want tiny tenant A to be screwed over because they were unlucky enough to land on Huge tenant B's shard. We do know the footprint of each tenant in terms of number of documents, so technically we could work out a sharding strategy manually which would evenly distribute the tenants based on number of documents, but since we have 6 different indexes, and with each index the tenant's document distribution will be
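The manual even-distribution idea mentioned above (assigning tenants to shards using their known document counts) is commonly done with a greedy largest-first assignment: sort tenants by size descending and always place the next one on the least-loaded shard. The tenant names and counts below are invented for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TenantPlacement {
    /** Greedy largest-first: sort tenants by doc count descending, place each on the lightest shard. */
    static Map<Integer, List<String>> place(Map<String, Long> tenantDocs, int shards) {
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(tenantDocs.entrySet());
        sorted.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        long[] load = new long[shards];
        Map<Integer, List<String>> assignment = new HashMap<>();
        for (int s = 0; s < shards; s++) assignment.put(s, new ArrayList<>());
        for (Map.Entry<String, Long> t : sorted) {
            int lightest = 0; // shard with the fewest documents so far
            for (int s = 1; s < shards; s++) if (load[s] < load[lightest]) lightest = s;
            load[lightest] += t.getValue();
            assignment.get(lightest).add(t.getKey());
        }
        return assignment;
    }

    public static void main(String[] args) {
        Map<String, Long> tenants = Map.of(
            "big", 1_000_000L, "mid", 400_000L, "smallA", 100_000L, "smallB", 100_000L);
        // The 1M-doc tenant ends up alone; the smaller tenants balance the other shard.
        System.out.println(place(tenants, 2));
    }
}
```

This addresses the size-skew concern (tiny tenant A not landing on huge tenant B's shard) only in aggregate; true per-tenant isolation would still require a shard or collection per tenant, which is the trade-off the whole post is weighing.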