Re: maxChars no longer working on CopyField as of 7.7

2019-08-15 Thread Chris Troullis
Thanks Erick. I created SOLR-13699. I agree about adding a unit test; that was
my thinking as well. I am currently working on a test, and then I will
submit my patch.
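
(A rough sketch of the sort of test I have in mind, in SolrTestCaseJ4 style; the
schema file, field names, and maxChars value below are placeholders, not the
actual test I will attach to the JIRA:)

    // Sketch only: assumes a test schema containing something like
    //   <copyField source="title" dest="title_dest" maxChars="10"/>
    // and that title_dest is a stored string field.
    @Test
    public void testCopyFieldMaxChars() throws Exception {
      // index a value longer than maxChars
      assertU(adoc("id", "1", "title", "abcdefghijklmnopqrstuvwxyz"));
      assertU(commit());
      // the copy destination should have been truncated to 10 characters
      assertQ(req("q", "id:1", "fl", "title_dest"),
          "//result/doc/str[@name='title_dest'][string-length(.)=10]");
    }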

Thanks,

Chris

On Thu, Aug 15, 2019 at 1:06 PM Erick Erickson 
wrote:

> Chris:
>
> I certainly don’t see anything in JIRA about this, so please do raise a
> JIRA,
> especially as you already have a patch!
>
> It’d be great if you added a test case demonstrating this that fails
> without
> your patch and succeeds after. I’d just add a method to one of the
> existing tests, maybe in
> solr/core/src/test/org/apache/solr/schema/CopyFieldTest.java?
>
> No need to make this a SolrCloud test.
>
> Best,
> Erick
>
> > On Aug 15, 2019, at 11:31 AM, Chris Troullis 
> wrote:
> >
> > Hi all,
> >
> > We recently upgraded from Solr 7.3 to 8.1, and noticed that the maxChars
> > property on a copy field is no longer functioning as designed. Per the
> most
> > recent documentation it looks like there have been no intentional changes
> > as to the functionality of this property, so I assume this is a bug.
> >
> > In debugging the issue, it looks like the bug was caused by SOLR-12992.
> In
> > DocumentBuilder where the maxChar limit is applied, it first checks if
> the
> > value is instanceof String. As of SOLR-12992, string values are now
> coming
> > in as ByteArrayUtf8CharSequence (unless they are above a certain size as
> > defined by JavaBinCodec.MAX_UTF8_SZ), so they are failing the instanceof
> > String check, and the maxChar truncation is not being applied.
> >
> > I went to log a bug but figured I would double check on here first just
> to
> > confirm that people think that this is actually a bug and I'm not going
> > crazy. Let me know what you think, and I will log the bug.
> >
> > I have implemented a fix which I am currently testing and will be happy
> to
> > submit a patch, assuming it's agreed that this is not intended behavior.
> >
> > Thanks,
> > Chris
>
>


maxChars no longer working on CopyField as of 7.7

2019-08-15 Thread Chris Troullis
Hi all,

We recently upgraded from Solr 7.3 to 8.1, and noticed that the maxChars
property on a copy field is no longer functioning as designed. Per the most
recent documentation it looks like there have been no intentional changes
as to the functionality of this property, so I assume this is a bug.
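
(For context, the property in question is the maxChars attribute on a copyField
declaration in the schema; the field names here are just an example:)

    <copyField source="description" dest="description_short" maxChars="300"/>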

In debugging the issue, it looks like the bug was caused by SOLR-12992. In
DocumentBuilder where the maxChar limit is applied, it first checks if the
value is instanceof String. As of SOLR-12992, string values are now coming
in as ByteArrayUtf8CharSequence (unless they are above a certain size as
defined by JavaBinCodec.MAX_UTF8_SZ), so they are failing the instanceof
String check, and the maxChar truncation is not being applied.
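
(To illustrate the kind of change involved, here is a sketch of the idea only,
not the actual patch; variable names are placeholders:)

    // Today the truncation is effectively gated on:
    //   if (v instanceof String && maxChars > 0) { ...truncate... }
    // which ByteArrayUtf8CharSequence values no longer satisfy.
    // Broadening the check to any CharSequence restores the behavior:
    if (v instanceof CharSequence && maxChars > 0) {
      CharSequence cs = (CharSequence) v;
      if (cs.length() > maxChars) {
        v = cs.subSequence(0, maxChars).toString();
      }
    }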

I went to log a bug but figured I would double check on here first just to
confirm that people think that this is actually a bug and I'm not going
crazy. Let me know what you think, and I will log the bug.

I have implemented a fix which I am currently testing and will be happy to
submit a patch, assuming it's agreed that this is not intended behavior.

Thanks,
Chris


Re: Suggestions for debugging performance issue

2018-06-25 Thread Chris Troullis
FYI to all, just as an update, we rebuilt the index in question from
scratch for a second time this weekend and the problem went away on 1 node,
but we were still seeing it on the other node. After restarting the
problematic node, the problem went away. Still makes me a little uneasy as
we weren't able to determine the cause, but at least we are back to normal
query times now.

Chris

On Fri, Jun 15, 2018 at 8:06 AM, Chris Troullis 
wrote:

> Thanks Shawn,
>
> As mentioned previously, we are hard committing every 60 seconds, which we
> have been doing for years, and have had no issues until enabling CDCR. We
> have never seen large tlog sizes before, and even manually issuing a hard
> commit to the collection does not reduce the size of the tlogs. I believe
> this is because when using the CDCRUpdateLog the tlogs are not purged until
> the docs have been replicated over. Anyway, since we manually purged the
> tlogs they seem to now be staying at an acceptable size, so I don't think
> that is the cause. The documents are not abnormally large, maybe ~20
> string/numeric fields with simple whitespace tokenization.
>
> To answer your questions:
>
> -Solr version: 7.2.1
> -What OS vendor and version Solr is running on: CentOS 6
> -Total document count on the server (counting all index cores): 13
> collections totaling ~60 million docs
> -Total index size on the server (counting all cores): ~60GB
> -What the total of all Solr heaps on the server is - 16GB heap (we had to
> increase for CDCR because it was using a lot more heap).
> -Whether there is software other than Solr on the server - No
> -How much total memory the server has installed - 64 GB
>
> All of this has been consistent for multiple years across multiple Solr
> versions and we have only started seeing this issue once we started using
> the CDCRUpdateLog and CDCR, hence why that is the only real thing we can
> point to. And again, the issue is only affecting 1 of the 13 collections on
> the server, so if it was hardware/heap/GC related then I would think we
> would be seeing it for every collection, not just one, as they all share
> the same resources.
>
> I will take a look at the GC logs, but I don't think that is the cause.
> The consistent nature of the slow performance doesn't really point to GC
> issues, and we have profiling set up in New Relic and it does not show any
> long/frequent GC pauses.
>
> We are going to try and rebuild the collection from scratch again this
> weekend as that has solved the issue in some lower environments, although
> it's not really consistent. At this point it's all we can think of to do.
>
> Thanks,
>
> Chris
>
>
> On Thu, Jun 14, 2018 at 6:23 PM, Shawn Heisey  wrote:
>
>> On 6/12/2018 12:06 PM, Chris Troullis wrote:
>> > The issue we are seeing is with 1 collection in particular, after we
>> set up
>> > CDCR, we are getting extremely slow response times when retrieving
>> > documents. Debugging the query shows QTime is almost nothing, but the
>> > overall responseTime is like 5x what it should be. The problem is
>> > exacerbated by larger result sizes. IE retrieving 25 results is almost
>> > normal, but 200 results is way slower than normal. I can run the exact
>> same
>> > query multiple times in a row (so everything should be cached), and I
>> still
>> > see response times way higher than another environment that is not using
>> > CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just
>> that
>> > we are using the CDCRUpdateLog. The problem started happening even
>> before
>> > we enabled CDCR.
>> >
>> > In a lower environment we noticed that the transaction logs were huge
>> > (multiple gigs), so we tried stopping solr and deleting the tlogs then
>> > restarting, and that seemed to fix the performance issue. We tried the
>> same
>> > thing in production the other day but it had no effect, so now I don't
>> know
>> > if it was a coincidence or not.
>>
>> There is one other cause besides CDCR buffering that I know of for huge
>> transaction logs, and it has nothing to do with CDCR:  A lack of hard
>> commits.  It is strongly recommended to have autoCommit set to a
>> reasonably short interval (about a minute in my opinion, but 15 seconds
>> is VERY common).  Most of the time openSearcher should be set to false
>> in the autoCommit config, and other mechanisms (which might include
>> autoSoftCommit) should be used for change visibility.  The example
>> autoCommit settings might seem superfluous because they don't affect
>> what's searchable, but it is actually a very important configuration to keep.

Re: Suggestions for debugging performance issue

2018-06-15 Thread Chris Troullis
Thanks Shawn,

As mentioned previously, we are hard committing every 60 seconds, which we
have been doing for years, and have had no issues until enabling CDCR. We
have never seen large tlog sizes before, and even manually issuing a hard
commit to the collection does not reduce the size of the tlogs. I believe
this is because when using the CDCRUpdateLog the tlogs are not purged until
the docs have been replicated over. Anyway, since we manually purged the
tlogs they seem to now be staying at an acceptable size, so I don't think
that is the cause. The documents are not abnormally large, maybe ~20
string/numeric fields with simple whitespace tokenization.
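
(For reference, the commit setup described in these threads corresponds roughly
to the following solrconfig.xml snippet; the interval values come from the
discussion and the openSearcher settings are the usual/implied ones, so treat
this as a sketch rather than our exact config:)

    <autoCommit>
      <maxTime>60000</maxTime>            <!-- hard commit every 60 seconds -->
      <openSearcher>false</openSearcher>  <!-- hard commits don't open a new searcher -->
    </autoCommit>
    <autoSoftCommit>
      <maxTime>15000</maxTime>            <!-- soft commits handle visibility / NRT -->
    </autoSoftCommit>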

To answer your questions:

-Solr version: 7.2.1
-What OS vendor and version Solr is running on: CentOS 6
-Total document count on the server (counting all index cores): 13
collections totaling ~60 million docs
-Total index size on the server (counting all cores): ~60GB
-What the total of all Solr heaps on the server is - 16GB heap (we had to
increase for CDCR because it was using a lot more heap).
-Whether there is software other than Solr on the server - No
-How much total memory the server has installed - 64 GB

All of this has been consistent for multiple years across multiple Solr
versions and we have only started seeing this issue once we started using
the CDCRUpdateLog and CDCR, hence why that is the only real thing we can
point to. And again, the issue is only affecting 1 of the 13 collections on
the server, so if it was hardware/heap/GC related then I would think we
would be seeing it for every collection, not just one, as they all share
the same resources.

I will take a look at the GC logs, but I don't think that is the cause. The
consistent nature of the slow performance doesn't really point to GC
issues, and we have profiling set up in New Relic and it does not show any
long/frequent GC pauses.

We are going to try and rebuild the collection from scratch again this
weekend as that has solved the issue in some lower environments, although
it's not really consistent. At this point it's all we can think of to do.

Thanks,

Chris


On Thu, Jun 14, 2018 at 6:23 PM, Shawn Heisey  wrote:

> On 6/12/2018 12:06 PM, Chris Troullis wrote:
> > The issue we are seeing is with 1 collection in particular, after we set
> up
> > CDCR, we are getting extremely slow response times when retrieving
> > documents. Debugging the query shows QTime is almost nothing, but the
> > overall responseTime is like 5x what it should be. The problem is
> > exacerbated by larger result sizes. IE retrieving 25 results is almost
> > normal, but 200 results is way slower than normal. I can run the exact
> same
> > query multiple times in a row (so everything should be cached), and I
> still
> > see response times way higher than another environment that is not using
> > CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just that
> > we are using the CDCRUpdateLog. The problem started happening even before
> > we enabled CDCR.
> >
> > In a lower environment we noticed that the transaction logs were huge
> > (multiple gigs), so we tried stopping solr and deleting the tlogs then
> > restarting, and that seemed to fix the performance issue. We tried the
> same
> > thing in production the other day but it had no effect, so now I don't
> know
> > if it was a coincidence or not.
>
> There is one other cause besides CDCR buffering that I know of for huge
> transaction logs, and it has nothing to do with CDCR:  A lack of hard
> commits.  It is strongly recommended to have autoCommit set to a
> reasonably short interval (about a minute in my opinion, but 15 seconds
> is VERY common).  Most of the time openSearcher should be set to false
> in the autoCommit config, and other mechanisms (which might include
> autoSoftCommit) should be used for change visibility.  The example
> autoCommit settings might seem superfluous because they don't affect
> what's searchable, but it is actually a very important configuration to
> keep.
>
> Are the docs in this collection really big, by chance?
>
> As I went through previous threads you've started on the mailing list, I
> have noticed that none of your messages provided some details that would
> be useful for looking into performance problems:
>
>  * What OS vendor and version Solr is running on.
>  * Total document count on the server (counting all index cores).
>  * Total index size on the server (counting all cores).
>  * What the total of all Solr heaps on the server is.
>  * Whether there is software other than Solr on the server.
>  * How much total memory the server has installed.
>
> If you name the OS, I can use that information to help you gather some
> additional info which will actually show me most of that list.  Total
> docum

Re: Suggestions for debugging performance issue

2018-06-13 Thread Chris Troullis
Hi Susheel,

It's not drastically different, no. There are other collections with more
fields and more documents that don't have this issue. And the collection is
not sharded. Just 1 shard with 2 replicas. Both replicas are similar in
response time.

Thanks,
Chris

On Wed, Jun 13, 2018 at 2:37 PM, Susheel Kumar 
wrote:

> Is this collection in any way drastically different from the others in terms of
> schema, # of fields, total documents, etc.? Is it sharded, and if so, can you
> look at which shard is taking more time with shards.info=true?
>
> Thnx
> Susheel
>
> On Wed, Jun 13, 2018 at 2:29 PM, Chris Troullis 
> wrote:
>
> > Thanks Erick,
> >
> > Seems to be a mixed bag in terms of tlog size across all of our indexes,
> > but currently the index with the performance issues has 4 tlog files
> > totaling ~200 MB. This still seems high to me since the collections are in
> > sync, and we hard commit every minute, but it's less than the ~8GB it was
> > before we cleaned them up. Spot checking some other indexes shows some have
> > tlogs >3GB, but none of those indexes are having performance issues (on
> the
> > same solr node), so I'm not sure it's related. We have 13 collections of
> > various sizes running on our solr cloud cluster, and none of them seem to
> > have this issue except for this one index, which is not our largest index
> > in terms of size on disk or number of documents.
> >
> > As far as the response intervals, just running a default search *:*
> sorting
> > on our id field so that we get consistent results across environments,
> and
> > returning 200 results (our max page size in app) with ~20 fields, we see
> > times of ~3.5 seconds in production, compared to ~1 second on one of our
> > lower environments with an exact copy of the index. Both have CDCR
> enabled
> > and have identical clusters.
> >
> > Unfortunately, currently the only instance we are seeing the issue on is
> > production, so we are limited in the tests that we can run. I did confirm
> > in the lower environment that the doc cache is large enough to hold all
> of
> > the results, and that both the doc and query caches should be serving the
> > results. Obviously in production we have much more indexing going on, but we
> > do utilize autowarming for our caches so our response times are still
> > stable across new searchers.
> >
> > We did move the lower environment to the same ESX host as our production
> > cluster, so that it is getting resources from the same pool (CPU, RAM,
> > etc). The only thing that is different is the disks, but the lower
> > environment is running on slower disks than production. And if it was a
> > disk issue you would think it would be affecting all of the collections,
> > not just this one.
> >
> > It's a mystery!
> >
> > Chris
> >
> >
> >
> > On Wed, Jun 13, 2018 at 10:38 AM, Erick Erickson <
> erickerick...@gmail.com>
> > wrote:
> >
> > > First, nice job of eliminating all the standard stuff!
> > >
> > > About tlogs: Sanity check: They aren't growing again, right? They
> > > should hit a relatively steady state. The tlogs are used as a queueing
> > > mechanism for CDCR to durably store updates until they can
> > > successfully be transmitted to the target. So I'd expect them to hit a
> > > fairly steady number.
> > >
> > > Your lack of CPU/IO spikes is also indicative of something weird,
> > > somehow Solr just sitting around doing nothing. What intervals are we
> > > talking about here for response? 100ms? 5000ms?
> > >
> > > When you hammer the same query over and over, you should see your
> > > queryResultCache hits increase. If that's the case, Solr is doing no
> > > work at all for the search, just assembling the response packet which,
> > > as you say, should be in the documentCache. This assumes it's big
> > > enough to hold all of the docs that are requested by all the
> > > simultaneous requests. The queryResultCache cache will be flushed
> > > every time a new searcher is opened. So if you still get your poor
> > > response times, and your queryResultCache hits are increasing then
> > > Solr is doing pretty much nothing.
> > >
> > > So does this behavior still occur if you aren't adding docs to the
> > > index? If you turn indexing off as a test, that'd be another data
> > > point.
> > >
> > > And, of course, if it's at all possible to just take the CDCR
> > > configuration out of your solrconfig file temporarily that'd nail
> > &

Re: Suggestions for debugging performance issue

2018-06-13 Thread Chris Troullis
Thanks Erick,

Seems to be a mixed bag in terms of tlog size across all of our indexes,
but currently the index with the performance issues has 4 tlog files
totaling ~200 MB. This still seems high to me since the collections are in
sync, and we hard commit every minute, but it's less than the ~8GB it was
before we cleaned them up. Spot checking some other indexes shows some have
tlogs >3GB, but none of those indexes are having performance issues (on the
same solr node), so I'm not sure it's related. We have 13 collections of
various sizes running on our solr cloud cluster, and none of them seem to
have this issue except for this one index, which is not our largest index
in terms of size on disk or number of documents.

As far as the response intervals, just running a default search *:* sorting
on our id field so that we get consistent results across environments, and
returning 200 results (our max page size in app) with ~20 fields, we see
times of ~3.5 seconds in production, compared to ~1 second on one of our
lower environments with an exact copy of the index. Both have CDCR enabled
and have identical clusters.
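
(Concretely, the comparison query looks along these lines; the host, collection,
and field names are placeholders and the fl list is abbreviated:)

    http://host:8983/solr/our_collection/select?q=*:*&sort=id+asc&rows=200&fl=id,field1,field2,...,field20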

Unfortunately, currently the only instance we are seeing the issue on is
production, so we are limited in the tests that we can run. I did confirm
in the lower environment that the doc cache is large enough to hold all of
the results, and that both the doc and query caches should be serving the
results. Obviously in production we have much more indexing going on, but we
do utilize autowarming for our caches so our response times are still
stable across new searchers.
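
(For reference, those two caches are the <queryResultCache> and <documentCache>
entries in solrconfig.xml; a hedged example with made-up sizes:)

    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
    <documentCache class="solr.LRUCache" size="4096" initialSize="1024" autowarmCount="0"/>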

We did move the lower environment to the same ESX host as our production
cluster, so that it is getting resources from the same pool (CPU, RAM,
etc). The only thing that is different is the disks, but the lower
environment is running on slower disks than production. And if it was a
disk issue you would think it would be affecting all of the collections,
not just this one.

It's a mystery!

Chris



On Wed, Jun 13, 2018 at 10:38 AM, Erick Erickson 
wrote:

> First, nice job of eliminating all the standard stuff!
>
> About tlogs: Sanity check: They aren't growing again, right? They
> should hit a relatively steady state. The tlogs are used as a queueing
> mechanism for CDCR to durably store updates until they can
> successfully be transmitted to the target. So I'd expect them to hit a
> fairly steady number.
>
> Your lack of CPU/IO spikes is also indicative of something weird,
> somehow Solr just sitting around doing nothing. What intervals are we
> talking about here for response? 100ms? 5000ms?
>
> When you hammer the same query over and over, you should see your
> queryResultCache hits increase. If that's the case, Solr is doing no
> work at all for the search, just assembling the response packet which,
> as you say, should be in the documentCache. This assumes it's big
> enough to hold all of the docs that are requested by all the
> simultaneous requests. The queryResultCache cache will be flushed
> every time a new searcher is opened. So if you still get your poor
> response times, and your queryResultCache hits are increasing then
> Solr is doing pretty much nothing.
>
> So does this behavior still occur if you aren't adding docs to the
> index? If you turn indexing off as a test, that'd be another data
> point.
>
> And, of course, if it's at all possible to just take the CDCR
> configuration out of your solrconfig file temporarily that'd nail
> whether CDCR is the culprit or whether it's coincidental. You say that
> CDCR is the only difference between the environments, but I've
> certainly seen situations where it turns out to be a bad disk
> controller or something that's _also_ different.
>
> Now, assuming all that's inconclusive, I'm afraid the next step would
> be to throw a profiler at it. Maybe pull a few stack traces.
>
> Best,
> Erick
>
> On Wed, Jun 13, 2018 at 6:15 AM, Chris Troullis 
> wrote:
> > Thanks Erick. A little more info:
> >
> > -We do have buffering disabled everywhere, as I had read multiple posts
> on
> > the mailing list regarding the issue you described.
> > -We soft commit (with opensearcher=true) pretty frequently (15 seconds)
> as
> > we have some NRT requirements. We hard commit every 60 seconds. We never
> > commit manually, only via the autocommit timers. We have been using these
> > settings for a long time and have never had any issues until recently.
> And
> > all of our other indexes are fine (some larger than this one).
> > -We do have documentResultCache enabled, although it's not very big. But
> I
> > can literally spam the same query over and over again with no other
> queries
> > hitting the box, so all the results should be cached.
> > -We don't see any CPU/IO spikes when ru

Re: Suggestions for debugging performance issue

2018-06-13 Thread Chris Troullis
Thanks Erick. A little more info:

-We do have buffering disabled everywhere, as I had read multiple posts on
the mailing list regarding the issue you described.
-We soft commit (with opensearcher=true) pretty frequently (15 seconds) as
we have some NRT requirements. We hard commit every 60 seconds. We never
commit manually, only via the autocommit timers. We have been using these
settings for a long time and have never had any issues until recently. And
all of our other indexes are fine (some larger than this one).
-We do have documentResultCache enabled, although it's not very big. But I
can literally spam the same query over and over again with no other queries
hitting the box, so all the results should be cached.
-We don't see any CPU/IO spikes when running these queries, our load is
pretty much flat on all accounts.
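
(For what it's worth, the cache hit/lookup counts mentioned above can be checked
per core via the mbeans endpoint; host and collection names are placeholders:)

    curl "http://host:8983/solr/our_collection/admin/mbeans?stats=true&cat=CACHE&wt=json"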

I know it seems odd that CDCR would be the culprit, but it's really the
only thing we've changed, and we have other environments running the exact
same setup with no issues, so it is really making us tear our hair out. And
when we cleaned up the huge tlogs it didn't seem to make any difference in
the query time (I was originally thinking it was somehow searching through
the tlogs for documents, and that's why it was taking so long to retrieve
the results, but I don't know if that is actually how it works).

Are you aware of any logger settings we could increase to potentially get a
better idea of where the time is being spent? I took the eventual query
response and just hosted as a static file on the same machine via nginx and
it downloaded lightning fast (I was trying to rule out network as the
culprit), so it seems like the time is being spent somewhere in solr.
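
(The comparison was essentially this kind of timing check, whether done via curl
or a browser; URLs are placeholders:)

    # time-to-first-byte vs. total time for the live Solr response
    curl -s -o /dev/null \
      -w "connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n" \
      "http://host:8983/solr/our_collection/select?q=*:*&sort=id+asc&rows=200"

    # the same response body served as a static file through nginx
    curl -s -o /dev/null -w "total=%{time_total}s\n" "http://host/static/response.json"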

Thanks,
Chris

On Tue, Jun 12, 2018 at 2:45 PM, Erick Erickson 
wrote:

> Having the tlogs be huge is a red flag. Do you have buffering enabled
> in CDCR? This was something of a legacy option that's going to be
> removed, it's been made obsolete by the ability of CDCR to bootstrap
> the entire index. Buffering should be disabled always.
>
> Another reason tlogs can grow is if you have very long times between
> hard commits. I doubt that's your issue, but just in case.
>
> And the final reason tlogs can grow is that the connection between
> source and target clusters is broken, but that doesn't sound like what
> you're seeing either since you say the target cluster is keeping up.
>
> The process of assembling the response can be long. If you have any
> stored fields (and not docValues-enabled), Solr will
> 1> seek the stored data on disk
> 2> decompress (min 16K blocks)
> 3> transmit the thing back to your client
>
> The decompressed version of the doc will be held in the
> documentResultCache configured in solrconfig.xml, so it may or may not
> be cached in memory. That said, this stuff is all MemMapped and the
> decompression isn't usually an issue, I'd expect you to see very large
> CPU spikes and/or I/O contention if that was the case.
>
> CDCR shouldn't really be that much of a hit, mostly I/O. Solr will
> have to look in the tlogs to get you the very most recent copy, so the
> first place I'd look is keeping the tlogs under control first.
>
> The other possibility (again unrelated to CDCR) is if your spikes are
> coincident with soft commits or hard-commits-with-opensearcher-true.
>
> In all, though, none of the usual suspects seems to make sense here
> since you say that absent configuring CDCR things seem to run fine. So
> I'd look at the tlogs and my commit intervals. Once the tlogs are
> under control then move on to other possibilities if the problem
> persists...
>
> Best,
> Erick
>
>
> On Tue, Jun 12, 2018 at 11:06 AM, Chris Troullis 
> wrote:
> > Hi all,
> >
> > Recently we have gone live using CDCR on our 2 node solr cloud cluster
> > (7.2.1). From a CDCR perspective, everything seems to be working
> > fine...collections are staying in sync across the cluster, everything
> looks
> > good.
> >
> > The issue we are seeing is with 1 collection in particular, after we set
> up
> > CDCR, we are getting extremely slow response times when retrieving
> > documents. Debugging the query shows QTime is almost nothing, but the
> > overall responseTime is like 5x what it should be. The problem is
> > exacerbated by larger result sizes. IE retrieving 25 results is almost
> > normal, but 200 results is way slower than normal. I can run the exact
> same
> > query multiple times in a row (so everything should be cached), and I
> still
> > see response times way higher than another environment that is not using
> > CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just that
> > we are using the CDCRUpdateLog. The problem started happening even before
> > we 

Suggestions for debugging performance issue

2018-06-12 Thread Chris Troullis
Hi all,

Recently we have gone live using CDCR on our 2 node solr cloud cluster
(7.2.1). From a CDCR perspective, everything seems to be working
fine...collections are staying in sync across the cluster, everything looks
good.

The issue we are seeing is with 1 collection in particular, after we set up
CDCR, we are getting extremely slow response times when retrieving
documents. Debugging the query shows QTime is almost nothing, but the
overall responseTime is like 5x what it should be. The problem is
exacerbated by larger result sizes. IE retrieving 25 results is almost
normal, but 200 results is way slower than normal. I can run the exact same
query multiple times in a row (so everything should be cached), and I still
see response times way higher than another environment that is not using
CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just that
we are using the CDCRUpdateLog. The problem started happening even before
we enabled CDCR.

In a lower environment we noticed that the transaction logs were huge
(multiple gigs), so we tried stopping solr and deleting the tlogs then
restarting, and that seemed to fix the performance issue. We tried the same
thing in production the other day but it had no effect, so now I don't know
if it was a coincidence or not.

Things that we have tried:

-Completely deleting the collection and rebuilding from scratch
-Running the query directly from solr admin to eliminate other causes
-Doing a tcpdump on the solr node to eliminate a network issue

None of these things have yielded any results. It seems very inconsistent.
Some environments we can reproduce it in, others we can't.
Hardware/configuration/network is exactly the same between all
envrionments. The only thing that we have narrowed it down to is we are
pretty sure it has something to do with CDCR, as the issue only started
when we started using it.

I'm wondering if any of this sparks any ideas from anyone, or if people
have suggestions as to how I can figure out what is causing this long query
response time? The debug flag on the query seems more geared towards seeing
where time is spent in the actual query, which is nothing in my case. The
time is spent retrieving the results, which I don't have much information
on. I have tried increasing the log level but nothing jumps out at me in
the solr logs. Is there something I can look for specifically to help debug
this?

Thanks,

Chris


Re: Weird transaction log behavior with CDCR

2018-04-17 Thread Chris Troullis
Hi Amrit, thanks for the reply.

I shut down all of the nodes on the source cluster after the buffer was
disabled, and there was no change to the tlogs.

On Tue, Apr 17, 2018 at 12:20 PM, Amrit Sarkar <sarkaramr...@gmail.com>
wrote:

> Chris,
>
> After disabling the buffer on source, kindly shut down all the nodes of
> source cluster first and then start them again. The tlogs will be removed
> accordingly. BTW CDCR doesn't abide by 100 numRecordsToKeep or 10 numTlogs.
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
>
> On Tue, Apr 17, 2018 at 8:58 PM, Susheel Kumar <susheel2...@gmail.com>
> wrote:
>
> > DISABLEBUFFER on source cluster would solve this problem.
> >
> > On Tue, Apr 17, 2018 at 9:29 AM, Chris Troullis <cptroul...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > We are attempting to use CDCR with solr 7.2.1 and are experiencing odd
> > > behavior with transaction logs. My understanding is that by default,
> solr
> > > will keep a maximum of 10 tlog files or 100 records in the tlogs. I
> > assume
> > > that with CDCR, the records will not be removed from the tlogs until it
> > has
> > > been confirmed that they have been replicated to the other cluster.
> > > However, even when replication has finished and the CDCR queue sizes
> are
> > 0,
> > > we are still seeing large numbers (50+) and large sizes (over a GB) of
> > > tlogs sitting on the nodes.
> > >
> > > We are hard committing once per minute.
> > >
> > > Doing a lot of reading on the mailing list, I see that a lot of people
> > were
> > > pointing to buffering being enabled as the cause for some of these
> > > transaction log issues. However, we have disabled buffering on both the
> > > source and target clusters, and are still seeing the issues.
> > >
> > > Also, while some of our indexes replicate very rapidly (millions of
> > > documents in minutes), other smaller indexes are crawling. If we
> restart
> > > CDCR on the nodes then it finishes almost instantly.
> > >
> > > Any thoughts on these behaviors?
> > >
> > > Thanks,
> > >
> > > Chris
> > >
> >
>


Weird transaction log behavior with CDCR

2018-04-17 Thread Chris Troullis
Hi,

We are attempting to use CDCR with solr 7.2.1 and are experiencing odd
behavior with transaction logs. My understanding is that by default, solr
will keep a maximum of 10 tlog files or 100 records in the tlogs. I assume
that with CDCR, the records will not be removed from the tlogs until it has
been confirmed that they have been replicated to the other cluster.
However, even when replication has finished and the CDCR queue sizes are 0,
we are still seeing large numbers (50+) and large sizes (over a GB) of
tlogs sitting on the nodes.

We are hard committing once per minute.

Doing a lot of reading on the mailing list, I see that a lot of people were
pointing to buffering being enabled as the cause for some of these
transaction log issues. However, we have disabled buffering on both the
source and target clusters, and are still seeing the issues.
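
(For reference, buffering was toggled through the CDCR API on each collection,
along these lines; host and collection names are placeholders:)

    curl "http://host:8983/solr/our_collection/cdcr?action=DISABLEBUFFER"
    # queue sizes were checked with
    curl "http://host:8983/solr/our_collection/cdcr?action=QUEUES"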

Also, while some of our indexes replicate very rapidly (millions of
documents in minutes), other smaller indexes are crawling. If we restart
CDCR on the nodes then it finishes almost instantly.

Any thoughts on these behaviors?

Thanks,

Chris


Re: CDCR Invalid Number on deletes

2018-03-20 Thread Chris Troullis
Never mind, I found it... the link you posted links me to SOLR-12036 instead
of SOLR-12063 for some reason.

On Tue, Mar 20, 2018 at 1:51 PM, Chris Troullis <cptroul...@gmail.com>
wrote:

> Hey Amrit,
>
> Did you happen to see my last reply?  Is SOLR-12036 the correct JIRA?
>
> Thanks,
>
> Chris
>
> On Wed, Mar 7, 2018 at 1:52 PM, Chris Troullis <cptroul...@gmail.com>
> wrote:
>
>> Hey Amrit, thanks for the reply!
>>
>> I checked out SOLR-12036, but it doesn't look like it has to do with
>> CDCR, and the patch that is attached doesn't look CDCR related. Are you
>> sure that's the correct JIRA number?
>>
>> Thanks,
>>
>> Chris
>>
>> On Wed, Mar 7, 2018 at 11:21 AM, Amrit Sarkar <sarkaramr...@gmail.com>
>> wrote:
>>
>>> Hey Chris,
>>>
>>> I figured a separate issue while working on CDCR which may relate to your
>>> problem. Please see jira: *SOLR-12063*
>>> <https://issues.apache.org/jira/projects/SOLR/issues/SOLR-12063>. This is a
>>> bug that got introduced when we added support for the bidirectional approach,
>>> where an extra flag is added to the tlog entry for CDCR.
>>>
>>> This part of the code is messing up:
>>> *UpdateLog.java.RecentUpdates::update()::*
>>>
>>> switch (oper) {
>>>   case UpdateLog.ADD:
>>>   case UpdateLog.UPDATE_INPLACE:
>>>   case UpdateLog.DELETE:
>>>   case UpdateLog.DELETE_BY_QUERY:
>>> Update update = new Update();
>>> update.log = oldLog;
>>> update.pointer = reader.position();
>>> update.version = version;
>>>
>>> if (oper == UpdateLog.UPDATE_INPLACE && entry.size() == 5) {
>>>   update.previousVersion = (Long) entry.get(UpdateLog.PREV_VERSION_IDX);
>>> }
>>> updatesForLog.add(update);
>>> updates.put(version, update);
>>>
>>> if (oper == UpdateLog.DELETE_BY_QUERY) {
>>>   deleteByQueryList.add(update);
>>> } else if (oper == UpdateLog.DELETE) {
>>>   deleteList.add(new DeleteUpdate(version,
>>> (byte[])entry.get(entry.size()-1)));
>>> }
>>>
>>> break;
>>>
>>>   case UpdateLog.COMMIT:
>>> break;
>>>   default:
>>> throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
>>> "Unknown Operation! " + oper);
>>> }
>>>
>>> deleteList.add(new DeleteUpdate(version, (byte[])entry.get(entry.size()-1)));
>>>
>>> is expecting the last entry to be the payload, but everywhere in the
>>> project *pos:[2]* is the index of the payload, while in / after Solr 7.2 the
>>> last entry is a *boolean* denoting whether the update is CDCR-forwarded or a
>>> typical update. UpdateLog.RecentUpdates is used in CDCR sync and checkpoint
>>> operations, hence it is a legit bug that slipped past the tests I wrote.
>>>
>>> The immediate fix patch is uploaded and I am awaiting feedback on that.
>>> Meanwhile if it is possible for you to apply the patch, build the jar and
>>> try it out, please do and let us know.
>>>
>>> For, *SOLR-9394* <https://issues.apache.org/jira/browse/SOLR-9394>, if
>>> you
>>> can comment on the JIRA and post the sample docs, solr logs, relevant
>>> information, I can give it a thorough look.
>>>
>>> Amrit Sarkar
>>> Search Engineer
>>> Lucidworks, Inc.
>>> 415-589-9269
>>> www.lucidworks.com
>>> Twitter http://twitter.com/lucidworks
>>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>>> Medium: https://medium.com/@sarkaramrit2
>>>
>>> On Wed, Mar 7, 2018 at 1:35 AM, Chris Troullis <cptroul...@gmail.com>
>>> wrote:
>>>
>>> > Hi all,
>>> >
>>> > We recently upgraded to Solr 7.2.0 as we saw that there were some CDCR
>>> bug
>>> > fixes and features added that would finally let us be able to make use
>>> of
>>> > it (bi-directional syncing was the big one). The first time we tried to
>>> > implement we ran into all kinds of errors, but this time we were able
>>> to
>>> > get it mostly working.
>>> >
>>> > The issue we seem to be having now is that any time a document is
>>> deleted
>>> > via deleteById from a collection on the primary node, we are flooded
>>> with
>>

Re: CDCR Invalid Number on deletes

2018-03-20 Thread Chris Troullis
Hey Amrit,

Did you happen to see my last reply?  Is SOLR-12036 the correct JIRA?

Thanks,

Chris

On Wed, Mar 7, 2018 at 1:52 PM, Chris Troullis <cptroul...@gmail.com> wrote:

> Hey Amrit, thanks for the reply!
>
> I checked out SOLR-12036, but it doesn't look like it has to do with CDCR,
> and the patch that is attached doesn't look CDCR related. Are you sure
> that's the correct JIRA number?
>
> Thanks,
>
> Chris
>
> On Wed, Mar 7, 2018 at 11:21 AM, Amrit Sarkar <sarkaramr...@gmail.com>
> wrote:
>
>> Hey Chris,
>>
>> I figured a separate issue while working on CDCR which may relate to your
>> problem. Please see jira: *SOLR-12063*
>> <https://issues.apache.org/jira/projects/SOLR/issues/SOLR-12063>. This is a
>> bug that got introduced when we added support for the bidirectional approach,
>> where an extra flag is added to the tlog entry for CDCR.
>>
>> This part of the code is messing up:
>> *UpdateLog.java.RecentUpdates::update()::*
>>
>> switch (oper) {
>>   case UpdateLog.ADD:
>>   case UpdateLog.UPDATE_INPLACE:
>>   case UpdateLog.DELETE:
>>   case UpdateLog.DELETE_BY_QUERY:
>> Update update = new Update();
>> update.log = oldLog;
>> update.pointer = reader.position();
>> update.version = version;
>>
>> if (oper == UpdateLog.UPDATE_INPLACE && entry.size() == 5) {
>>   update.previousVersion = (Long) entry.get(UpdateLog.PREV_VERSION_IDX);
>> }
>> updatesForLog.add(update);
>> updates.put(version, update);
>>
>> if (oper == UpdateLog.DELETE_BY_QUERY) {
>>   deleteByQueryList.add(update);
>> } else if (oper == UpdateLog.DELETE) {
>>   deleteList.add(new DeleteUpdate(version,
>> (byte[])entry.get(entry.size()-1)));
>> }
>>
>> break;
>>
>>   case UpdateLog.COMMIT:
>> break;
>>   default:
>> throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
>> "Unknown Operation! " + oper);
>> }
>>
>> deleteList.add(new DeleteUpdate(version, (byte[])entry.get(entry.size()-1)));
>>
>> is expecting the last entry to be the payload, but everywhere in the
>> project *pos:[2]* is the index of the payload, while in / after Solr 7.2 the
>> last entry is a *boolean* denoting whether the update is CDCR-forwarded or a
>> typical update. UpdateLog.RecentUpdates is used in CDCR sync and checkpoint
>> operations, hence it is a legit bug that slipped past the tests I wrote.
>>
>> The immediate fix patch is uploaded and I am awaiting feedback on that.
>> Meanwhile if it is possible for you to apply the patch, build the jar and
>> try it out, please do and let us know.
>>
>> For, *SOLR-9394* <https://issues.apache.org/jira/browse/SOLR-9394>, if
>> you
>> can comment on the JIRA and post the sample docs, solr logs, relevant
>> information, I can give it a thorough look.
>>
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> Medium: https://medium.com/@sarkaramrit2
>>
>> On Wed, Mar 7, 2018 at 1:35 AM, Chris Troullis <cptroul...@gmail.com>
>> wrote:
>>
>> > Hi all,
>> >
>> > We recently upgraded to Solr 7.2.0 as we saw that there were some CDCR
>> bug
>> > fixes and features added that would finally let us be able to make use
>> of
>> > it (bi-directional syncing was the big one). The first time we tried to
>> > implement we ran into all kinds of errors, but this time we were able to
>> > get it mostly working.
>> >
>> > The issue we seem to be having now is that any time a document is
>> deleted
>> > via deleteById from a collection on the primary node, we are flooded
>> with
>> > "Invalid Number" errors followed by a random sequence of characters when
>> > CDCR tries to sync the update to the backup site. This happens on all of
>> > our collections where our id fields are defined as longs (some of them
>> the
>> > ids are compound keys and are strings).
>> >
>> > Here's a sample exception:
>> >
>> > org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error
>> > from server at http://ip/solr/collection_shard1_replica_n1: Invalid
>> > Number:  ]
>> > -s
>> > at
>> > org.apache.solr.client.solrj.impl.CloudSolrClien

Re: CDCR Invalid Number on deletes

2018-03-07 Thread Chris Troullis
Hey Amrit, thanks for the reply!

I checked out SOLR-12036, but it doesn't look like it has to do with CDCR,
and the patch that is attached doesn't look CDCR related. Are you sure
that's the correct JIRA number?

Thanks,

Chris

On Wed, Mar 7, 2018 at 11:21 AM, Amrit Sarkar <sarkaramr...@gmail.com>
wrote:

> Hey Chris,
>
> I figured a separate issue while working on CDCR which may relate to your
> problem. Please see jira: *SOLR-12063*
> <https://issues.apache.org/jira/projects/SOLR/issues/SOLR-12063>. This is a
> bug that got introduced when we added support for the bidirectional approach,
> where an extra flag is added to the tlog entry for CDCR.
>
> This part of the code is messing up:
> *UpdateLog.java.RecentUpdates::update()::*
>
> switch (oper) {
>   case UpdateLog.ADD:
>   case UpdateLog.UPDATE_INPLACE:
>   case UpdateLog.DELETE:
>   case UpdateLog.DELETE_BY_QUERY:
> Update update = new Update();
> update.log = oldLog;
> update.pointer = reader.position();
> update.version = version;
>
> if (oper == UpdateLog.UPDATE_INPLACE && entry.size() == 5) {
>   update.previousVersion = (Long) entry.get(UpdateLog.PREV_VERSION_IDX);
> }
> updatesForLog.add(update);
> updates.put(version, update);
>
> if (oper == UpdateLog.DELETE_BY_QUERY) {
>   deleteByQueryList.add(update);
> } else if (oper == UpdateLog.DELETE) {
>   deleteList.add(new DeleteUpdate(version,
> (byte[])entry.get(entry.size()-1)));
> }
>
> break;
>
>   case UpdateLog.COMMIT:
> break;
>   default:
> throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
> "Unknown Operation! " + oper);
> }
>
> deleteList.add(new DeleteUpdate(version, (byte[])entry.get(entry.size()-1)));
>
> is expecting the last entry to be the payload, but everywhere in the
> project *pos:[2]* is the index of the payload, while in / after Solr 7.2 the
> last entry is a *boolean* denoting whether the update is CDCR-forwarded or a
> typical update. UpdateLog.RecentUpdates is used in CDCR sync and checkpoint
> operations, hence it is a legit bug that slipped past the tests I wrote.
>
> The immediate fix patch is uploaded and I am awaiting feedback on that.
> Meanwhile if it is possible for you to apply the patch, build the jar and
> try it out, please do and let us know.
>
> For, *SOLR-9394* <https://issues.apache.org/jira/browse/SOLR-9394>, if you
> can comment on the JIRA and post the sample docs, solr logs, relevant
> information, I can give it a thorough look.
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
>
> On Wed, Mar 7, 2018 at 1:35 AM, Chris Troullis <cptroul...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > We recently upgraded to Solr 7.2.0 as we saw that there were some CDCR
> bug
> > fixes and features added that would finally let us be able to make use of
> > it (bi-directional syncing was the big one). The first time we tried to
> > implement we ran into all kinds of errors, but this time we were able to
> > get it mostly working.
> >
> > The issue we seem to be having now is that any time a document is deleted
> > via deleteById from a collection on the primary node, we are flooded with
> > "Invalid Number" errors followed by a random sequence of characters when
> > CDCR tries to sync the update to the backup site. This happens on all of
> > our collections where our id fields are defined as longs (some of them
> the
> > ids are compound keys and are strings).
> >
> > Here's a sample exception:
> >
> > org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error
> > from server at http://ip/solr/collection_shard1_replica_n1: Invalid
> > Number:  ]
> > -s
> > at
> > org.apache.solr.client.solrj.impl.CloudSolrClient.
> > directUpdate(CloudSolrClient.java:549)
> > at
> > org.apache.solr.client.solrj.impl.CloudSolrClient.
> > sendRequest(CloudSolrClient.java:1012)
> > at
> > org.apache.solr.client.solrj.impl.CloudSolrClient.
> > requestWithRetryOnStaleState(CloudSolrClient.java:883)
> > at
> > org.apache.solr.client.solrj.impl.CloudSolrClient.
> > requestWithRetryOnStaleState(CloudSolrClient.java:945)
> > at
> > org.apache.solr.client.solrj.impl.CloudSolrClient.
> > requestWithRetryOnStaleState(CloudSolrClient.java:945)
> > at
> > org.apache.solr.client.solrj.imp

CDCR Invalid Number on deletes

2018-03-06 Thread Chris Troullis
Hi all,

We recently upgraded to Solr 7.2.0 as we saw that there were some CDCR bug
fixes and features added that would finally let us be able to make use of
it (bi-directional syncing was the big one). The first time we tried to
implement we ran into all kinds of errors, but this time we were able to
get it mostly working.

The issue we seem to be having now is that any time a document is deleted
via deleteById from a collection on the primary node, we are flooded with
"Invalid Number" errors followed by a random sequence of characters when
CDCR tries to sync the update to the backup site. This happens on all of
our collections where our id fields are defined as longs (in some of them the
ids are compound keys and are strings).

Here's a sample exception:

org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error
from server at http://ip/solr/collection_shard1_replica_n1: Invalid
Number:  ]
-s
at
org.apache.solr.client.solrj.impl.CloudSolrClient.directUpdate(CloudSolrClient.java:549)
at
org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1012)
at
org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:883)
at
org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:945)
at
org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:945)
at
org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:945)
at
org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:945)
at
org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:945)
at
org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:816)
at
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
at
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
at
org.apache.solr.handler.CdcrReplicator.sendRequest(CdcrReplicator.java:140)
at
org.apache.solr.handler.CdcrReplicator.run(CdcrReplicator.java:104)
at
org.apache.solr.handler.CdcrReplicatorScheduler.lambda$null$0(CdcrReplicatorScheduler.java:81)
at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)


I'm scratching my head as to the cause of this. It's like it is trying to
deleteById for the value "]", even though that is not the ID for the
document that was deleted from the primary. So I don't know if it is
pulling this from the wrong field somehow or where that value if coming
from.

I found this issue: https://issues.apache.org/jira/browse/SOLR-9394 which
looks related, but doesn't look like it has any traction.

Has anyone else experienced this issue with CDCR, or have any ideas as to
what could be causing this issue?

Thanks,

Chris


Re: Long blocking during indexing + deleteByQuery

2017-11-13 Thread Chris Troullis
I've noticed something weird since implementing the change Shawn suggested,
and I wonder if someone can shed some light on it:

Since changing from delete by query _root_:.. to querying for ids _root_:
and then deleteById(ids from root query), we have started to notice some
facet counts for child document facets not matching the actual query
results. For example, a facet shows a count of 10; clicking on the facet
applies an FQ with a block join to return parent docs, but the number of
results is less than the facet count, when they should match (the facet count
is doing a unique(_root_) so it only counts parents). I suspect that this
may be somehow caused by orphaned child documents since the delete process
changed.
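
(For clarity, the new delete flow is roughly the following SolrJ sketch; the
client/collection handling and paging are simplified assumptions:)

    // query for the ids of the whole parent/child block instead of deleteByQuery on _root_
    SolrQuery q = new SolrQuery("_root_:" + parentId);
    q.setFields("id");
    q.setRows(10000); // assumes the block fits in one page
    QueryResponse rsp = client.query(collection, q);
    List<String> ids = new ArrayList<>();
    for (SolrDocument doc : rsp.getResults()) {
      ids.add(String.valueOf(doc.getFieldValue("id")));
    }
    if (!ids.isEmpty()) {
      client.deleteById(collection, ids); // delete by id rather than by query
    }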

Does anyone know if changing from a DBQ: _root_ to the aforementioned
querying for ids _root_ and delete by id would cause any issues with
deleting child documents? Just trying manually it seems to work fine, but
something is going on in some of our test environments.

Thanks,

Chris

On Thu, Nov 9, 2017 at 2:52 PM, Chris Troullis <cptroul...@gmail.com> wrote:

> Thanks Mike, I will experiment with that and see if it does anything for
> this particular issue.
>
> I implemented Shawn's workaround and the problem has gone away, so that is
> good at least for the time being.
>
> Do we think that this is something that should be tracked in JIRA for 6.X?
> Or should I confirm if it is still happening in 7.X before logging anything?
>
> On Wed, Nov 8, 2017 at 6:23 AM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> I'm not sure this is what's affecting you, but you might try upgrading to
>> Lucene/Solr 7.1; in 7.0 there were big improvements in using multiple
>> threads to resolve deletions:
>> http://blog.mikemccandless.com/2017/07/lucene-gets-concurren
>> t-deletes-and.html
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Tue, Nov 7, 2017 at 2:26 PM, Chris Troullis <cptroul...@gmail.com>
>> wrote:
>>
>> > @Erick, I see, thanks for the clarification.
>> >
>> > @Shawn, Good idea for the workaround! I will try that and see if it
>> > resolves the issue.
>> >
>> > Thanks,
>> >
>> > Chris
>> >
>> > On Tue, Nov 7, 2017 at 1:09 PM, Erick Erickson <erickerick...@gmail.com
>> >
>> > wrote:
>> >
>> > > bq: you think it is caused by the DBQ deleting a document while a
>> > > document with that same ID
>> > >
>> > > No. I'm saying that DBQ has no idea _if_ that would be the case so
>> > > can't carry out the operations in parallel because it _might_ be the
>> > > case.
>> > >
>> > > Shawn:
>> > >
>> > > IIUC, here's the problem. For deleteById, I can guarantee the
>> > > sequencing through the same optimistic locking that regular updates
>> > > use (i.e. the _version_ field). But I'm kind of guessing here.
>> > >
>> > > Best,
>> > > Erick
>> > >
>> > > On Tue, Nov 7, 2017 at 8:51 AM, Shawn Heisey <apa...@elyograg.org>
>> > wrote:
>> > > > On 11/5/2017 12:20 PM, Chris Troullis wrote:
>> > > >> The issue I am seeing is when some
>> > > >> threads are adding/updating documents while other threads are
>> issuing
>> > > >> deletes (using deleteByQuery), solr seems to get into a state of
>> > extreme
>> > > >> blocking on the replica
>> > > >
>> > > > The deleteByQuery operation cannot coexist very well with other
>> > indexing
>> > > > operations.  Let me tell you about something I discovered.  I think
>> > your
>> > > > problem is very similar.
>> > > >
>> > > > Solr 4.0 and later is supposed to be able to handle indexing
>> operations
>> > > > at the same time that the index is being optimized (in Lucene,
>> > > > forceMerge).  I have some indexes that take about two hours to
>> > optimize,
>> > > > so having indexing stop while that happens is a less than ideal
>> > > > situation.  Ongoing indexing is similar in many ways to a merge,
>> enough
>> > > > that it is handled by the same Merge Scheduler that handles an
>> > optimize.
>> > > >
>> > > > I could indeed add documents to the index without issues at the same
>> > > > time as an optimize, but when I would try my full indexing cycle
>> while
>> > > > an optimize was underway, I found that all operations stopped until
>> the
>> > > > optimize finished.
>> > > >
>> > > > Ultimately what was determined (I think it was Yonik that figured it
>> > > > out) was that *most* indexing operations can happen during the
>> > optimize,
>> > > > *except* for deleteByQuery.  The deleteById operation works just
>> fine.
>> > > >
>> > > > I do not understand the low-level reasons for this, but apparently
>> it's
>> > > > not something that can be easily fixed.
>> > > >
>> > > > A workaround is to send the query you plan to use with
>> deleteByQuery as
>> > > > a standard query with a limited fl parameter, to retrieve matching
>> > > > uniqueKey values from the index, then do a deleteById with that
>> list of
>> > > > ID values instead.
>> > > >
>> > > > Thanks,
>> > > > Shawn
>> > > >
>> > >
>> >
>>
>
>


Re: Long blocking during indexing + deleteByQuery

2017-11-09 Thread Chris Troullis
Thanks Mike, I will experiment with that and see if it does anything for
this particular issue.

I implemented Shawn's workaround and the problem has gone away, so that is
good at least for the time being.

Do we think that this is something that should be tracked in JIRA for 6.X?
Or should I confirm if it is still happening in 7.X before logging anything?

On Wed, Nov 8, 2017 at 6:23 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> I'm not sure this is what's affecting you, but you might try upgrading to
> Lucene/Solr 7.1; in 7.0 there were big improvements in using multiple
> threads to resolve deletions:
> http://blog.mikemccandless.com/2017/07/lucene-gets-
> concurrent-deletes-and.html
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Tue, Nov 7, 2017 at 2:26 PM, Chris Troullis <cptroul...@gmail.com>
> wrote:
>
> > @Erick, I see, thanks for the clarification.
> >
> > @Shawn, Good idea for the workaround! I will try that and see if it
> > resolves the issue.
> >
> > Thanks,
> >
> > Chris
> >
> > On Tue, Nov 7, 2017 at 1:09 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> > > bq: you think it is caused by the DBQ deleting a document while a
> > > document with that same ID
> > >
> > > No. I'm saying that DBQ has no idea _if_ that would be the case so
> > > can't carry out the operations in parallel because it _might_ be the
> > > case.
> > >
> > > Shawn:
> > >
> > > IIUC, here's the problem. For deleteById, I can guarantee the
> > > sequencing through the same optimistic locking that regular updates
> > > use (i.e. the _version_ field). But I'm kind of guessing here.
> > >
> > > Best,
> > > Erick
> > >
> > > On Tue, Nov 7, 2017 at 8:51 AM, Shawn Heisey <apa...@elyograg.org>
> > wrote:
> > > > On 11/5/2017 12:20 PM, Chris Troullis wrote:
> > > >> The issue I am seeing is when some
> > > >> threads are adding/updating documents while other threads are
> issuing
> > > >> deletes (using deleteByQuery), solr seems to get into a state of
> > extreme
> > > >> blocking on the replica
> > > >
> > > > The deleteByQuery operation cannot coexist very well with other
> > indexing
> > > > operations.  Let me tell you about something I discovered.  I think
> > your
> > > > problem is very similar.
> > > >
> > > > Solr 4.0 and later is supposed to be able to handle indexing
> operations
> > > > at the same time that the index is being optimized (in Lucene,
> > > > forceMerge).  I have some indexes that take about two hours to
> > optimize,
> > > > so having indexing stop while that happens is a less than ideal
> > > > situation.  Ongoing indexing is similar in many ways to a merge,
> enough
> > > > that it is handled by the same Merge Scheduler that handles an
> > optimize.
> > > >
> > > > I could indeed add documents to the index without issues at the same
> > > > time as an optimize, but when I would try my full indexing cycle
> while
> > > > an optimize was underway, I found that all operations stopped until
> the
> > > > optimize finished.
> > > >
> > > > Ultimately what was determined (I think it was Yonik that figured it
> > > > out) was that *most* indexing operations can happen during the
> > optimize,
> > > > *except* for deleteByQuery.  The deleteById operation works just
> fine.
> > > >
> > > > I do not understand the low-level reasons for this, but apparently
> it's
> > > > not something that can be easily fixed.
> > > >
> > > > A workaround is to send the query you plan to use with deleteByQuery
> as
> > > > a standard query with a limited fl parameter, to retrieve matching
> > > > uniqueKey values from the index, then do a deleteById with that list
> of
> > > > ID values instead.
> > > >
> > > > Thanks,
> > > > Shawn
> > > >
> > >
> >
>


Re: Long blocking during indexing + deleteByQuery

2017-11-07 Thread Chris Troullis
@Erick, I see, thanks for the clarification.

@Shawn, Good idea for the workaround! I will try that and see if it
resolves the issue.

Thanks,

Chris

On Tue, Nov 7, 2017 at 1:09 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> bq: you think it is caused by the DBQ deleting a document while a
> document with that same ID
>
> No. I'm saying that DBQ has no idea _if_ that would be the case so
> can't carry out the operations in parallel because it _might_ be the
> case.
>
> Shawn:
>
> IIUC, here's the problem. For deleteById, I can guarantee the
> sequencing through the same optimistic locking that regular updates
> use (i.e. the _version_ field). But I'm kind of guessing here.
>
> Best,
> Erick
>
> On Tue, Nov 7, 2017 at 8:51 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> > On 11/5/2017 12:20 PM, Chris Troullis wrote:
> >> The issue I am seeing is when some
> >> threads are adding/updating documents while other threads are issuing
> >> deletes (using deleteByQuery), solr seems to get into a state of extreme
> >> blocking on the replica
> >
> > The deleteByQuery operation cannot coexist very well with other indexing
> > operations.  Let me tell you about something I discovered.  I think your
> > problem is very similar.
> >
> > Solr 4.0 and later is supposed to be able to handle indexing operations
> > at the same time that the index is being optimized (in Lucene,
> > forceMerge).  I have some indexes that take about two hours to optimize,
> > so having indexing stop while that happens is a less than ideal
> > situation.  Ongoing indexing is similar in many ways to a merge, enough
> > that it is handled by the same Merge Scheduler that handles an optimize.
> >
> > I could indeed add documents to the index without issues at the same
> > time as an optimize, but when I would try my full indexing cycle while
> > an optimize was underway, I found that all operations stopped until the
> > optimize finished.
> >
> > Ultimately what was determined (I think it was Yonik that figured it
> > out) was that *most* indexing operations can happen during the optimize,
> > *except* for deleteByQuery.  The deleteById operation works just fine.
> >
> > I do not understand the low-level reasons for this, but apparently it's
> > not something that can be easily fixed.
> >
> > A workaround is to send the query you plan to use with deleteByQuery as
> > a standard query with a limited fl parameter, to retrieve matching
> > uniqueKey values from the index, then do a deleteById with that list of
> > ID values instead.
> >
> > Thanks,
> > Shawn
> >
>


Re: Long blocking during indexing + deleteByQuery

2017-11-07 Thread Chris Troullis
If I am understanding you correctly, you think it is caused by the DBQ
deleting a document while a document with that same ID is being updated by
another thread? I'm not sure that is what is happening here, as we only
delete docs if they no longer exist in the DB, so nothing should be
adding/updating a doc with that ID if it is marked for deletion, as we
don't reuse IDs. I will double check though to confirm.

Also, not sure if relevant, but the DBQ itself returns very quickly, in a
matter of ms, it's the updates that block for a huge amount of time.

On Tue, Nov 7, 2017 at 11:08 AM, Amrit Sarkar <sarkaramr...@gmail.com>
wrote:

> Maybe not a relevant fact on this, but: "addAndDelete" is triggered by
> "Reordering of DBQs"; that means there are non-executed DBQs present in the
> updateLog and an add operation is also received. Solr makes sure DBQs are
> executed first and then the add operation is executed.
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
>
> On Tue, Nov 7, 2017 at 9:19 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > Well, consider what happens here.
> >
> > Solr gets a DBQ that includes document 132 and 10,000,000 other docs
> > Solr gets an add for document 132
> >
> > The DBQ takes time to execute. If it was processing the requests in
> > parallel would 132 be in the index after the delete was over? It would
> > depend on when the DBQ found the doc relative to the add.
> > With this sequence one would expect 132 to be in the index at the end.
> >
> > And it's worse when it comes to distributed indexes. If the updates
> > were sent out in parallel you could end up in situations where one
> > replica contained 132 and another didn't depending on the vagaries of
> > thread execution.
> >
> > Now I didn't write the DBQ code, but that's what I think is happening.
> >
> > Best,
> > Erick
> >
> > On Tue, Nov 7, 2017 at 7:40 AM, Chris Troullis <cptroul...@gmail.com>
> > wrote:
> > > As an update, I have confirmed that it doesn't seem to have anything to
> > do
> > > with child documents, or standard deletes, just deleteByQuery. If I do
> a
> > > deleteByQuery on any collection while also adding/updating in separate
> > > threads I am experiencing this blocking behavior on the non-leader
> > replica.
> > >
> > > Has anyone else experienced this/have any thoughts on what to try?
> > >
> > > On Sun, Nov 5, 2017 at 2:20 PM, Chris Troullis <cptroul...@gmail.com>
> > wrote:
> > >
> > >> Hi,
> > >>
> > >> I am experiencing an issue where threads are blocking for an extremely
> > >> long time when I am indexing while deleteByQuery is also running.
> > >>
> > >> Setup info:
> > >> -Solr Cloud 6.6.0
> > >> -Simple 2 Node, 1 Shard, 2 replica setup
> > >> -~12 million docs in the collection in question
> > >> -Nodes have 64 GB RAM, 8 CPUs, spinning disks
> > >> -Soft commit interval 10 seconds, Hard commit (open searcher false) 60
> > >> seconds
> > >> -Default merge policy settings (Which I think is 10/10).
> > >>
> > >> We have a query-heavy, index-heavy-ish use case. Indexing is constantly
> > >> running throughout the day and can be bursty. The indexing process
> > handles
> > >> both updates and deletes, can spin up to 15 simultaneous threads, and
> > sends
> > >> to solr in batches of 3000 (seems to be the optimal number per trial
> and
> > >> error).
> > >>
> > >> I can build the entire collection from scratch using this method in <
> 40
> > >> mins and indexing is in general super fast (averages about 3 seconds
> to
> > >> send a batch of 3000 docs to solr). The issue I am seeing is when some
> > >> threads are adding/updating documents while other threads are issuing
> > >> deletes (using deleteByQuery), solr seems to get into a state of
> extreme
> > >> blocking on the replica, which results in some threads taking 30+
> > minutes
> > >> just to send 1 batch of 3000 docs. This collection does use child
> > documents
> > >> (hence the delete by query _root_), not sure if that makes a
> > difference, I
> > >> am trying to duplicate on a non-child doc collection. CPU/IO wait
> seems

Re: Long blocking during indexing + deleteByQuery

2017-11-07 Thread Chris Troullis
As an update, I have confirmed that it doesn't seem to have anything to do
with child documents, or standard deletes, just deleteByQuery. If I do a
deleteByQuery on any collection while also adding/updating in separate
threads I am experiencing this blocking behavior on the non-leader replica.
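
For anyone trying to reproduce this, the setup is roughly the following,
stripped down to the essentials (SolrJ, with collection/field names and
ZooKeeper hosts invented for illustration): one thread adds batches of 3000
documents while another thread issues a deleteByQuery, and the add timings on
the non-leader replica are what blow up.

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DbqBlockingRepro {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181")
        .build()) {
      client.setDefaultCollection("testCollection");

      // Thread 1: keep adding documents in batches of 3000.
      Thread adder = new Thread(() -> {
        try {
          for (int batch = 0; batch < 100; batch++) {
            List<SolrInputDocument> docs = new ArrayList<>();
            for (int i = 0; i < 3000; i++) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", UUID.randomUUID().toString());
              doc.addField("title_t", "doc " + i); // placeholder field
              docs.add(doc);
            }
            long start = System.currentTimeMillis();
            client.add(docs);
            System.out.println("add batch took "
                + (System.currentTimeMillis() - start) + " ms");
          }
        } catch (Exception e) {
          e.printStackTrace();
        }
      });

      // Thread 2: issue a deleteByQuery while the adds are in flight.
      Thread deleter = new Thread(() -> {
        try {
          client.deleteByQuery("title_t:obsolete");
        } catch (Exception e) {
          e.printStackTrace();
        }
      });

      adder.start();
      deleter.start();
      adder.join();
      deleter.join();
    }
  }
}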

Has anyone else experienced this/have any thoughts on what to try?

On Sun, Nov 5, 2017 at 2:20 PM, Chris Troullis <cptroul...@gmail.com> wrote:

> Hi,
>
> I am experiencing an issue where threads are blocking for an extremely
> long time when I am indexing while deleteByQuery is also running.
>
> Setup info:
> -Solr Cloud 6.6.0
> -Simple 2 Node, 1 Shard, 2 replica setup
> -~12 million docs in the collection in question
> -Nodes have 64 GB RAM, 8 CPUs, spinning disks
> -Soft commit interval 10 seconds, Hard commit (open searcher false) 60
> seconds
> -Default merge policy settings (Which I think is 10/10).
>
> We have a query-heavy, index-heavy-ish use case. Indexing is constantly
> running throughout the day and can be bursty. The indexing process handles
> both updates and deletes, can spin up to 15 simultaneous threads, and sends
> to solr in batches of 3000 (seems to be the optimal number per trial and
> error).
>
> I can build the entire collection from scratch using this method in < 40
> mins and indexing is in general super fast (averages about 3 seconds to
> send a batch of 3000 docs to solr). The issue I am seeing is when some
> threads are adding/updating documents while other threads are issuing
> deletes (using deleteByQuery), solr seems to get into a state of extreme
> blocking on the replica, which results in some threads taking 30+ minutes
> just to send 1 batch of 3000 docs. This collection does use child documents
> (hence the delete by query _root_), not sure if that makes a difference, I
> am trying to duplicate on a non-child doc collection. CPU/IO wait seems
> minimal on both nodes, so not sure what is causing the blocking.
>
> Here is part of the stack trace on one of the blocked threads on the
> replica:
>
> qtp592179046-576 (576)
> java.lang.Object@608fe9b5
> org.apache.solr.update.DirectUpdateHandler2.addAndDelete​(
> DirectUpdateHandler2.java:354)
> org.apache.solr.update.DirectUpdateHandler2.addDoc0​(
> DirectUpdateHandler2.java:237)
> org.apache.solr.update.DirectUpdateHandler2.addDoc​(
> DirectUpdateHandler2.java:194)
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd​(
> RunUpdateProcessorFactory.java:67)
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd​(
> UpdateRequestProcessor.java:55)
> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd​(
> DistributedUpdateProcessor.java:979)
> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd​(
> DistributedUpdateProcessor.java:1192)
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd​(
> DistributedUpdateProcessor.java:748)
> org.apache.solr.handler.loader.JavabinLoader$1.update​
> (JavabinLoader.java:98)
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.
> readOuterMostDocIterator​(JavaBinUpdateRequestCodec.java:180)
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.
> readIterator​(JavaBinUpdateRequestCodec.java:136)
> org.apache.solr.common.util.JavaBinCodec.readObject​(
> JavaBinCodec.java:306)
> org.apache.solr.common.util.JavaBinCodec.readVal​(JavaBinCodec.java:251)
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.
> readNamedList​(JavaBinUpdateRequestCodec.java:122)
> org.apache.solr.common.util.JavaBinCodec.readObject​(
> JavaBinCodec.java:271)
> org.apache.solr.common.util.JavaBinCodec.readVal​(JavaBinCodec.java:251)
> org.apache.solr.common.util.JavaBinCodec.unmarshal​(JavaBinCodec.java:173)
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal​(
> JavaBinUpdateRequestCodec.java:187)
> org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs​(
> JavabinLoader.java:108)
> org.apache.solr.handler.loader.JavabinLoader.load​(JavabinLoader.java:55)
> org.apache.solr.handler.UpdateRequestHandler$1.load​(
> UpdateRequestHandler.java:97)
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody​(
> ContentStreamHandlerBase.java:68)
> org.apache.solr.handler.RequestHandlerBase.handleRequest​(
> RequestHandlerBase.java:173)
> org.apache.solr.core.SolrCore.execute​(SolrCore.java:2477)
> org.apache.solr.servlet.HttpSolrCall.execute​(HttpSolrCall.java:723)
> org.apache.solr.servlet.HttpSolrCall.call​(HttpSolrCall.java:529)
>
> A cursory search led me to this JIRA https://issues.apache.
> org/jira/browse/SOLR-7836, not sure if related though.
>
> Can anyone shed some light on this issue? We don't do deletes very
> frequently, but it is bringing Solr to its knees when we do, which is
> causing some big problems.
>
> Thanks,
>
> Chris
>


Long blocking during indexing + deleteByQuery

2017-11-05 Thread Chris Troullis
Hi,

I am experiencing an issue where threads are blocking for an extremely long
time when I am indexing while deleteByQuery is also running.

Setup info:
-Solr Cloud 6.6.0
-Simple 2 Node, 1 Shard, 2 replica setup
-~12 million docs in the collection in question
-Nodes have 64 GB RAM, 8 CPUs, spinning disks
-Soft commit interval 10 seconds, Hard commit (open searcher false) 60
seconds
-Default merge policy settings (Which I think is 10/10).

We have a query-heavy, index-heavy-ish use case. Indexing is constantly
running throughout the day and can be bursty. The indexing process handles
both updates and deletes, can spin up to 15 simultaneous threads, and sends
to solr in batches of 3000 (seems to be the optimal number per trial and
error).

I can build the entire collection from scratch using this method in < 40
mins and indexing is in general super fast (averages about 3 seconds to
send a batch of 3000 docs to solr). The issue I am seeing is when some
threads are adding/updating documents while other threads are issuing
deletes (using deleteByQuery), solr seems to get into a state of extreme
blocking on the replica, which results in some threads taking 30+ minutes
just to send 1 batch of 3000 docs. This collection does use child documents
(hence the delete by query _root_), not sure if that makes a difference, I
am trying to duplicate on a non-child doc collection. CPU/IO wait seems
minimal on both nodes, so not sure what is causing the blocking.

Here is part of the stack trace on one of the blocked threads on the
replica:

qtp592179046-576 (576)
java.lang.Object@608fe9b5
org.apache.solr.update.DirectUpdateHandler2.addAndDelete​(DirectUpdateHandler2.java:354)
org.apache.solr.update.DirectUpdateHandler2.addDoc0​(DirectUpdateHandler2.java:237)
org.apache.solr.update.DirectUpdateHandler2.addDoc​(DirectUpdateHandler2.java:194)
org.apache.solr.update.processor.RunUpdateProcessor.processAdd​(RunUpdateProcessorFactory.java:67)
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd​(UpdateRequestProcessor.java:55)
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd​(DistributedUpdateProcessor.java:979)
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd​(DistributedUpdateProcessor.java:1192)
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd​(DistributedUpdateProcessor.java:748)
org.apache.solr.handler.loader.JavabinLoader$1.update​(JavabinLoader.java:98)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator​(JavaBinUpdateRequestCodec.java:180)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator​(JavaBinUpdateRequestCodec.java:136)
org.apache.solr.common.util.JavaBinCodec.readObject​(JavaBinCodec.java:306)
org.apache.solr.common.util.JavaBinCodec.readVal​(JavaBinCodec.java:251)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList​(JavaBinUpdateRequestCodec.java:122)
org.apache.solr.common.util.JavaBinCodec.readObject​(JavaBinCodec.java:271)
org.apache.solr.common.util.JavaBinCodec.readVal​(JavaBinCodec.java:251)
org.apache.solr.common.util.JavaBinCodec.unmarshal​(JavaBinCodec.java:173)
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal​(JavaBinUpdateRequestCodec.java:187)
org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs​(JavabinLoader.java:108)
org.apache.solr.handler.loader.JavabinLoader.load​(JavabinLoader.java:55)
org.apache.solr.handler.UpdateRequestHandler$1.load​(UpdateRequestHandler.java:97)
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody​(ContentStreamHandlerBase.java:68)
org.apache.solr.handler.RequestHandlerBase.handleRequest​(RequestHandlerBase.java:173)
org.apache.solr.core.SolrCore.execute​(SolrCore.java:2477)
org.apache.solr.servlet.HttpSolrCall.execute​(HttpSolrCall.java:723)
org.apache.solr.servlet.HttpSolrCall.call​(HttpSolrCall.java:529)

A cursory search led me to this JIRA
https://issues.apache.org/jira/browse/SOLR-7836, not sure if related though.

Can anyone shed some light on this issue? We don't do deletes very
frequently, but it is bringing Solr to its knees when we do, which is
causing some big problems.

Thanks,

Chris


Re: Inconsistency in results between replicas using CloudSolrClient

2017-08-01 Thread Chris Troullis
Thanks for the reply, Erick. I feared that would be the case. Interesting
idea with using the fq, but I'm not sure I like the performance implications.
I will see how big of a deal it will be in practice; I was just thinking
about this as a hypothetical scenario today, and as you said, we have a lot
of automated tests, so I anticipate this will likely cause issues. I'll give
it some more thought and see if I can come up with any other workarounds.
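
For reference, a rough sketch of what the fq approach would look like from
SolrJ (field name, collection, window, and ZooKeeper hosts are all
placeholders; note that a local param like {!cache=false} has to sit at the
front of the fq value):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ConsistencyWindowQuery {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181")
        .build()) {
      client.setDefaultCollection("myCollection");

      SolrQuery query = new SolrQuery("some user query");
      // Only return documents indexed at least 90 seconds ago (soft commit
      // interval plus some windage), so every replica has had a chance to
      // open a searcher that sees them. "indexed_dt" is a hypothetical
      // timestamp field populated at index time.
      query.addFilterQuery("{!cache=false}indexed_dt:[* TO NOW-90SECONDS]");

      QueryResponse rsp = client.query(query);
      System.out.println("hits: " + rsp.getResults().getNumFound());
    }
  }
}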

-Chris

On Tue, Aug 1, 2017 at 5:38 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> You're understanding is correct.
>
> As for how people cope? Mostly they ignore it. The actual number of
> times people notice this is usually quite small, mostly it surfaces
> when automated test suites are run.
>
> If you must lock this up, and you can stand the latency you could add
> a timestamp for each document and auto-add an FQ clause like:
> fq=timestamp:[* TO NOW-soft_commit_interval_plus_some_windage]
>
> Note, though, that this is not an fq clause that can be re-used, see:
> https://lucidworks.com/2012/02/23/date-math-now-and-filter-queries/ so
> either it'd be something like:
> fq=timestamp:[* TO NOW/MINUTE-soft_commit_interval_plus_some_windage]
> or
> fq=timestamp:{!cache=false}[* TO NOW-soft_commit_interval_plus_
> some_windage]
>
> and would inevitably make the latency between when something was
> indexed and available for search longer.
>
> You can also reduce your soft commit interval to something short, but
> that has other problems.
>
> see: SOLR-6606, but it looks like other priorities have gotten in the
> way of it being committed.
>
> Best,
> Erick
>
> On Tue, Aug 1, 2017 at 1:50 PM, Chris Troullis <cptroul...@gmail.com>
> wrote:
> > Hi,
> >
> > I think I know the answer to this question, but just wanted to verify/see
> > what other people do to address this concern.
> >
> > I have a Solr Cloud setup (6.6.0) with 2 nodes, 1 collection with 1 shard
> > and 2 replicas (1 replica per node). The nature of my use case requires
> > frequent updates to Solr, and documents are being added constantly
> > throughout the day. I am using CloudSolrClient via SolrJ to query my
> > collection and load balance across my 2 replicas.
> >
> > Here's my question:
> >
> > As I understand it, because of the nature of Solr Cloud (eventual
> > consistency), and the fact that the soft commit timings on the 2 replicas
> > will not necessarily be in sync, would it not be possible to run into a
> > scenario where, say a document gets indexed on replica 1 right before a
> > soft commit, but indexed on replica 2 right after a soft commit? In this
> > scenario, using the load balanced CloudSolrClient, wouldn't it be
> possible
> > for a user to do a search, see the newly added document because they got
> > sent to replica 1, and then search again, and the newly added document
> > would disappear from their results since they got sent to replica 2 and
> the
> > soft commit hasn't happened yet?
> >
> > If so, how do people typically handle this scenario in NRT search cases?
> It
> > seems like a poor user experience if things keep disappearing and
> > reappearing from their search results randomly. Currently the only
> thought
> > I have to prevent this is to write (or extend) my own solr client to
> stick
> > a user's session to a specific replica (unless it goes down), but still
> > load balance users between the replicas. But of course then I have to
> > manage all of the things CloudSolrClient manages manually re: cluster
> > state, etc.
> >
> > Can anyone confirm/deny my understanding of how this works/offer any
> > suggestions to eliminate the scenario in question from occurring?
> >
> > Thanks,
> >
> > Chris
>


Inconsistency in results between replicas using CloudSolrClient

2017-08-01 Thread Chris Troullis
Hi,

I think I know the answer to this question, but just wanted to verify/see
what other people do to address this concern.

I have a Solr Cloud setup (6.6.0) with 2 nodes, 1 collection with 1 shard
and 2 replicas (1 replica per node). The nature of my use case requires
frequent updates to Solr, and documents are being added constantly
throughout the day. I am using CloudSolrClient via SolrJ to query my
collection and load balance across my 2 replicas.

Here's my question:

As I understand it, because of the nature of Solr Cloud (eventual
consistency), and the fact that the soft commit timings on the 2 replicas
will not necessarily be in sync, would it not be possible to run into a
scenario where, say a document gets indexed on replica 1 right before a
soft commit, but indexed on replica 2 right after a soft commit? In this
scenario, using the load balanced CloudSolrClient, wouldn't it be possible
for a user to do a search, see the newly added document because they got
sent to replica 1, and then search again, and the newly added document
would disappear from their results since they got sent to replica 2 and the
soft commit hasn't happened yet?

If so, how do people typically handle this scenario in NRT search cases? It
seems like a poor user experience if things keep disappearing and
reappearing from their search results randomly. Currently the only thought
I have to prevent this is to write (or extend) my own solr client to stick
a user's session to a specific replica (unless it goes down), but still
load balance users between the replicas. But of course then I have to
manage all of the things CloudSolrClient manages manually re: cluster
state, etc.

Can anyone confirm/deny my understanding of how this works/offer any
suggestions to eliminate the scenario in question from occurring?

Thanks,

Chris


Re: Seeing odd behavior with implicit routing

2017-05-16 Thread Chris Troullis
Shalin,

Thanks for the response and explanation! I logged a JIRA per your request
here: https://issues.apache.org/jira/browse/SOLR-10695

Chris


On Mon, May 15, 2017 at 3:40 AM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> On Sun, May 14, 2017 at 7:40 PM, Chris Troullis <cptroul...@gmail.com>
> wrote:
> > Hi,
> >
> > I've been experimenting with various sharding strategies with Solr cloud
> > (6.5.1), and am seeing some odd behavior when using the implicit router.
> I
> > am probably either doing something wrong or misinterpreting what I am
> > seeing in the logs, but if someone could help clarify that would be
> awesome.
> >
> > I created a collection using the implicit router, created 10 shards,
> named
> > shard1, shard2, etc. I indexed 3000 documents to each shard, routed by
> > setting the _route_ field on the documents in my schema. All works fine,
> I
> > verified there are 3000 documents in each shard.
> >
> > The odd behavior I am seeing is when I try to route a query to a specific
> > shard. I submitted a simple query to shard1 using the request parameter
> > _route_=shard1. The query comes back fine, but when I looked in the logs,
> > it looked like it was issuing 3 separate requests:
> >
> > 1. The original query to shard1
> > 2. A 2nd query to shard1 with the parameter ids=a bunch of document ids
> > 3. The original query to a random shard (changes every time I run the
> query)
> >
> > It looks like the first query is getting back a list of ids, and the 2nd
> > query is retrieving the documents for those ids? I assume this is some
> solr
> > cloud implementation detail.
> >
> > What I don't understand is the 3rd query. Why is it issuing the original
> > query to a random shard every time, when I am specifying the _route_? The
> > _route_ parameter is definitely doing something, because if I remove it,
> it
> > is querying all shards (which I would expect).
> >
> > Any ideas? I can provide the actual queries from the logs if required.
>
> How many nodes is this collection distributed across? I suspect that
> you are using a single node for experimentation?
>
> What happens with _route_=shard1 parameter and implicit routing is
> that the _route_ parameter is resolved to a list of replicas of
> shard1. But, SolrJ uses only the node name of the replica along with
> the collection name to make the request (this is important, we'll come
> back to this later). So, ordinarily, that node hosts a single shard
> (shard1) and when it receives the request, it will optimize the search
> to go the non-distributed code path (since the replica has all the
> data needed to satisfy the search).
>
> But interesting things happen when the node hosts more than one shard
> (say shard1 and shard3 both). When we query such a node using just the
> collection name, the collection name can be resolved to either shard1
> or shard3 -- this is picked randomly without looking at _route_
> parameter at all. If shard3 is picked, it looks at the request, sees
> that it doesn't have all the necessary data and decides to follow the
> two-phase distributed search path where phase 1 is to get the ids and
> score of the documents matching the query from all participating
> shards (the list of such shards is limited by _route_ parameter, which
> in our case will be only shard1) and a second phase where we get the
> actual stored fields to be returned to the user. So you get three
> queries in the log, 1) phase 1 of distributed search hitting shard1,
> 2) phase two of distributed search hitting shard1 and 3) the
> distributed scatter-gather search run by shard3.
>
> So to recap, this is happening because you have more than one shard
> hosted on a node. Easy workaround is to have each shard hosted on a
> unique node. But we can improve things on the solr side as well by 1)
> having SolrJ resolve requests down to node name and core name, 2)
> having the collection name to core name resolution take _route_ param
> into account. Both 1 and 2 can solve the problem. Can you please open
> a Jira issue?
>
> >
> > Thanks,
> >
> > Chris
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Seeing odd behavior with implicit routing

2017-05-14 Thread Chris Troullis
Hi,

I've been experimenting with various sharding strategies with Solr cloud
(6.5.1), and am seeing some odd behavior when using the implicit router. I
am probably either doing something wrong or misinterpreting what I am
seeing in the logs, but if someone could help clarify that would be awesome.

I created a collection using the implicit router, created 10 shards, named
shard1, shard2, etc. I indexed 3000 documents to each shard, routed by
setting the _route_ field on the documents in my schema. All works fine, I
verified there are 3000 documents in each shard.

The odd behavior I am seeing is when I try to route a query to a specific
shard. I submitted a simple query to shard1 using the request parameter
_route_=shard1. The query comes back fine, but when I looked in the logs,
it looked like it was issuing 3 separate requests:

1. The original query to shard1
2. A 2nd query to shard1 with the parameter ids=a bunch of document ids
3. The original query to a random shard (changes every time I run the query)

It looks like the first query is getting back a list of ids, and the 2nd
query is retrieving the documents for those ids? I assume this is some solr
cloud implementation detail.
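
For reference, the routed query was issued roughly like this from SolrJ
(collection name and ZooKeeper hosts are placeholders); with the implicit
router the _route_ value is simply the shard name:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RoutedQuery {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181")
        .build()) {
      SolrQuery query = new SolrQuery("*:*");
      // Restrict the request to shard1's replicas.
      query.set("_route_", "shard1");

      QueryResponse rsp = client.query("myCollection", query);
      System.out.println("numFound: " + rsp.getResults().getNumFound());
    }
  }
}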

What I don't understand is the 3rd query. Why is it issuing the original
query to a random shard every time, when I am specifying the _route_? The
_route_ parameter is definitely doing something, because if I remove it, it
is querying all shards (which I would expect).

Any ideas? I can provide the actual queries from the logs if required.

Thanks,

Chris


Re: Multiple collections vs multiple shards for multitenancy

2017-05-07 Thread Chris Troullis
Thanks for the great advice Erick. I will experiment with your suggestions
and see how it goes!

Chris

On Sun, May 7, 2017 at 12:34 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Well, you've been doing your homework ;).
>
> bq: I am a little confused on this statement you made:
>
> > Plus you can't commit
> > individually, a commit on one will _still_ commit on all so you're
> > right back where you started.
>
> Never mind. autocommit kicks off on a per replica basis. IOW, when a
> new doc is indexed to a shard (really, any replica) the timer is
> started. So if replica 1_1 gets a doc and replica 2_1 doesn't, there
> is no commit on replica 2_1. My comment was mainly directed at the
> idea that you might issue commits from the client, which are
> distributed to all replicas. However, even in that case the a replica
> that has received no updates won't do anything.
>
> About the hybrid approach. I've seen situations where essentially you
> partition clients along "size" lines. So something like "put clients
> on a shared single-shard collection as long as the aggregate number of
> records is < X". The theory is that the update frequency is roughly
> the same if you have 10 clients with 100K docs each .vs. one client
> with 1M docs. So the pain of opening a new searcher is roughly the
> same. "X" here is experimentally determined.
>
> Do note that moving from master/slave to SolrCloud will reduce
> latency. In M/S, the time it takes to search is autocommit + polling
> interval + autowarm time. Going to SolrCloud will remove the "polling
> interval" from the equation. Not sure how much that helps
>
> There should be an autowarm statistic in the Solr logs BTW. Or some
> messages about "opening searcher (some hex stuff) and another message
> about when it's registered as active, along with timestamps. That'll
> tell you how long it takes to autowarm.
>
> OK. "straw man" strategy for your case. Create a collection per
> tenant. What you want to balance is where the collections are hosted.
> Host a number of small tenants on the same Solr instance and fewer
> larger tenants on other hardware. FWIW, I expect at least 25M docs per
> Solr JVM (very hardware dependent of course), although testing is
> important.
>
> Under the covers, each Solr instance establishes "watchers" on the
> collections it hosts. So if a particular Solr hosts replicas for, say,
> 10 collections, it establishes 10 watchers on the state.json zNode in
> Zookeeper. 300 collections isn't all that much in recent Solr
> installations. All that filtered through how beefy your hardware is of
> course.
>
> Startup is an interesting case, but I've put 1,600 replicas on 4 Solr
> instance on a Mac Pro (400 each). You can configure the number of
> startup threads if starting up is too painful.
>
> So a cluster with 300 collections isn't really straining things. Some
> of the literature is talking about thousands of collections.
>
> Good luck!
> Erick
>
> On Sat, May 6, 2017 at 4:26 PM, Chris Troullis <cptroul...@gmail.com>
> wrote:
> > Hi Erick,
> >
> > Thanks for the reply, I really appreciate it.
> >
> > To answer your questions, we have a little over 300 tenants, and a couple
> > of different collections, the largest of which has ~11 million documents
> > (so not terribly large). We are currently running standard Solr with
> simple
> > master/slave replication, so all of the documents are in a single solr
> > core. We are planning to move to Solr cloud for various reasons, and as
> > discussed previously, I am trying to find the best way to distribute the
> > documents to serve a more NRT focused search case.
> >
> > I totally get your point on pushing back on NRT requirements, and I have
> > done so for as long as I can. Currently our auto softcommit is set to 1
> > minute and we are able to achieve great query times with autowarming.
> > Unfortunately, due to the nature of our application, our customers expect
> > any changes they make to be visible almost immediately in search, and we
> > have recently been getting a lot of complaints in this area, leading to
> an
> > initiative to drive down the time it takes for documents to become
> visible
> > in search. Which leaves me where I am now, trying to find the right
> balance
> > between document visibility and reasonable, stable, query times.
> >
> > Regarding autowarming, our autowarming times aren't too crazy. We are
> > warming a max of 100 entries from the filter cache and it takes around
> 5-10
> > seconds to complete on average. I suspect our biggest

Re: Multiple collections vs multiple shards for multitenancy

2017-05-06 Thread Chris Troullis
> feeding...
>
> Sharding a single large collection and using custom routing to push
> tenants to a single shard will be an administrative problem for you.
> I'm assuming you have the typical multi-tenant problems, a bunch of
> tenants have around N docs, some smaller percentage have 3N and a few
> have 100N. Now you're having to keep track of how many docs are on
> each shard, do the routing yourself, etc. Plus you can't commit
> individually, a commit on one will _still_ commit on all so you're
> right back where you started.
>
> I've seen people use a hybrid approach: experiment with how many
> _documents_ you can have in a collection (however you partition that
> up) and use the multi-tenant approach. So you have N collections and
> each collection has a (varying) number of tenants. This also tends to
> flatten out the update process on the assumption that your smaller
> tenants also don't update their data as often.
>
> However, I really have to question one of your basic statements:
>
> "This works fine with aggressive autowarming, but I have a need to reduce
> my NRT
> search capabilities to seconds as opposed to the minutes it is at now,"...
>
> The implication here is that your autowarming takes minutes. Very
> often people severely overdo the warmup by setting their autowarm
> counts to 100s or 1000s. This is rarely necessary, especially if you
> use docValues fields appropriately. Very often much of autowarming is
> "uninverting" fields (look in your Solr log). Essentially for any
> field you see this, use docValues and loading will be much faster.
>
> You also haven't said how many documents you have in a shard at
> present. This is actually the metric I use most often to size
> hardware. I claim you can find a sweet spot where minimal autowarming
> will give you good enough performance, and that number is what you
> should design to. Of course YMMV.
>
> Finally: push back really hard on how aggressive NRT support needs to
> be. Often "requirements" like this are made without much thought as
> "faster is better, let's make it 1 second!". There are situations
> where that's true, but it comes at a cost. Users may be better served
> by a predictable but fast system than one that's fast but
> unpredictable. "Documents may take up to 5 minutes to appear and
> searches will usually take less than a second" is nice and concise. I
> have my expectations. "Documents are searchable in 1 second, but the
> results may not come back for between 1 and 10 seconds" is much more
> frustrating.
>
> FWIW,
> Erick
>
> On Sat, May 6, 2017 at 5:12 AM, Chris Troullis <cptroul...@gmail.com>
> wrote:
> > Hi,
> >
> > I use Solr to serve multiple tenants and currently all tenants' data
> > resides in one large collection, and queries have a tenant identifier.
> This
> > works fine with aggressive autowarming, but I have a need to reduce my
> NRT
> > search capabilities to seconds as opposed to the minutes it is at now,
> > which will mean drastically reducing if not eliminating my autowarming.
> As
> > such I am considering splitting my index out by tenant so that when one
> > tenant modifies their data it doesn't blow away all of the searcher based
> > caches for all tenants on soft commit.
> >
> > I have done a lot of research on the subject and it seems like Solr Cloud
> > can have problems handling large numbers of collections. I'm obviously
> > going to have to run some tests to see how it performs, but my main
> > question is this: are there pros and cons to splitting the index into
> > multiple collections vs having 1 collection but splitting into multiple
> > shards? In my case I would have a shard per tenant and use implicit
> routing
> > to route to that specific shard. As I understand it a shard is basically
> > it's own lucene index, so I would still be eating that overhead with
> either
> > approach. What I don't know is if there are any other overheads involved
> > WRT collections vs shards, routing, zookeeper, etc.
> >
> > Thanks,
> >
> > Chris
>


Multiple collections vs multiple shards for multitenancy

2017-05-06 Thread Chris Troullis
Hi,

I use Solr to serve multiple tenants and currently all tenants' data
resides in one large collection, and queries have a tenant identifier. This
works fine with aggressive autowarming, but I have a need to reduce my NRT
search capabilities to seconds as opposed to the minutes it is at now,
which will mean drastically reducing if not eliminating my autowarming. As
such I am considering splitting my index out by tenant so that when one
tenant modifies their data it doesn't blow away all of the searcher based
caches for all tenants on soft commit.

I have done a lot of research on the subject and it seems like Solr Cloud
can have problems handling large numbers of collections. I'm obviously
going to have to run some tests to see how it performs, but my main
question is this: are there pros and cons to splitting the index into
multiple collections vs having 1 collection but splitting into multiple
shards? In my case I would have a shard per tenant and use implicit routing
to route to that specific shard. As I understand it a shard is basically
its own Lucene index, so I would still be eating that overhead with either
approach. What I don't know is if there are any other overheads involved
WRT collections vs shards, routing, zookeeper, etc.
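
To make the shard-per-tenant option concrete, here is a rough sketch of how
indexing and querying would work with the implicit router (collection, shard,
field names, and ZooKeeper hosts are invented for illustration): each document
carries a _route_ field naming its tenant's shard, and queries pass the same
value as the _route_ parameter alongside the existing tenant filter query.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class TenantShardRoutingSketch {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181")
        .build()) {
      client.setDefaultCollection("tenantCollection");

      String tenantId = "tenant42";
      String tenantShard = "shard_" + tenantId; // one implicit-router shard per tenant

      // Index: the _route_ field on the document sends it to the tenant's shard.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", tenantId + "-12345");
      doc.addField("tenantId", tenantId);
      doc.addField("_route_", tenantShard);
      client.add(doc);

      // Query: _route_ limits the request to that shard, and the tenant
      // filter query is kept as it is today.
      SolrQuery query = new SolrQuery("name:foo");
      query.set("_route_", tenantShard);
      query.addFilterQuery("tenantId:" + tenantId);
      client.query(query);
    }
  }
}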

Thanks,

Chris


Sharding strategy for optimal NRT performance

2017-04-15 Thread Chris Troullis
Hi!

I am looking for some advice on a sharding strategy that will produce
optimal performance in the NRT search case for my setup. I have come up
with a strategy that I think will work based on my experience, testing, and
reading of similar questions on the mailing list, but I was hoping to run
my idea by some experts to see if I am on the right track or am completely
off base.

*Let's start off with some background info on my use case*:

We are currently using Solr (5.5.2) with the classic Master/Slave setup.
Because of our NRT requirements, the slave is pretty much only used for
failover, all writes/reads go to the master (which I know is not ideal, but
that's what we're working with!). We have 6 different indexes with
completely different schemas for various searches in our application. We
have just over 300 tenants, which currently all reside within the same
index for each of our indexes. We separate our tenants at query time via a
filter query with a tenant identifier (which works fine). Each index is not
tremendously large, they range from 1M documents to the largest being
around 12M documents. Our load is not huge as search is not the core
functionality of our application, but merely a tool to get to what they are
looking for in the app. I believe our peak load barely goes over 1 QPS.
Even though our number of documents isn't super high, we do some pretty
complex faceting, and block joins in some cases, which along with crappy
hardware in our data center (no SSDs) initially led to some pretty poor
query times for our customers. This was due to the fact that we are
constantly indexing throughout the day (job that runs once per minute), and
we auto soft commit (openSearcher=true) every 1 minute. Because of the
nature of our application, NRT updates are necessary. As we all know,
opening searches this frequently has the drawback of invalidating all of
our searcher-based caches, causing query times to be erratic, and slower on
average. With our current setup, we have solved our query performance times
by setting up autowarming, both on the filter cache, and via static warming
queries.

*The problem:*

So now for the problem. While we are now running great from a performance
perspective, we are receiving complaints from customers saying that the
changes they are making are slow to be reflected in search. Because of the
nature of our application, this has significant impact on their user
experience, and is an issue we need to solve. Overall, we would like to be
able to reduce our NRT visibility from the minutes we have now down to
seconds. The problem is doing this in a way that won't significantly affect
our query performance. We are already seeing maxWarmingSearchers warnings
in our logs occasionally with our current setup, so just indexing more
frequently is not a viable solution. In addition to this, autowarming in
itself is problematic for the NRT use case, as the new searcher won't start
serving requests until it is fully warmed anyway, which is sort of counter
to the goal of decreasing the time it takes for new documents to be visible
in search. And so this is the predicament we find ourselves in. We can
index more frequently (and soft commit more frequently), but we will have
to remove (or greatly decrease) our autowarming, which will destroy our
search performance. Obviously there is some give and take here, we can't
have true NRT search with optimal query performance, but I am hoping to
find a solution that will provide acceptable results for both.

*Proposed solution:*

I have done a lot of research and experimentation on this issue and have
started coming up with what I believe will be a decent solution to the
aforementioned problem. First off, I would like to make the move over to
Solr Cloud. We had been contemplating this for a while anyway, as we
currently have no load balancing at all (since our slave is just used for
failover), but I am also thinking that by using the right sharding strategy
we can improve our NRT experience as well. I first started looking at the
standard composite id routing, and while we can ensure that all of a single
tenant's data is located on the same shard, because there is a large
discrepancy between the amounts of data our tenants have, our shards would
be very unevenly distributed in terms of number of documents. Ideally, we
would like all of our tenants to be isolated from a performance perspective
(from a security perspective we are not really concerned, as all of our
queries have a tenant identifier filter query already). Basically, we don't
want tiny tenant A to be screwed over because they were unlucky enough to
land on Huge tenant B's shard. We do know the footprint of each tenant in
terms of number of documents, so technically we could work out a sharding
strategy manually which would evenly distribute the tenants based on number
of documents, but since we have 6 different indexes, and with each index
the tenant's document distribution will be