RE: Partial replication blocks subsequent requests when using solrcloud and master/slave replication

2016-11-22 Thread Jeremy Hoy
Thanks for your response Erick.

Yes, this looks like the issue we're seeing.  I'll paste the relevant parts of 
my email into the issue.  The scenario you're describing is a little different 
to ours, so I don't think the naïve patch we have necessarily suits.  I'll do a 
bit more research and see if I can put something better together.

Incidentally, if the blocking behaviour is intended, I'm sure it's not implemented 
quite right, since it's entirely possible for a new searcher to be created in 
between the point where the existing searcher is closed and the point where the 
new index writer gets the write lock, but we really don't want that fixed! ;)

Thanks again,

Jeremy

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 22 November 2016 18:37
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Partial replication blocks subsequent requests when using 
solrcloud and master/slave replication

This sounds a lot like:

https://issues.apache.org/jira/browse/SOLR-9706


Could you attach your patch to that issue if you think it's the same?
And please copy/paste your e-mail in a comment if you would; you've obviously 
done more research on the cause than I did, and that'd save some work whenever 
someone picks it up.

It's unclear to me whether this is intentional behavior or an accident of the code; 
either way, having a place to start when analyzing it is much appreciated.


Best,
Erick

On Tue, Nov 22, 2016 at 10:02 AM, Jeremy Hoy <j...@findmypast.com> wrote:
> Hi All,
>
> We're running a fairly non-standard solr configuration.  We ingest into named 
> shards in master cores and then replicate out to slaves running solr cloud.  
> So in effect we are using solrcloud only to manage the config files and, more 
> importantly, to look after the cluster state.  Our corpus and search workload 
> is such that this makes sense: it reduces the need to query every shard for 
> each search, since the majority of queries contain values that allow us to 
> target the search towards the shards holding the appropriate documents, and it 
> also isolates the searching slaves from the costs of indexing (we index fairly 
> infrequently, but in fairly large volumes).  I'm happy to expand on this if 
> anyone is interested, or take suggestions as to how we might better be 
> doing things.
>
> We've been running 4.6.0 for the past 3 years or so, but have recently 
> upgraded to 5.5.2 - we'll likely be upgrading to 6.3.0 shortly.  However, we 
> hit a problem when running 5.5.2, which we have also reproduced in 6.2.1 and 
> 6.3.0.  When a partial replication starts, it will usually block all 
> subsequent requests to solr whilst replication continues in the background.  
> Whilst in this blocked state we took thread dumps using VisualVM; this is what 
> we see when running 6.3.0:
>
> "explicit-fetchindex-cmd" - Thread t@71
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> ..
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1463)
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1429)
> at 
> org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:855)
> at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:434)
> at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:251)
> at 
> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:397)
> at 
> org.apache.solr.handler.ReplicationHandler.lambda$handleRequestBody$0(ReplicationHandler.java:279)
> at 
> org.apache.solr.handler.ReplicationHandler$$Lambda$82/776974667.run(Unknown 
> Source)
> at java.lang.Thread.run(Thread.java:745)
>
>Locked ownable synchronizers:
> - locked <4c18799d> (a
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
>
> - locked <64a00f> (a
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>
> and
>
> "qtp1873653341-61" - Thread t@61
>java.lang.Thread.State: TIMED_WAITING
> at sun.misc.Unsafe.park(Native Method)
> - waiting to lock <64a00f> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) owned by 
> "explicit-fetchindex-cmd" t@71
> at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>
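
As an aside, here is what the shard-targeting approach described in the quoted message 
above could look like from the client side.  This is only a minimal SolrJ sketch: the 
host names, core names, field and routing values are hypothetical (not taken from the 
setup described in the mail), and it assumes SolrJ 6.x for the Builder API.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ShardTargetedQuery {
        public static void main(String[] args) throws Exception {
            // Hypothetical slave node and core name.
            try (HttpSolrClient client =
                     new HttpSolrClient.Builder("http://slave1:8983/solr/records").build()) {
                SolrQuery q = new SolrQuery("surname:smith");
                // Restrict the query to the shards known to hold the matching documents,
                // instead of fanning out to every shard for every search.
                q.set("shards", "slave1:8983/solr/records_1851,slave2:8983/solr/records_1861");
                QueryResponse rsp = client.query(q);
                System.out.println("Found " + rsp.getResults().getNumFound() + " documents");
            }
        }
    }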

Re: Partial replication blocks subsequent requests when using solrcloud and master/slave replication

2016-11-22 Thread Erick Erickson
This sounds a lot like:

https://issues.apache.org/jira/browse/SOLR-9706


Could you attach your patch to that issue if you think it's the same?
And please copy/paste your e-mail in a comment if you would; you've
obviously done more research on the cause than I did, and that'd save
some work whenever someone picks it up.

It's unclear to me whether this is intentional behavior or an accident
of the code; either way, having a place to start when analyzing it is
much appreciated.


Best,
Erick

On Tue, Nov 22, 2016 at 10:02 AM, Jeremy Hoy  wrote:
> Hi All,
>
> We're running a fairly non-standard solr configuration.  We ingest into named 
> shards in master cores and then replicate out to slaves running solr cloud.  
> So in effect we are using solrcloud only to manage the config files and, more 
> importantly, to look after the cluster state.  Our corpus and search workload 
> is such that this makes sense: it reduces the need to query every shard for 
> each search, since the majority of queries contain values that allow us to 
> target the search towards the shards holding the appropriate documents, and it 
> also isolates the searching slaves from the costs of indexing (we index fairly 
> infrequently, but in fairly large volumes).  I'm happy to expand on this if 
> anyone is interested, or take suggestions as to how we might better be 
> doing things.
>
> We've been running 4.6.0 for the past 3 years or so, but have recently 
> upgraded to 5.5.2 - we'll likely be upgrading to 6.3.0 shortly.  However, we 
> hit a problem when running 5.5.2, which we have also reproduced in 6.2.1 and 
> 6.3.0.  When a partial replication starts, it will usually block all 
> subsequent requests to solr whilst replication continues in the background.  
> Whilst in this blocked state we took thread dumps using VisualVM; this is what 
> we see when running 6.3.0:
>
> "explicit-fetchindex-cmd" - Thread t@71
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> ..
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1463)
> at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1429)
> at 
> org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:855)
> at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:434)
> at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:251)
> at 
> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:397)
> at 
> org.apache.solr.handler.ReplicationHandler.lambda$handleRequestBody$0(ReplicationHandler.java:279)
> at 
> org.apache.solr.handler.ReplicationHandler$$Lambda$82/776974667.run(Unknown 
> Source)
> at java.lang.Thread.run(Thread.java:745)
>
>Locked ownable synchronizers:
> - locked <4c18799d> (a 
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
>
> - locked <64a00f> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>
> and
>
> "qtp1873653341-61" - Thread t@61
>java.lang.Thread.State: TIMED_WAITING
> at sun.misc.Unsafe.park(Native Method)
> - waiting to lock <64a00f> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) owned by 
> "explicit-fetchindex-cmd" t@71
> at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
> at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.tryLock(ReentrantReadWriteLock.java:871)
> at 
> org.apache.solr.update.DefaultSolrCoreState.lock(DefaultSolrCoreState.java:159)
> at 
> org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:104)
> at 
> org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1781)
> at 
> org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1931)
> at 
> org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
> at 
> org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1577)
> .
>
>
> The cause of the problem seems to be that in IndexFetcher.fetchLatestIndex, 
> when running as solrcloud, the searcher is shut down prior to cleaning up 
> the existing segment files and downloading the new ones.
>
> 6.3.0 - Lines (407-409)
> if (solrCore.getCoreDescriptor().getCoreContainer().isZooKeeperAware()) {
>     solrCore.closeSearcher();
> }
>
>
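
For anyone digging into this, the lock interaction visible in the thread dumps above 
can be reproduced in isolation with plain JDK classes.  The sketch below is not Solr 
code; it only demonstrates the pattern the dumps show: one thread (standing in for 
"explicit-fetchindex-cmd") holds the write lock of a ReentrantReadWriteLock for the 
duration of a slow task, while a second thread (standing in for the qtp request thread 
that reaches DefaultSolrCoreState.getIndexWriter via openNewSearcher) waits on a timed 
tryLock of the read lock.

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class ReplicationLockDemo {
        public static void main(String[] args) throws Exception {
            ReentrantReadWriteLock iwLock = new ReentrantReadWriteLock();

            // Stand-in for the replication thread: holds the write lock while a
            // long-running "fetch" is in progress.
            Thread fetcher = new Thread(() -> {
                iwLock.writeLock().lock();
                try {
                    System.out.println("fetcher: write lock held, downloading...");
                    Thread.sleep(10_000); // simulate a slow partial replication
                } catch (InterruptedException ignored) {
                } finally {
                    iwLock.writeLock().unlock();
                    System.out.println("fetcher: write lock released");
                }
            }, "explicit-fetchindex-cmd");

            // Stand-in for a request thread: needs the read lock (as getIndexWriter does)
            // and simply waits until the write lock is released or the timeout expires.
            Thread request = new Thread(() -> {
                try {
                    boolean got = iwLock.readLock().tryLock(5, TimeUnit.SECONDS);
                    System.out.println("request: got read lock? " + got);
                    if (got) {
                        iwLock.readLock().unlock();
                    }
                } catch (InterruptedException ignored) {
                }
            }, "qtp-request");

            fetcher.start();
            Thread.sleep(200); // let the fetcher take the write lock first
            request.start();
            fetcher.join();
            request.join();
        }
    }

Assuming the real DefaultSolrCoreState.lock() retries its timed tryLock rather than 
giving up after one attempt, that would explain why requests appear to hang for the 
whole replication rather than failing after a single timeout.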