Hi Shawn,
Here is an update: we found the main issue.
We have configured our cluster to run under Jetty, and when we tried full
indexing we did not see the original Invalid Chunk error. However, the
replicas still went into recovery.
All this time we had been looking at the replica logs to diagnose the
issue, but the problem seems to be on the leader side. When we looked into
the leader logs, we found the following on all the leaders:

3439873 [qtp1314570047-92] WARN  org.apache.solr.update.processor.DistributedUpdateProcessor  – Error sending update
*java.lang.IllegalStateException: Queue full*
        at java.util.AbstractQueue.add(AbstractQueue.java:98)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner$1.writeTo(ConcurrentUpdateSolrServer.java:182)
        at org.apache.http.entity.EntityTemplate.writeTo(EntityTemplate.java:69)
        at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:89)
        at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
        at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117)
        at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:203)
        at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
3439874 [qtp1314570047-92] INFO  org.apache.solr.common.cloud.SolrZkClient  – makePath: /collections/search1/leader_initiated_recovery/shard7/core_node214
3439879 [qtp1314570047-92] INFO  org.apache.solr.cloud.ZkController  – Wrote down to /collections/search1/leader_initiated_recovery/shard7/core_node214
3439879 [qtp1314570047-92] INFO  org.apache.solr.cloud.ZkController  – Put replica core=search1_shard7_replica1 coreNodeName=core_node214 on XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:8580_solr into leader-initiated recovery.
3439880 [qtp1314570047-92] WARN  org.apache.solr.cloud.ZkController  – Leader is publishing core=search1_shard7_replica1 coreNodeName=core_node214 state=down on behalf of un-reachable replica http://XXXXXXXXXXXXXXX:YYYYYY/solr/search1_shard7_replica1/; forcePublishState? false
3439881 [zkCallback-2-thread-12] INFO  org.apache.solr.cloud.DistributedQueue  – LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
3439881 [qtp1314570047-92] ERROR org.apache.solr.update.processor.DistributedUpdateProcessor  – *Setting up to try to start recovery on replica http://XXXXXXXXXXXXXXXXXXXXXXXXXXX:YYYYYYYYYY/solr/search1_shard7_replica1/ after: java.lang.IllegalStateException: Queue full*
3439882 [qtp1314570047-92] INFO  org.apache.solr.core.SolrCore  – [search1_shard7_replica2] webapp=/solr path=/update params={wt=javabin&version=2} status=0 QTime=2608
3439882 [updateExecutor-1-thread-153] INFO  org.apache.solr.cloud.LeaderInitiatedRecoveryThread  – LeaderInitiatedRecoveryThread-search1_shard7_replica1 started running to send REQUESTRECOVERY command to http://XXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYY/solr/search1_shard7_replica1/; will try for a max of 600 secs
3439882 [updateExecutor-1-thread-153] INFO  org.apache.solr.cloud.LeaderInitiatedRecoveryThread  – Asking core=search1_shard7_replica1 coreNodeName=core_node214 on http://XXXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYY/solr to recover
3439885 [OverseerStateUpdate-93213309377511456-XXXXXXXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYY_solr-n_0000002822] INFO  org.apache.solr.cloud.Overseer


There is a similar bug reported around this:
https://issues.apache.org/jira/browse/SOLR-5850

It seems to still be in OPEN status. Is there a way we can configure the
queue size and increase it? Or is there a version of Solr that already has
this issue resolved?
Can you suggest where we go from here to resolve this? We can repatch the
war file if that is what you would recommend.
In the end, our initial speculation that Solr could not handle this many
updates was correct; we do not see this issue when the update load is lower.
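Just to make it concrete, here is a rough SolrJ sketch of what we mean by
"configuring the queue size". The class name, URL, and numbers below are
made-up examples we have not tested; on the client side the bounded queue
and the runner-thread count are simply constructor arguments of
ConcurrentUpdateSolrServer, so presumably a patched Solr would need to
expose the same knobs for the servers it uses to forward updates to replicas:

import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class QueueSizeSketch {
    public static void main(String[] args) throws IOException, SolrServerException {
        // SolrJ 4.x constructor: (url, queueSize, threadCount). A larger queue
        // and/or more runner threads gives the sender more headroom before a
        // bounded-queue add() can fail with "Queue full".
        ConcurrentUpdateSolrServer server =
                new ConcurrentUpdateSolrServer("http://localhost:8983/solr/search1", 1000, 4);

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "example-1");
        server.add(doc);               // queued; runner threads stream it to Solr
        server.blockUntilFinished();   // wait until the queue is drained
        server.shutdown();
    }
}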




Here is the log from the replica, FYI. There is no Invalid Chunk error under
Jetty, but the issue at hand is unrelated to that.

2382763 [Thread-74] INFO  org.apache.solr.cloud.ZkController  – Wrote recovering to /collections/search1/leader_initiated_recovery/shard7/core_node214
2382776 [Thread-74] INFO  org.apache.solr.update.DefaultSolrCoreState  – Running recovery - first canceling any ongoing recovery
2382777 [Thread-74] WARN  org.apache.solr.cloud.RecoveryStrategy  – Stopping recovery for core=search1_shard7_replica1 coreNodeName=core_node214
2382778 [RecoveryThread] INFO  org.apache.solr.cloud.RecoveryStrategy  – Starting recovery process.  core=search1_shard7_replica1 recoveringAfterStartup=false
2382779 [RecoveryThread] INFO  org.apache.solr.cloud.RecoveryStrategy  – Publishing state of core search1_shard7_replica1 as recovering, leader is http://XXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYY/solr/search1_shard7_replica2/ and I am http://XXXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYY/solr/search1_shard7_replica1/
2382779 [RecoveryThread] INFO  org.apache.solr.cloud.ZkController  – publishing core=search1_shard7_replica1 state=recovering collection=search1
2382779 [RecoveryThread] INFO  org.apache.solr.cloud.ZkController  – numShards not found on descriptor - reading it from system property
2382785 [zkCallback-2-thread-5] INFO  org.apache.solr.common.cloud.ZkStateReader  – A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 40)
2382796 [RecoveryThread] INFO  org.apache.solr.cloud.RecoveryStrategy  – Sending prep recovery command to http://XXXXXXXXXXXXXX:YYYYYYYYYYYYYYYY/solr; WaitForState: action=PREPRECOVERY&core=search1_shard7_replica2&nodeName=XXXXXXXXXXXXXXXXXXXXXX%3AXXXXXXXXXXX_solr&coreNodeName=core_node214&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true
2382953 [zkCallback-2-thread-5] INFO  org.apache.solr.common.cloud.ZkStateReader  – A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 40)
2385815 [RecoveryThread] INFO  org.apache.solr.cloud.RecoveryStrategy  – Attempting to PeerSync from http://sXXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYYYYYYYY/solr/search1_shard7_replica2/ core=search1_shard7_replica1 - recoveringAfterStartup=false
2385815 [RecoveryThread] INFO  org.apache.solr.update.PeerSync  – PeerSync: core=search1_shard7_replica1 url=http://XXXXXXXXXXXXXX:YYYYYYYY/solr START replicas=[http://XXXXXXXXXXXXXXXXXXXXXX:YYYYYYYYYY/solr/search1_shard7_replica2/] nUpdates=100
2385816 [RecoveryThread] WARN  org.apache.solr.update.PeerSync  – PeerSync: core=search1_shard7_replica1 url=http://XXXXXXXXXXXXXXXX:YYYYYYYYYYYYYY/solr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates
2385817 [RecoveryThread] INFO  org.apache.solr.cloud.RecoveryStrategy  – PeerSync Recovery was not successful - trying replication. core=search1_shard7_replica1
2385817 [RecoveryThread] INFO  org.apache.solr.cloud.RecoveryStrategy  – Starting Replication Recovery. core=search1_shard7_replica1
2385817 [RecoveryThread] INFO  org.apache.solr.cloud.RecoveryStrategy  – Begin buffering updates. core=search1_shard7_replica1
2385817 [RecoveryThread] INFO  org.apache.solr.update.UpdateLog  – Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=tlog{file=/opt/solr/solrnodes/solrnode1/search1_shard7_replica1/data/tlog/tlog.0000000000000000407 refcount=1}}
2385817 [RecoveryThread] INFO  org.apache.solr.cloud.RecoveryStrategy  – Attempting to replicate from http://XXXXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYY/solr/search1_shard7_replica2/. core=search1_shard7_replica1
2386469 [RecoveryThread] INFO  org.apache.solr.handler.SnapPuller  –  No value set for 'pollInterval'. Timer Task not started.
2386477 [RecoveryThread] INFO  org.apache.solr.handler.SnapPuller  – Master's generation: 963
2386478 [RecoveryThread] INFO  org.apache.solr.handler.SnapPuller  – Slave's generation: 962
2386478 [RecoveryThread] INFO  org.apache.solr.handler.SnapPuller  – Starting replication process
2386489 [RecoveryThread] INFO  org.apache.solr.handler.SnapPuller  – Number of files in latest index in master: 250
2386489 [RecoveryThread] INFO  org.apache.solr.core.CachingDirectoryFactory  – return new directory for /opt/solr/solrnodes/solrnode1/search1_shard7_replica1/data/index.20150127135840219
2386491 [RecoveryThread] WARN  org.apache.solr.handler.SnapPuller  – File _89e.fdt expected to be 811432 while it is 4204287
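
One more observation on the replica log above: PeerSync starts with
nUpdates=100 and then fails with "too many updates received since start",
which forces the much heavier replication recovery. If we understand it
correctly, keeping more entries in the update log is one of the suggested
mitigations for PeerSync failing under a heavy indexing rate, so something
like the following in solrconfig.xml might help. The numbers are guesses on
our part, assuming numRecordsToKeep/maxNumLogsToKeep are honored by our 4.x
version; we have not tried this yet:

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <!-- keep more records in the transaction log so PeerSync has a better
         chance of overlapping with the leader after a hiccup -->
    <int name="numRecordsToKeep">10000</int>
    <int name="maxNumLogsToKeep">100</int>
  </updateLog>
</updateHandler>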

-regards
Vijay


On Tue, Jan 27, 2015 at 12:07 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 1/26/2015 9:34 PM, Vijay Sekhri wrote:
> > Hi Shawn, Erick
> > So it turned out that once we increased our indexing rate to the original
> > full indexing rate  the replicas went back into recovery no matter what
> the
> > zk timeout setting was. Initially we though that increasing the timeout
> is
> > helping but apparently not . We just decreased indexing rate and that
> > caused less replicas to go in recovery. Once we have our full indexing
> rate
> > almost all replicas went into recovery no matter what the zk timeout or
> the
> > ticktime setting were. We reverted back the ticktime to original 2
> seconds
> >
> > So we investigated further and after checking the logs we found this
> > exception happening right before the recovery process is initiated. We
> > observed this on two different replicas that went into recovery. We are
> not
> > sure if this is a coincidence or a real problem . Notice we were also
> > putting some search query load while indexing to trigger the recovery
> > behavior
>
> <snip>
>
> > 22:00:40,861 ERROR [org.apache.solr.core.SolrCore]
> (http-/10.235.46.36:8580-32)
> > ClientAbortException: * java.io.IOException: JBWEB002020: Invalid chunk
> > header*
>
> One possibility that my searches on that exception turned up is that
> this is some kind of a problem in the servlet container, and the
> information I can see suggests it may be a bug in JBoss, and the
> underlying cause is changes in newer releases of Java 7.  Your
> stacktraces do seem to mention jboss classes, so that seems likely.  The
> reason that we only recommend running under the Jetty that comes with
> Solr, which has a tuned config, is because that's the only servlet
> container that actually gets tested.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1104273
> https://bugzilla.redhat.com/show_bug.cgi?id=1154028
>
> I can't really verify any other possibility.
>
> Thanks,
> Shawn
>
>


-- 
*********************************************
Vijay Sekhri
*********************************************
