We also see an accumulation of tlog files on the target Solr clouds. One of our 
production clouds crashed due to too many open files:
2019-09-11 15:59:39.570 ERROR (qtp1355531311-81540) 
[c:bioreliance-catalog-testarticle-20190713 s:shard2 r:core_node8 
x:bioreliance-catalog-testarticle-20190713_shard2_replica_n6] 
o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: 
java.io.FileNotFoundException: 
/var/solr/data/bioreliance-catalog-testarticle-20190713_shard2_replica_n6/data/tlog/tlog.0000000000000005307.1642472809370222592
 (Too many open files)

We found 9106 open files. 
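A rough way to count them yourself (a sketch, assuming a single Solr process per 
host launched via the bundled Jetty start.jar, and that lsof is installed; adjust 
the pgrep pattern for your install):

$ SOLR_PID=$(pgrep -f start.jar)                    # assumes one Solr JVM matching start.jar
$ ls /proc/$SOLR_PID/fd | wc -l                     # total descriptors held by the Solr JVM
$ lsof -p $SOLR_PID | grep '/data/tlog/' | wc -l    # how many of those are tlog files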

This is our update handler configuration from solrconfig.xml:
<updateHandler class="solr.DirectUpdateHandler2">

    <!-- Enables a transaction log, used for real-time get, durability,
         and SolrCloud replica recovery.  The log can grow as big as
         uncommitted changes to the index, so use of a hard autoCommit
         is recommended (see below).
         "dir" - the target directory for transaction logs, defaults to the
                 solr data directory.  -->

    <updateLog class="solr.CdcrUpdateLog">
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>

    <!-- AutoCommit -->
     <autoCommit> 
       <maxTime>${solr.autoCommit.maxTime:60000}</maxTime> 
       <openSearcher>false</openSearcher> 
     </autoCommit>

     <autoSoftCommit> 
       <maxTime>${solr.autoSoftCommit.maxTime:3000}</maxTime> 
     </autoSoftCommit>

  </updateHandler>

solr.autoSoftCommit.maxTime is set to 3000 (3 seconds)
solr.autoCommit.maxTime is set to 60000 (60 seconds)
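If it helps, the values Solr actually resolved for those properties can be checked 
with the Config API; this is a sketch, and the hostname and collection name below 
are placeholders, not from our setup:

$ curl "http://<solr-host>:8983/solr/<collection>/config/updateHandler"

The response should include the autoCommit and autoSoftCommit blocks with the 
maxTime values that are in effect on that node.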

-----Original Message-----
From: Webster Homer <webster.ho...@milliporesigma.com> 
Sent: Monday, September 09, 2019 4:17 PM
To: solr-user@lucene.apache.org
Subject: CDCR tlog corruption leads to infinite loop

We are running Solr 7.2.0

Our configuration has several collections loaded into a source SolrCloud, which is 
set to replicate via CDCR to 3 different target SolrClouds. All of our target 
collections have 2 shards with two replicas per shard. Our source collection has 
2 shards and 1 replica per shard.

Frequently we start to see errors where the target collections are out of date, 
and the cdcr action=errors endpoint shows large numbers of errors. For example:
{"responseHeader": {
"status": 0,
"QTime": 0},
"errors": [
"uc1f-ecom-mzk01:2181,uc1f-ecom-mzk02:2181,uc1f-ecom-mzk03:2181/solr",
["sial-catalog-product-20190824",
[
"consecutiveErrors",
700357,
"bad_request",
0,
"internal",
700357,
"last",
[
"2019-09-09T19:17:57.453Z",
"internal",
"2019-09-09T19:17:56.949Z",
"internal",
"2019-09-09T19:17:56.448Z"
,"internal",...

We have found that one or more tlogs have become corrupt. It appears that CDCR 
keeps trying to send data, but cannot read the data from the tlog, and so it 
retries forever.
How does this happen? It seems to be very frequent, on a weekly basis, and 
difficult to troubleshoot. Today we had it happen with one of our collections. 
Here is the listing for the tlog files:

$ ls -alht
total 604M
drwxr-xr-x 2 apache apache  44K Sep  9 14:27 .
-rw-r--r-- 1 apache apache 6.7M Sep  6 19:44 tlog.0000000000000000766.1643975309914013696
-rw-r--r-- 1 apache apache  35M Sep  6 19:43 tlog.0000000000000000765.1643975245907886080
-rw-r--r-- 1 apache apache  30M Sep  6 19:42 tlog.0000000000000000764.1643975182924120064
-rw-r--r-- 1 apache apache  37M Sep  6 19:41 tlog.0000000000000000763.1643975118316109824
-rw-r--r-- 1 apache apache  19M Sep  6 19:40 tlog.0000000000000000762.1643975053918863360
-rw-r--r-- 1 apache apache  21M Sep  6 19:39 tlog.0000000000000000761.1643974989726089216
-rw-r--r-- 1 apache apache  21M Sep  6 19:38 tlog.0000000000000000760.1643974926010417152
-rw-r--r-- 1 apache apache  29M Sep  6 19:37 tlog.0000000000000000759.1643974862567374848
-rw-r--r-- 1 apache apache 6.2M Sep  6 19:10 tlog.0000000000000000758.1643973174027616256
-rw-r--r-- 1 apache apache 228K Sep  5 19:48 tlog.0000000000000000757.1643885009483857920
-rw-r--r-- 1 apache apache  27M Sep  5 19:48 tlog.0000000000000000756.1643884946565103616
-rw-r--r-- 1 apache apache  35M Sep  5 19:47 tlog.0000000000000000755.1643884877912735744
-rw-r--r-- 1 apache apache  30M Sep  5 19:46 tlog.0000000000000000754.1643884812724862976
-rw-r--r-- 1 apache apache  25M Sep  5 19:45 tlog.0000000000000000753.1643884748976685056
-rw-r--r-- 1 apache apache  18M Sep  5 19:44 tlog.0000000000000000752.1643884685794738176
-rw-r--r-- 1 apache apache  21M Sep  5 19:43 tlog.0000000000000000751.1643884621330382848
-rw-r--r-- 1 apache apache  16M Sep  5 19:42 tlog.0000000000000000750.1643884558054064128
-rw-r--r-- 1 apache apache  26M Sep  5 19:41 tlog.0000000000000000749.1643884494725316608
-rw-r--r-- 1 apache apache 5.8M Sep  5 19:12 tlog.0000000000000000748.1643882681969147904
-rw-r--r-- 1 apache apache  31M Sep  4 19:56 tlog.0000000000000000747.1643794877229563904
-rw-r--r-- 1 apache apache  31M Sep  4 19:55 tlog.0000000000000000746.1643794813706829824
-rw-r--r-- 1 apache apache  30M Sep  4 19:54 tlog.0000000000000000745.1643794749615767552
-rw-r--r-- 1 apache apache  22M Sep  4 19:53 tlog.0000000000000000744.1643794686253465600
-rw-r--r-- 1 apache apache  18M Sep  4 19:52 tlog.0000000000000000743.1643794622319689728
-rw-r--r-- 1 apache apache  21M Sep  4 19:51 tlog.0000000000000000742.1643794558055612416
-rw-r--r-- 1 apache apache  15M Sep  4 19:50 tlog.0000000000000000741.1643794493330161664
-rw-r--r-- 1 apache apache  26M Sep  4 19:49 tlog.0000000000000000740.1643794428790308864
-rw-r--r-- 1 apache apache  11M Sep  4 14:58 tlog.0000000000000000737.1643701398824550400
drwxr-xr-x 5 apache apache   53 Aug 21 06:30 ..
[apache@dfw-pauth-msc01 tlog]$ ls -alht tlog.0000000000000000757.1643885009483857920
-rw-r--r-- 1 apache apache 228K Sep  5 19:48 tlog.0000000000000000757.1643885009483857920
$ date
Mon Sep  9 14:27:31 CDT 2019
$ pwd
/var/solr/data/sial-catalog-product-20190824_shard1_replica_n1/data/tlog

CDCR started replicating again after we deleted the oldest tlog file, 
tlog.0000000000000000737.1643701398824550400, and restarted CDCR.
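For anyone hitting the same loop, the manual recovery amounted to roughly the 
following sketch; STOP and START are the standard CDCR API actions, and the 
hostname is a placeholder for the source shard leader:

$ curl "http://<source-host>:8983/solr/sial-catalog-product-20190824/cdcr?action=STOP"
$ rm /var/solr/data/sial-catalog-product-20190824_shard1_replica_n1/data/tlog/tlog.0000000000000000737.1643701398824550400
$ curl "http://<source-host>:8983/solr/sial-catalog-product-20190824/cdcr?action=START"

Note that deleting a tlog discards any updates in it that CDCR had not yet 
forwarded, so those documents may need to be re-sent to the targets.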

About the same time I found a number of errors in the solr logs like this:
2019-09-04 19:58:01.393 ERROR 
(recoveryExecutor-162-thread-1-processing-n:dfw-pauth-msc01:8983_solr 
x:sial-catalog-product-20190824_shard1_replica_n1 s:shard1 
c:sial-catalog-product-20190824 r:core_node3) [c:sial-catalog-product-20190824 
s:shard1 r:core_node3 x:sial-catalog-product-20190824_shard1_replica_n1] 
o.a.s.u.UpdateLog java.lang.ClassCastException

This was the most common error at the time; I saw it for all of our collections:
2019-09-04 19:57:46.572 ERROR (qtp1355531311-20) 
[c:sial-catalog-product-20190824 s:shard1 r:core_node3 
x:sial-catalog-product-20190824_shard1_replica_n1] o.a.s.h.RequestHandlerBase 
org.apache.solr.common.SolrException: Action LASTPROCESSEDVERSION sent to 
non-leader replica

I would like to understand how this happens, and find better ways to diagnose 
it early.

This occurs frequently enough that we are thinking about developing an 
alternative to CDCR, as it does not seem to be very robust.

Any help would be appreciated.
