Definitely seems like KUDU-3002 could apply here. Andrew, are you aware of a straightforward way to determine whether a given table (or cluster) would benefit from the fix for KUDU-3002?
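For what it's worth, one rough heuristic would be to look for tablet servers
that report memory-pressure rejections and that are also retaining an unusually
large number of WAL segments for some tablets, since that combination matches
the failure mode KUDU-3002 addresses. A minimal sketch, assuming the default
tablet server web UI port (8050), jq on the path, and your own list of hosts in
place of the ones below:

# Sketch only: pull the memory-pressure rejection counters (the metric names
# are the ones quoted further down in this thread) from each tablet server.
for ts in ts07 ts11 ts31; do
  echo "== $ts =="
  curl -s "http://$ts:8050/metrics?metrics=memory_pressure_rejections" |
    jq '.[] | select((.metrics | length) > 0)
            | {entity: .type, id: .id,
               metrics: [.metrics[] | {name, value}]}'
done

Consistently non-zero counters on a server whose WAL directories also hold
thousands of wal-* files per tablet (see the listings below) would suggest the
fix is worth having.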
One possibility is that, if your vendor is in the business of providing patch
releases, they could supply the fix for KUDU-3002 as a patch for your release
of Kudu. Looking at the fix, it's not super invasive and should be pretty easy
to backport into an older Kudu release. Have you talked to them about that?

On Tue, Mar 31, 2020 at 5:58 AM Franco VENTURI <fvent...@comcast.net> wrote:
>
> Adar, Andrew,
> thanks for your detailed and prompt replies.
>
> "Fortunately" (for your questions) we have another TS whose WALs disk is
> currently about 80% full (and three more whose WALs disk is above 50%), and
> I suspect that it will be the next one we'll have to restart in a few nights.
>
> On this TS the output from 'lsof' this morning shows 175,550 files open, of
> which 84,876 are Kudu data files and 90,340 are Kudu WALs.
>
> For this server these are the top 10 tablets by WAL size (in kB):
>
> du -sk * | sort -nr | head -10
> 31400552 b3facf00fcff403293d36c1032811e6e
> 31204488 354088f7954047908b4e68e0627836b8
> 30584928 90a0465b7b4f4ed7a3c5e43d993cf52e
> 30536168 c5369eb17772432bbe326abf902c8055
> 30535900 4540cba1b331429a8cbacf08acf2a321
> 30503820 cb302d0772a64a5795913196bdf43ed3
> 30428040 f318cbf0cdcc4e10979fc1366027dde5
> 30379552 4779c42e5ef6493ba4d51dc44c97f7f7
> 29671692 a4ac51eefc59467abf45938529080c17
> 29539940 b00cb81794594d9b8366980a24bf79ad
>
> and these are the top 10 tablets by number of WAL segments:
>
> for t in *; do echo "$(ls $t | grep -c '^wal-') $t"; done | sort -nr | head -10
> 3813 b3facf00fcff403293d36c1032811e6e
> 3784 354088f7954047908b4e68e0627836b8
> 3716 90a0465b7b4f4ed7a3c5e43d993cf52e
> 3705 c5369eb17772432bbe326abf902c8055
> 3705 4540cba1b331429a8cbacf08acf2a321
> 3700 cb302d0772a64a5795913196bdf43ed3
> 3698 f318cbf0cdcc4e10979fc1366027dde5
> 3685 4779c42e5ef6493ba4d51dc44c97f7f7
> 3600 a4ac51eefc59467abf45938529080c17
> 3585 b00cb81794594d9b8366980a24bf79ad
>
> As you can see, the largest tablets by WAL size are also the largest ones by
> number of WAL segments.
>
> Taking a more detailed look at the largest of these tablets
> (b3facf00fcff403293d36c1032811e6e), these are the TS's that host a replica
> of that tablet, from the output of the command 'kudu table list':
>
> T b3facf00fcff403293d36c1032811e6e
> L e4a4195a39df41f0b04887fdcae399d8 ts07:7050
> V 147fcef6fb49437aa19f7a95fb26c091 ts11:7050
> V 59fe260f21da48059ff5683c364070ce ts31:7050
>
> where ts07 (the leader) is the TS whose WALs disk is about 80% full.
>
> I looked at the 'ops_behind_leader' metric for that tablet on the other two
> TS's (ts11 and ts31) by querying their metrics, and they are both 0.
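A sketch of that kind of metrics lookup, assuming the default tablet server
web UI port (8050) and jq; the tablet ID and the two follower hosts are the
ones from the listing above:

# Sketch: read ops_behind_leader for one tablet from each follower's
# /metrics endpoint.
TABLET=b3facf00fcff403293d36c1032811e6e
for ts in ts11 ts31; do
  echo -n "$ts: "
  curl -s "http://$ts:8050/metrics?metrics=ops_behind_leader" |
    jq --arg id "$TABLET" \
       '[.[] | select(.type == "tablet" and .id == $id)
             | .metrics[] | select(.name == "ops_behind_leader") | .value]
        | first'
done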
>
> As for the memory pressure, the leader (ts07) shows the following metrics:
>
> leader_memory_pressure_rejections: 22397
> transaction_memory_pressure_rejections: 0
> follower_memory_pressure_rejections: 0
>
> Finally, a couple of non-technical comments about KUDU-3002
> (https://issues.apache.org/jira/browse/KUDU-3002):
>
> - I can see it has been fixed in Kudu 1.12.0; however, we (like probably
> most other enterprise customers) depend on a vendor distribution, so it
> won't really be available to us until the vendor packages it (I think the
> current version of Kudu in their runtime is 1.11.0, so I guess 1.12.0 could
> only be a month or two away).
>
> - The other major problem we have is that vendor distributions like the one
> we are using bundle a couple of dozen products together, so if we want to
> upgrade Kudu to the latest available version, we also have to upgrade
> everything else, like HDFS (a major upgrade from 2.6 to 3.x), Kafka (major
> upgrade), HBase (major upgrade), etc., and in many cases these upgrades also
> bring significant changes/deprecations in other components, like Parquet,
> which means we have to change (and in some cases rewrite) our code that uses
> Parquet or Kafka, since these products are rapidly evolving, often in ways
> that break compatibility with old versions - in other words, it's a big mess.
>
> I apologize for the final rant; I understand that it is not your or Kudu's
> fault, and I don't know if there's an easy solution to this conundrum within
> the constraints of a vendor-supported approach, but for us it makes
> zero-maintenance cloud solutions attractive, at the cost of sacrificing the
> flexibility and "customizability" of an in-house solution.
>
> Franco
>
> On March 30, 2020 at 2:22 PM Andrew Wong <aw...@cloudera.com> wrote:
>
> > Alternatively, if the servers in question are under constant memory
> > pressure and receive a fair number of updates, they may be
> > prioritizing flushing of inserted rows at the expense of updates,
> > causing the tablets to retain a great number of WAL segments
> > (containing older updates) for durability's sake.
>
> Just an FYI in case it helps confirm or rule it out, this refers to
> KUDU-3002, which will be fixed in the upcoming release. Can you determine
> whether your tablet servers are under memory pressure?
>
> On Mon, Mar 30, 2020 at 11:17 AM Adar Lieber-Dembo <a...@cloudera.com> wrote:
>
> > - the number of open files in the Kudu process in the tablet servers has
> > increased to now more than 150,000 (as counted using 'lsof'); we raised
> > the limit on the maximum number of open files twice already to avoid a
> > crash, but we (and our vendor) are concerned that something might not be
> > right with such a high number of open files.
>
> Using lsof, can you figure out which files are open? WAL segments?
> Data files? Something else? Given the high WAL usage, I'm guessing
> it's the former and these are actually one and the same problem, but
> it would be good to confirm nonetheless.
>
> > - in some of the tablet servers the disk space used by the WALs is
> > significantly (and concerningly) higher than in most of the other tablet
> > servers; we use a 1TB SSD drive (about 950GB usable) to store the WALs
> > on each tablet server, and this week was the second time we saw a tablet
> > server almost fill the whole WAL disk.
> > We had to stop and restart the tablet server, so its tablets would be
> > migrated to different TS's, and we could manually clean up the WALs
> > directory, but this is definitely not something we would like to do in
> > the future. We took a look inside the WAL directory on that TS before
> > wiping it, and we observed that there were a few tablets whose WALs were
> > in excess of 30GB. Another piece of information is that the table that
> > the largest of these tablets belongs to receives about 15M transactions
> > a day, of which about 25% are new inserts and the rest are updates of
> > existing rows.
>
> Sounds like there are at least several tablets with follower replicas
> that have fallen behind their leaders and are trying to catch up. In
> these situations, a leader will preserve as many WAL segments as
> necessary in order to catch up the lagging follower replica, at least
> until some threshold is reached (at which point the master will bring
> a new replica online and the lagging replica will be evicted). These
> calculations are done in terms of the number of WAL segments; in the
> affected tablets, do you recall how many WAL segment files there were
> before you deleted the directories?
>
> Alternatively, if the servers in question are under constant memory
> pressure and receive a fair number of updates, they may be
> prioritizing flushing of inserted rows at the expense of updates,
> causing the tablets to retain a great number of WAL segments
> (containing older updates) for durability's sake. If you recall the
> affected tablet IDs, do your logs indicate the nature of the
> background operations performed for those tablets?
>
> Some of these questions can also be answered via Kudu metrics. There's
> the ops_behind_leader tablet-level metric, which can tell you how far
> behind a replica may be. Unfortunately I can't find a metric for the
> average number of WAL segments retained (or a histogram); I thought we
> had that, but maybe not.
>
> --
> Andrew Wong
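A couple of closing notes on the measurement side. Since there doesn't seem to
be a built-in metric for retained WAL segments, counting wal-* files on the WAL
disk (as in the for-loop earlier in the thread) is probably the closest
stand-in. And for the open-files question, a rough breakdown by type can be
pulled from lsof; a sketch, assuming the stock kudu-tserver binary name and
with '/data/' and '/wals/' standing in for whatever --fs_data_dirs and
--fs_wal_dir point at in your deployment:

# Sketch: classify the tablet server's open files by path fragment.
PID=$(pgrep -f kudu-tserver | head -1)
lsof -p "$PID" -Fn | sed -n 's/^n//p' > /tmp/kudu_open_files.txt
echo "total open files: $(wc -l < /tmp/kudu_open_files.txt)"
echo "data files:       $(grep -c '/data/' /tmp/kudu_open_files.txt)"
echo "WAL files:        $(grep -c '/wals/' /tmp/kudu_open_files.txt)"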