Adar, Andrew,
first of all, thanks for looking into this. I just checked the Kudu memory usage on that TS, and it is currently at 104GB used out of a total of 128GB (about 81% used), so it definitely seems high to me.
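The ./parse-maintenance-manager helper used below isn't shown in this thread; as a very rough stand-in, something along these lines flattens each row of the dashboard's HTML table into one whitespace-separated line (the real script evidently also normalizes the sizes to raw byte counts):

# rough sketch only: split the page on </tr>, strip the remaining tags, and
# keep the rows describing flush operations (GNU sed assumed for the \n)
curl -k -s -S https://localhost:8050/maintenance-manager \
  | tr -d '\n' \
  | sed -e 's#</tr>#\n#g' -e 's/<[^>]*>/ /g' \
  | grep -E 'FlushMRSOp|FlushDeltaMemStoresOp'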
As per the maintenance manager dashboard, I wrote a quick script that parses its output, and this is what I see:

- top 10 tablets by RAM anchored:

curl -k -s -S https://localhost:8050/maintenance-manager | ./parse-maintenance-manager | sort -k3 -nr | head -10
FlushMRSOp(ee2abaeb126a44d0b1565c7fdd8e40da) True 10171187 0 0.0
FlushMRSOp(21d9fa855898438ab0845173fafe6b8c) True 3082813 0 0.0
FlushMRSOp(097c728dc2034407aa6f9bf71e80ee97) True 2181038 0 0.0
FlushMRSOp(de5432259dd44e3ba67e9d3f3eea0327) True 2160066 0 0.564864
FlushMRSOp(1e0ea408fc1b4ad5be0ecc9d3d5e2d1d) True 2118123 0 0.0
FlushMRSOp(8a1e4cd273ba4d22937ab25a91668ea2) True 2034237 0 0.0994112
FlushMRSOp(7b49a6c733e94b968aeb228b0cf5c0cc) True 2034237 0 0.0497825
FlushMRSOp(7bac650115f74d66a1c87ee91dc3f751) True 1688207 0 0.0414489
FlushMRSOp(a5f13d9e94aa49e7933d9f84e62ee4e3) True 1656750 0 0.441552
FlushMRSOp(897f9e2db1954523ac7781f9c45645f0) True 1562378 0 0.0845236

- top 10 tablets by log retained:

curl -k -s -S https://localhost:8050/maintenance-manager | ./parse-maintenance-manager | sort -k4 -nr | head -10
FlushDeltaMemStoresOp(b3facf00fcff403293d36c1032811e6e) True 1740 33039035924 1.0
FlushDeltaMemStoresOp(354088f7954047908b4e68e0627836b8) True 763 32416265666 1.0
FlushDeltaMemStoresOp(c5369eb17772432bbe326abf902c8055) True 763 32137092792 1.0
FlushDeltaMemStoresOp(4540cba1b331429a8cbacf08acf2a321) True 763 32137092792 1.0
FlushDeltaMemStoresOp(cb302d0772a64a5795913196bdf43ed3) True 1740 32094143119 1.0
FlushDeltaMemStoresOp(4779c42e5ef6493ba4d51dc44c97f7f7) True 763 31965294100 1.0
FlushDeltaMemStoresOp(90a0465b7b4f4ed7a3c5e43d993cf52e) True 1740 31825707663 1.0
FlushDeltaMemStoresOp(f318cbf0cdcc4e10979fc1366027dde5) True 1740 31664646389 1.0
FlushDeltaMemStoresOp(a4ac51eefc59467abf45938529080c17) True 763 30805652930 1.0
FlushDeltaMemStoresOp(b00cb81794594d9b8366980a24bf79ad) True 763 30676803911 1.0

Again, tablet b3facf00fcff403293d36c1032811e6e is at the top of the list for log retained, with about 30.77GB; at first glance, though, none of the tablets retaining a large amount of WALs appears in the RAM anchored list.

Finally, since we have moved into April and many of those tables are range partitioned on a date column, either by quarter or by month, I expect that almost all of the inserts and updates have now moved to the previously empty tablets for the month of April (or Q2), so I am pretty sure that the tablets whose WALs are so large have now stopped growing.

We are in the process of moving one of those large replicas from last month/quarter to a different TS to try to balance things a little; I'll let you know how it goes. Yesterday we also contacted our sales engineer at the vendor to see if we can get a patch for KUDU-3002.

Franco

> On March 31, 2020 at 6:36 PM Andrew Wong <aw...@cloudera.com> wrote:
>
> The maintenance manager dashboard of the tablet server should have some
> information to determine what's going on. If it shows that there are DMS
> flush operations that are anchoring a significant chunk of WALs, that would
> explain where some of the WAL disk usage is going. If the "Memory (detail)"
> page shows that the server is using a significant fraction of its memory
> limit, then together these two symptoms point towards this being KUDU-3002.
>
> With or without a patch, it's worth checking whether there's anything
> that can be done about the memory pressure.
> If there's tablet skew across the cluster (some servers with many replicas,
> some with few), consider rebalancing to reduce the load on the overloaded
> servers. At the very least, looking at the maintenance manager dashboard
> should give you an idea of which tablet replicas can be moved away from the
> tablet server (e.g. move the replicas with more WALs retained by DMS flush
> ops, preferably to a tablet server that has less memory being consumed, per
> its "Memory (detail)" page).
>
> On Tue, Mar 31, 2020 at 2:03 PM Adar Lieber-Dembo <a...@cloudera.com> wrote:
> >
> > Definitely seems like KUDU-3002 could apply here. Andrew, are you
> > aware of a straightforward way to determine whether a given table (or
> > cluster) would benefit from the fix for KUDU-3002?
> >
> > One possibility: if your vendor is in the business of providing patch
> > releases, they could supply the fix for KUDU-3002 as a patch for your
> > release of Kudu. Looking at the fix, it's not super invasive and should
> > be pretty easy to backport into an older Kudu release. Have you talked
> > to them about that?
> >
> > On Tue, Mar 31, 2020 at 5:58 AM Franco VENTURI <fvent...@comcast.net> wrote:
> > >
> > > Adar, Andrew,
> > > thanks for your detailed and prompt replies.
> > >
> > > "Fortunately" (for your questions) we have another TS whose WALs disk
> > > is currently about 80% full (and three more whose WALs disks are above
> > > 50%), and I suspect that it will be the next one we'll have to restart
> > > in a few nights.
> > >
> > > On this TS the output from 'lsof' this morning shows 175,550 files
> > > open, of which 84,876 are Kudu data files and 90,340 are Kudu WALs.
> > >
> > > For this server these are the top 10 tablets by WAL size (in kB):
> > >
> > > du -sk * | sort -nr | head -10
> > > 31400552 b3facf00fcff403293d36c1032811e6e
> > > 31204488 354088f7954047908b4e68e0627836b8
> > > 30584928 90a0465b7b4f4ed7a3c5e43d993cf52e
> > > 30536168 c5369eb17772432bbe326abf902c8055
> > > 30535900 4540cba1b331429a8cbacf08acf2a321
> > > 30503820 cb302d0772a64a5795913196bdf43ed3
> > > 30428040 f318cbf0cdcc4e10979fc1366027dde5
> > > 30379552 4779c42e5ef6493ba4d51dc44c97f7f7
> > > 29671692 a4ac51eefc59467abf45938529080c17
> > > 29539940 b00cb81794594d9b8366980a24bf79ad
> > >
> > > and these are the top 10 tablets by number of WAL segments:
> > >
> > > for t in *; do echo "$(ls $t | grep -c '^wal-') $t"; done | sort -nr | head -10
> > > 3813 b3facf00fcff403293d36c1032811e6e
> > > 3784 354088f7954047908b4e68e0627836b8
> > > 3716 90a0465b7b4f4ed7a3c5e43d993cf52e
> > > 3705 c5369eb17772432bbe326abf902c8055
> > > 3705 4540cba1b331429a8cbacf08acf2a321
> > > 3700 cb302d0772a64a5795913196bdf43ed3
> > > 3698 f318cbf0cdcc4e10979fc1366027dde5
> > > 3685 4779c42e5ef6493ba4d51dc44c97f7f7
> > > 3600 a4ac51eefc59467abf45938529080c17
> > > 3585 b00cb81794594d9b8366980a24bf79ad
> > >
> > > As you can see, the largest tablets by WAL size are also the largest
> > > ones by number of WAL segments.
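For reference, a breakdown like the 84,876 data files vs. 90,340 WALs above can be approximated by classifying the lsof output by file name; this is only a sketch — it assumes a single kudu-tserver process and relies on the 'wal-' segment and '.data'/'.metadata' naming patterns, so adjust to your layout:

# may need to run as root to see all of the process's file descriptors
lsof -p "$(pgrep -x kudu-tserver)" | awk '
    $NF ~ /\/wal-[0-9]+$/      { wals++ }
    $NF ~ /\.(data|metadata)$/ { data++ }
    END { print wals+0 " WAL segments, " data+0 " data files" }'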
> > > Taking a more detailed look at the largest of these tablets
> > > (b3facf00fcff403293d36c1032811e6e), these are the TSs that host a
> > > replica of that tablet, from the output of the command 'kudu table list':
> > >
> > > T b3facf00fcff403293d36c1032811e6e
> > > L e4a4195a39df41f0b04887fdcae399d8 ts07:7050
> > > V 147fcef6fb49437aa19f7a95fb26c091 ts11:7050
> > > V 59fe260f21da48059ff5683c364070ce ts31:7050
> > >
> > > where ts07 (the leader) is the TS whose WALs disk is about 80% full.
> > >
> > > I looked at the 'ops_behind_leader' metric for that tablet on the other
> > > two TSs (ts11 and ts31) by querying their metrics, and they are both 0.
> > >
> > > As for memory pressure, the leader (ts07) shows the following metrics:
> > >
> > > leader_memory_pressure_rejections: 22397
> > > transaction_memory_pressure_rejections: 0
> > > follower_memory_pressure_rejections: 0
> > >
> > > Finally, a couple of non-technical comments about KUDU-3002
> > > (https://issues.apache.org/jira/browse/KUDU-3002):
> > >
> > > - I can see it has been fixed in Kudu 1.12.0; however, we (like
> > > probably most other enterprise customers) depend on a vendor
> > > distribution, so it won't really be available to us until the vendor
> > > packages it (I think the current version of Kudu in their runtime is
> > > 1.11.0, so I guess 1.12.0 may only be a month or two away).
> > >
> > > - The other major problem we have is that vendor distributions like the
> > > one we are using bundle a couple dozen products together, so if we want
> > > to upgrade Kudu to the latest available version, we also have to
> > > upgrade everything else: HDFS (a major upgrade from 2.6 to 3.x), Kafka
> > > (major upgrade), HBase (major upgrade), etc. In many cases these
> > > upgrades also bring significant changes and deprecations in other
> > > components, like Parquet, which means we have to change (and in some
> > > cases rewrite) our code that uses Parquet or Kafka, since these
> > > products are evolving rapidly, and often in ways that break
> > > compatibility with older versions. In other words, it's a big mess.
> > >
> > > I apologize for the rant; I understand that it is not your fault or
> > > Kudu's, and I don't know if there's an easy solution to this conundrum
> > > within the constraints of a vendor-supported approach, but for us it
> > > makes zero-maintenance cloud solutions attractive, at the cost of
> > > sacrificing the flexibility and "customizability" of an in-house
> > > solution.
> > >
> > > Franco
> > >
> > > On March 30, 2020 at 2:22 PM Andrew Wong <aw...@cloudera.com> wrote:
> > >
> > > > Alternatively, if the servers in question are under constant memory
> > > > pressure and receive a fair number of updates, they may be
> > > > prioritizing flushing of inserted rows at the expense of updates,
> > > > causing the tablets to retain a great number of WAL segments
> > > > (containing older updates) for durability's sake.
> > >
> > > Just an FYI in case it helps confirm or rule it out: this refers to
> > > KUDU-3002, which will be fixed in the upcoming release. Can you
> > > determine whether your tablet servers are under memory pressure?
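For anyone reproducing the checks above: the ops_behind_leader and memory-pressure rejection counters can be pulled straight from a tablet server's metrics endpoint. A minimal sketch, reusing the web UI port from the curl commands earlier in the thread (the metrics= substring filter may not be available on very old releases, in which case drop it and grep the full output):

# ts11 is one of the follower hosts mentioned above; repeat for ts31 and ts07
curl -k -s -S 'https://ts11:8050/metrics?metrics=ops_behind_leader,memory_pressure_rejections'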
> > > On Mon, Mar 30, 2020 at 11:17 AM Adar Lieber-Dembo <a...@cloudera.com> wrote:
> > >
> > > > - the number of open files in the Kudu process in the tablet servers
> > > > has increased to now more than 150,000 (as counted using 'lsof'); we
> > > > raised the limit on the maximum number of open files twice already to
> > > > avoid a crash, but we (and our vendor) are concerned that something
> > > > might not be right with such a high number of open files.
> > >
> > > Using lsof, can you figure out which files are open? WAL segments?
> > > Data files? Something else? Given the high WAL usage, I'm guessing
> > > it's the former and these are actually one and the same problem, but
> > > it would be good to confirm nonetheless.
> > >
> > > > - in some of the tablet servers the disk space used by the WALs is
> > > > significantly (and concerningly) higher than in most of the other
> > > > tablet servers; we use a 1TB SSD drive (about 950GB usable) to store
> > > > the WALs on each tablet server, and this week was the second time we
> > > > saw a tablet server almost fill the whole WAL disk. We had to stop
> > > > and restart the tablet server, so its tablets would be migrated to
> > > > different TSs and we could manually clean up the WALs directory, but
> > > > this is definitely not something we would like to do in the future.
> > > > We took a look inside the WAL directory on that TS before wiping it,
> > > > and we observed that there were a few tablets whose WALs were in
> > > > excess of 30GB. Another piece of information is that the table that
> > > > the largest of these tablets belongs to receives about 15M
> > > > transactions a day, of which about 25% are new inserts and the rest
> > > > are updates of existing rows.
> > >
> > > Sounds like there are at least several tablets with follower replicas
> > > that have fallen behind their leaders and are trying to catch up. In
> > > these situations, a leader will preserve as many WAL segments as
> > > necessary in order to catch up the lagging follower replica, at least
> > > until some threshold is reached (at which point the master will bring
> > > a new replica online and the lagging replica will be evicted). These
> > > calculations are done in terms of the number of WAL segments; in the
> > > affected tablets, do you recall how many WAL segment files there were
> > > before you deleted the directories?
> > >
> > > Alternatively, if the servers in question are under constant memory
> > > pressure and receive a fair number of updates, they may be
> > > prioritizing flushing of inserted rows at the expense of updates,
> > > causing the tablets to retain a great number of WAL segments
> > > (containing older updates) for durability's sake. If you recall the
> > > affected tablet IDs, do your logs indicate the nature of the
> > > background operations performed for those tablets?
> > >
> > > Some of these questions can also be answered via Kudu metrics. There's
> > > the ops_behind_leader tablet-level metric, which can tell you how far
> > > behind a replica may be. Unfortunately I can't find a metric for the
> > > average number of WAL segments retained (or a histogram); I thought we
> > > had that, but maybe not.
> > >
> > > --
> > > Andrew Wong
>
> --
> Andrew Wong
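On Adar's question about what the background maintenance operations were doing for the affected tablets: a rough way to check is to grep the tablet server's INFO log for the tablet id together with the op names that show up on the maintenance manager dashboard. The log path below is only an example — use wherever your kudu-tserver logs actually live:

# shows recent maintenance activity for the largest of the affected tablets
grep -h b3facf00fcff403293d36c1032811e6e /var/log/kudu/kudu-tserver.*INFO* \
  | grep -E 'FlushMRSOp|FlushDeltaMemStoresOp' | tail -20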