Adar, Andrew,
first of all, thanks for looking into this. I just checked the Kudu memory usage on that TS, and it is currently at 104GB used out of a total of 128GB (about 81% used), so it definitely seems high to me.
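The ./parse-maintenance-manager helper used below isn't shown in this thread; as a very rough stand-in, something along these lines flattens each row of the dashboard's HTML table into one whitespace-separated line (the real script evidently also normalizes the sizes to raw byte counts):

# rough sketch only: split the page on </tr>, strip the remaining tags, and
# keep the rows describing flush operations (GNU sed assumed for the \n)
curl -k -s -S https://localhost:8050/maintenance-manager \
  | tr -d '\n' \
  | sed -e 's#</tr>#\n#g' -e 's/<[^>]*>/ /g' \
  | grep -E 'FlushMRSOp|FlushDeltaMemStoresOp'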
As per the maintenance manager dashboard, I wrote a quick script that parses its output, and this is what I see:

- top 10 tablets by RAM anchored:

curl -k -s -S https://localhost:8050/maintenance-manager | ./parse-maintenance-manager | sort -k3 -nr | head -10
FlushMRSOp(ee2abaeb126a44d0b1565c7fdd8e40da) True 10171187 0 0.0
FlushMRSOp(21d9fa855898438ab0845173fafe6b8c) True 3082813 0 0.0
FlushMRSOp(097c728dc2034407aa6f9bf71e80ee97) True 2181038 0 0.0
FlushMRSOp(de5432259dd44e3ba67e9d3f3eea0327) True 2160066 0 0.564864
FlushMRSOp(1e0ea408fc1b4ad5be0ecc9d3d5e2d1d) True 2118123 0 0.0
FlushMRSOp(8a1e4cd273ba4d22937ab25a91668ea2) True 2034237 0 0.0994112
FlushMRSOp(7b49a6c733e94b968aeb228b0cf5c0cc) True 2034237 0 0.0497825
FlushMRSOp(7bac650115f74d66a1c87ee91dc3f751) True 1688207 0 0.0414489
FlushMRSOp(a5f13d9e94aa49e7933d9f84e62ee4e3) True 1656750 0 0.441552
FlushMRSOp(897f9e2db1954523ac7781f9c45645f0) True 1562378 0 0.0845236

- top 10 tablets by log retained:

curl -k -s -S https://localhost:8050/maintenance-manager | ./parse-maintenance-manager | sort -k4 -nr | head -10
FlushDeltaMemStoresOp(b3facf00fcff403293d36c1032811e6e) True 1740 33039035924 1.0
FlushDeltaMemStoresOp(354088f7954047908b4e68e0627836b8) True 763 32416265666 1.0
FlushDeltaMemStoresOp(c5369eb17772432bbe326abf902c8055) True 763 32137092792 1.0
FlushDeltaMemStoresOp(4540cba1b331429a8cbacf08acf2a321) True 763 32137092792 1.0
FlushDeltaMemStoresOp(cb302d0772a64a5795913196bdf43ed3) True 1740 32094143119 1.0
FlushDeltaMemStoresOp(4779c42e5ef6493ba4d51dc44c97f7f7) True 763 31965294100 1.0
FlushDeltaMemStoresOp(90a0465b7b4f4ed7a3c5e43d993cf52e) True 1740 31825707663 1.0
FlushDeltaMemStoresOp(f318cbf0cdcc4e10979fc1366027dde5) True 1740 31664646389 1.0
FlushDeltaMemStoresOp(a4ac51eefc59467abf45938529080c17) True 763 30805652930 1.0
FlushDeltaMemStoresOp(b00cb81794594d9b8366980a24bf79ad) True 763 30676803911 1.0

Again, tablet b3facf00fcff403293d36c1032811e6e is at the top of the list for log retained, with about 30.77GB; at first glance, though, none of the tablets retaining a large amount of WALs appears in the RAM anchored list.

Finally, since we have moved into April and many of those tables are range partitioned on a date column, either by quarter or by month, I expect that almost all of the inserts and updates have now moved to the previously empty tablets for the month of April (or Q2), so I am pretty sure that the tablets whose WALs are so large have now stopped growing.

We are in the process of moving one of those large replicas from last month/quarter to a different TS to try to balance things a little; I'll let you know how it goes. Yesterday we also contacted our sales engineer at the vendor to see if we can get a patch for KUDU-3002.

Franco

> On March 31, 2020 at 6:36 PM Andrew Wong <aw...@cloudera.com> wrote:
>
> The maintenance manager dashboard of the tablet server should have some
> information to determine what's going on. If it shows that there are DMS
> flush operations that are anchoring a significant chunk of WALs, that would
> explain where some of the WAL disk usage is going. If the "Memory (detail)"
> page shows that the server is using a significant fraction of its memory
> limit, then together these two symptoms point towards this being KUDU-3002.
>
> With or without a patch, it's worth checking whether there's anything
> that can be done about the memory pressure.
> If there's tablet skew across the cluster (some servers with many replicas,
> some with few), consider rebalancing to reduce the load on the overloaded
> servers. At the very least, looking at the maintenance manager dashboard
> should give you an idea of which tablet replicas can be moved away from the
> tablet server (e.g. move the replicas with more WALs retained by DMS flush
> ops, preferably to a tablet server that has less memory being consumed, per
> its "Memory (detail)" page).
>
> On Tue, Mar 31, 2020 at 2:03 PM Adar Lieber-Dembo <a...@cloudera.com> wrote:
> >
> > Definitely seems like KUDU-3002 could apply here. Andrew, are you
> > aware of a straightforward way to determine whether a given table (or
> > cluster) would benefit from the fix for KUDU-3002?
> >
> > One possibility: if your vendor is in the business of providing patch
> > releases, they could supply the fix for KUDU-3002 as a patch for your
> > release of Kudu. Looking at the fix, it's not super invasive and should
> > be pretty easy to backport into an older Kudu release. Have you talked
> > to them about that?
> >
> > On Tue, Mar 31, 2020 at 5:58 AM Franco VENTURI <fvent...@comcast.net> wrote:
> > >
> > > Adar, Andrew,
> > > thanks for your detailed and prompt replies.
> > >
> > > "Fortunately" (for your questions) we have another TS whose WALs disk
> > > is currently about 80% full (and three more whose WALs disks are above
> > > 50%), and I suspect that it will be the next one we'll have to restart
> > > in a few nights.
> > >
> > > On this TS the output from 'lsof' this morning shows 175,550 files
> > > open, of which 84,876 are Kudu data files and 90,340 are Kudu WALs.
> > >
> > > For this server these are the top 10 tablets by WAL size (in kB):
> > >
> > > du -sk * | sort -nr | head -10
> > > 31400552 b3facf00fcff403293d36c1032811e6e
> > > 31204488 354088f7954047908b4e68e0627836b8
> > > 30584928 90a0465b7b4f4ed7a3c5e43d993cf52e
> > > 30536168 c5369eb17772432bbe326abf902c8055
> > > 30535900 4540cba1b331429a8cbacf08acf2a321
> > > 30503820 cb302d0772a64a5795913196bdf43ed3
> > > 30428040 f318cbf0cdcc4e10979fc1366027dde5
> > > 30379552 4779c42e5ef6493ba4d51dc44c97f7f7
> > > 29671692 a4ac51eefc59467abf45938529080c17
> > > 29539940 b00cb81794594d9b8366980a24bf79ad
> > >
> > > and these are the top 10 tablets by number of WAL segments:
> > >
> > > for t in *; do echo "$(ls $t | grep -c '^wal-') $t"; done | sort -nr | head -10
> > > 3813 b3facf00fcff403293d36c1032811e6e
> > > 3784 354088f7954047908b4e68e0627836b8
> > > 3716 90a0465b7b4f4ed7a3c5e43d993cf52e
> > > 3705 c5369eb17772432bbe326abf902c8055
> > > 3705 4540cba1b331429a8cbacf08acf2a321
> > > 3700 cb302d0772a64a5795913196bdf43ed3
> > > 3698 f318cbf0cdcc4e10979fc1366027dde5
> > > 3685 4779c42e5ef6493ba4d51dc44c97f7f7
> > > 3600 a4ac51eefc59467abf45938529080c17
> > > 3585 b00cb81794594d9b8366980a24bf79ad
> > >
> > > As you can see, the largest tablets by WAL size are also the largest
> > > ones by number of WAL segments.
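For reference, a breakdown like the 84,876 data files vs. 90,340 WALs above can be approximated by classifying the lsof output by file name; this is only a sketch — it assumes a single kudu-tserver process and relies on the 'wal-' segment and '.data'/'.metadata' naming patterns, so adjust to your layout:

# may need to run as root to see all of the process's file descriptors
lsof -p "$(pgrep -x kudu-tserver)" | awk '
    $NF ~ /\/wal-[0-9]+$/      { wals++ }
    $NF ~ /\.(data|metadata)$/ { data++ }
    END { print wals+0 " WAL segments, " data+0 " data files" }'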
> > > Taking a more detailed look at the largest of these tablets
> > > (b3facf00fcff403293d36c1032811e6e), these are the TSs that host a
> > > replica of that tablet, from the output of the command 'kudu table list':
> > >
> > > T b3facf00fcff403293d36c1032811e6e
> > > L e4a4195a39df41f0b04887fdcae399d8 ts07:7050
> > > V 147fcef6fb49437aa19f7a95fb26c091 ts11:7050
> > > V 59fe260f21da48059ff5683c364070ce ts31:7050
> > >
> > > where ts07 (the leader) is the TS whose WALs disk is about 80% full.
> > >
> > > I looked at the 'ops_behind_leader' metric for that tablet on the other
> > > two TSs (ts11 and ts31) by querying their metrics, and they are both 0.
> > >
> > > As for memory pressure, the leader (ts07) shows the following metrics:
> > >
> > > leader_memory_pressure_rejections: 22397
> > > transaction_memory_pressure_rejections: 0
> > > follower_memory_pressure_rejections: 0
> > >
> > > Finally, a couple of non-technical comments about KUDU-3002
> > > (https://issues.apache.org/jira/browse/KUDU-3002):
> > >
> > > - I can see it has been fixed in Kudu 1.12.0; however, we (like
> > > probably most other enterprise customers) depend on a vendor
> > > distribution, so it won't really be available to us until the vendor
> > > packages it (I think the current version of Kudu in their runtime is
> > > 1.11.0, so I guess 1.12.0 may only be a month or two away).
> > >
> > > - The other major problem we have is that vendor distributions like the
> > > one we are using bundle a couple dozen products together, so if we want
> > > to upgrade Kudu to the latest available version, we also have to
> > > upgrade everything else: HDFS (a major upgrade from 2.6 to 3.x), Kafka
> > > (major upgrade), HBase (major upgrade), etc. In many cases these
> > > upgrades also bring significant changes and deprecations in other
> > > components, like Parquet, which means we have to change (and in some
> > > cases rewrite) our code that uses Parquet or Kafka, since these
> > > products are evolving rapidly, and often in ways that break
> > > compatibility with older versions. In other words, it's a big mess.
> > >
> > > I apologize for the rant; I understand that it is not your fault or
> > > Kudu's, and I don't know if there's an easy solution to this conundrum
> > > within the constraints of a vendor-supported approach, but for us it
> > > makes zero-maintenance cloud solutions attractive, at the cost of
> > > sacrificing the flexibility and "customizability" of an in-house
> > > solution.
> > >
> > > Franco
> > >
> > > On March 30, 2020 at 2:22 PM Andrew Wong <aw...@cloudera.com> wrote:
> > >
> > > > Alternatively, if the servers in question are under constant memory
> > > > pressure and receive a fair number of updates, they may be
> > > > prioritizing flushing of inserted rows at the expense of updates,
> > > > causing the tablets to retain a great number of WAL segments
> > > > (containing older updates) for durability's sake.
> > >
> > > Just an FYI in case it helps confirm or rule it out: this refers to
> > > KUDU-3002, which will be fixed in the upcoming release. Can you
> > > determine whether your tablet servers are under memory pressure?
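For anyone reproducing the checks above: the ops_behind_leader and memory-pressure rejection counters can be pulled straight from a tablet server's metrics endpoint. A minimal sketch, reusing the web UI port from the curl commands earlier in the thread (the metrics= substring filter may not be available on very old releases, in which case drop it and grep the full output):

# ts11 is one of the follower hosts mentioned above; repeat for ts31 and ts07
curl -k -s -S 'https://ts11:8050/metrics?metrics=ops_behind_leader,memory_pressure_rejections'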
> > > On Mon, Mar 30, 2020 at 11:17 AM Adar Lieber-Dembo <a...@cloudera.com> wrote:
> > >
> > > > - the number of open files in the Kudu process in the tablet servers
> > > > has increased to now more than 150,000 (as counted using 'lsof'); we
> > > > raised the limit on the maximum number of open files twice already to
> > > > avoid a crash, but we (and our vendor) are concerned that something
> > > > might not be right with such a high number of open files.
> > >
> > > Using lsof, can you figure out which files are open? WAL segments?
> > > Data files? Something else? Given the high WAL usage, I'm guessing
> > > it's the former and these are actually one and the same problem, but
> > > it would be good to confirm nonetheless.
> > >
> > > > - in some of the tablet servers the disk space used by the WALs is
> > > > significantly (and concerningly) higher than in most of the other
> > > > tablet servers; we use a 1TB SSD drive (about 950GB usable) to store
> > > > the WALs on each tablet server, and this week was the second time we
> > > > saw a tablet server almost fill the whole WAL disk. We had to stop
> > > > and restart the tablet server, so its tablets would be migrated to
> > > > different TSs and we could manually clean up the WALs directory, but
> > > > this is definitely not something we would like to do in the future.
> > > > We took a look inside the WAL directory on that TS before wiping it,
> > > > and we observed that there were a few tablets whose WALs were in
> > > > excess of 30GB. Another piece of information is that the table that
> > > > the largest of these tablets belongs to receives about 15M
> > > > transactions a day, of which about 25% are new inserts and the rest
> > > > are updates of existing rows.
> > >
> > > Sounds like there are at least several tablets with follower replicas
> > > that have fallen behind their leaders and are trying to catch up. In
> > > these situations, a leader will preserve as many WAL segments as
> > > necessary in order to catch up the lagging follower replica, at least
> > > until some threshold is reached (at which point the master will bring
> > > a new replica online and the lagging replica will be evicted). These
> > > calculations are done in terms of the number of WAL segments; in the
> > > affected tablets, do you recall how many WAL segment files there were
> > > before you deleted the directories?
> > >
> > > Alternatively, if the servers in question are under constant memory
> > > pressure and receive a fair number of updates, they may be
> > > prioritizing flushing of inserted rows at the expense of updates,
> > > causing the tablets to retain a great number of WAL segments
> > > (containing older updates) for durability's sake. If you recall the
> > > affected tablet IDs, do your logs indicate the nature of the
> > > background operations performed for those tablets?
> > >
> > > Some of these questions can also be answered via Kudu metrics. There's
> > > the ops_behind_leader tablet-level metric, which can tell you how far
> > > behind a replica may be. Unfortunately I can't find a metric for the
> > > average number of WAL segments retained (or a histogram); I thought we
> > > had that, but maybe not.
> > >
> > > --
> > > Andrew Wong
>
> --
> Andrew Wong
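On Adar's question about what the background maintenance operations were doing for the affected tablets: a rough way to check is to grep the tablet server's INFO log for the tablet id together with the op names that show up on the maintenance manager dashboard. The log path below is only an example — use wherever your kudu-tserver logs actually live:

# shows recent maintenance activity for the largest of the affected tablets
grep -h b3facf00fcff403293d36c1032811e6e /var/log/kudu/kudu-tserver.*INFO* \
  | grep -E 'FlushMRSOp|FlushDeltaMemStoresOp' | tail -20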