[ https://issues.apache.org/jira/browse/KUDU-3406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617244#comment-17617244 ]
Ashwani Raina commented on KUDU-3406:
-------------------------------------

I did some experiments using the real data. In my case, memory usage sometimes peaked at as much as 38 GB, mostly while DiskRowSetCompaction was running. While kudu-master was running, I also collected stack traces at a frequency of 10 per second. In most of them, MaintenanceMgr is in DoMergeCompactionOrFlush() roughly around the time of peak memory usage. There are other stack traces around the peak time as well. Here are some of the traces:
+++
stacks.103:TID 2987608(MaintenanceMgr ):
stacks.103- @ 0x55d0f0aa5337 kudu::tablet::DeltaPreparer<>::CollectMutations()
stacks.103- @ 0x55d0f0a97439 kudu::tablet::DeltaFileIterator<>::CollectMutations()
stacks.103- @ 0x55d0f0ad2f74 kudu::tablet::DeltaIteratorMerger::CollectMutations()
stacks.103- @ 0x55d0f0a20f6b kudu::tablet::(anonymous namespace)::DiskRowSetCompactionInput::PrepareBlock()
stacks.103- @ 0x55d0f0a1c9fd kudu::tablet::(anonymous namespace)::MergeCompactionInput::ProcessEmptyInputs()
stacks.103- @ 0x55d0f0a1cf75 kudu::tablet::(anonymous namespace)::MergeCompactionInput::FinishBlock()
stacks.103- @ 0x55d0f0a216cf kudu::tablet::FlushCompactionInput()
stacks.103- @ 0x55d0f09d347a kudu::tablet::Tablet::DoMergeCompactionOrFlush()
---
stacks.130:TID 2987608(MaintenanceMgr ):
stacks.130- @ 0x55d0f0a21057 kudu::tablet::(anonymous namespace)::DiskRowSetCompactionInput::PrepareBlock()
stacks.130- @ 0x55d0f0a1c9fd kudu::tablet::(anonymous namespace)::MergeCompactionInput::ProcessEmptyInputs()
stacks.130- @ 0x55d0f0a1cf75 kudu::tablet::(anonymous namespace)::MergeCompactionInput::FinishBlock()
stacks.130- @ 0x55d0f0a216cf kudu::tablet::FlushCompactionInput()
stacks.130- @ 0x55d0f09d347a kudu::tablet::Tablet::DoMergeCompactionOrFlush()
stacks.130- @ 0x55d0f09d7435 kudu::tablet::Tablet::Compact()
stacks.130- @ 0x55d0f09efe3d kudu::tablet::CompactRowSetsOp::Perform()
stacks.130- @ 0x55d0f2b410cd kudu::MaintenanceManager::LaunchOp()
stacks.130- @ 0x55d0f2ba91f0 kudu::ThreadPool::DispatchThread()
---
stacks.137:TID 2987608(MaintenanceMgr ):
stacks.137- @ 0x55d0f0a1e110 kudu::tablet::RemoveAncientUndos()
stacks.137- @ 0x55d0f0a21580 kudu::tablet::FlushCompactionInput()
stacks.137- @ 0x55d0f09d347a kudu::tablet::Tablet::DoMergeCompactionOrFlush()
stacks.137- @ 0x55d0f09d7435 kudu::tablet::Tablet::Compact()
stacks.137- @ 0x55d0f09efe3d kudu::tablet::CompactRowSetsOp::Perform()
stacks.137- @ 0x55d0f2b410cd kudu::MaintenanceManager::LaunchOp()
+++
A simplified sketch of why this prepare/merge path can buffer so much data at once is included after the next steps below.

Next steps:
# Look into the MemRowSet flush code and the compaction policy.
# Try to generate a similar scenario with locally generated data: a mix of table create/drop and frequent updates from tablet servers to the master.
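For context on why this path can be so memory-hungry, below is a minimal, hypothetical C++ sketch of the buffering pattern suggested by the traces above: a merge-compaction input prepares a block by materializing every mutation recorded against every row in the block before anything is flushed, so a rowset with a deep UNDO history keeps that entire history resident at once. The types and functions here (Mutation, DeltaSource, CompactionInputRow, PrepareBlock, CollectMutations) are illustrative stand-ins, not the actual Kudu implementation.
{noformat}
// Hypothetical sketch (not Kudu source): shows why preparing one block of
// merge-compaction input can hold every historical mutation of every row
// in that block in memory at once. All names are illustrative only.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

struct Mutation {
  int64_t timestamp = 0;
  std::vector<uint8_t> changelist;  // encoded column updates
};

// Stand-in for a delta file iterator: it can enumerate all mutations
// recorded against a given row.
class DeltaSource {
 public:
  explicit DeltaSource(std::vector<std::vector<Mutation>> per_row)
      : per_row_(std::move(per_row)) {}

  // Analogue of CollectMutations(): appends *every* mutation for the row,
  // however many there are, into the caller's buffer.
  void CollectMutations(size_t row_idx, std::vector<Mutation>* out) const {
    const auto& muts = per_row_[row_idx];
    out->insert(out->end(), muts.begin(), muts.end());
  }

 private:
  std::vector<std::vector<Mutation>> per_row_;
};

struct CompactionInputRow {
  std::vector<Mutation> undo_history;  // full UNDO chain, materialized
};

// Analogue of PrepareBlock(): materializes a block of rows together with all
// of their deltas before the merge consumes them. Nothing in this loop is
// gated by the process-wide memory limit; the footprint scales with the total
// number of mutations in the block, not with the block's row count.
std::vector<CompactionInputRow> PrepareBlock(const DeltaSource& deltas,
                                             size_t first_row,
                                             size_t num_rows) {
  std::vector<CompactionInputRow> block(num_rows);
  for (size_t i = 0; i < num_rows; ++i) {
    deltas.CollectMutations(first_row + i, &block[i].undo_history);
  }
  return block;
}

int main() {
  // Two rows: one with a single mutation, one with a deep update history.
  std::vector<std::vector<Mutation>> per_row(2);
  per_row[0].resize(1);
  per_row[1].resize(100000);  // a heavily-updated row dominates memory usage

  DeltaSource deltas(std::move(per_row));
  auto block = PrepareBlock(deltas, 0, 2);
  std::cout << "row 0 mutations: " << block[0].undo_history.size() << "\n"
            << "row 1 mutations: " << block[1].undo_history.size() << "\n";
  return 0;
}
{noformat}
If something like this is happening here, the peak footprint would scale with the total number of mutations in the prepared block rather than with the number of rows, which would be consistent with a multi-GB peak for a compaction that ultimately wrote only 4043 rows.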
> CompactRowSetsOp can allocate much more memory than specified by the hard memory limit
> ---------------------------------------------------------------------------------------
>
>                 Key: KUDU-3406
>                 URL: https://issues.apache.org/jira/browse/KUDU-3406
>             Project: Kudu
>          Issue Type: Bug
>          Components: master, tserver
>    Affects Versions: 1.13.0, 1.14.0, 1.15.0, 1.16.0
>            Reporter: Alexey Serbin
>            Assignee: Ashwani Raina
>            Priority: Critical
>              Labels: compaction, stability
>         Attachments: 270.svg, 283.svg, 296.svg, 308.svg, 332.svg, 344.svg, fs_list.before
>
> In some scenarios, rowsets can accumulate a lot of data, so {{kudu-master}} and {{kudu-tserver}} processes grow far beyond the hard memory limit (controlled by the {{--memory_limit_hard_bytes}} flag) when running CompactRowSetsOp. In some cases, a Kudu server process consumes all the available memory, so the OS might invoke the OOM killer.
>
> At this point I'm not yet sure about the exact versions affected, or what leads to accumulating so much data in flushed rowsets, but I know that 1.13, 1.14, 1.15, and 1.16 are affected. It's also not clear whether the actual regression is in allowing the flushed rowsets to grow that big.
>
> There is a reproduction scenario for this bug with {{kudu-master}} using real data from the field. With that data, {{kudu fs list}} reveals a rowset with many UNDOs: see the attached {{fs_list.before}} file. When starting {{kudu-master}} with that data, the process memory usage eventually peaked at about 25 GB of RSS while running CompactRowSetsOp, and then the RSS eventually subsided to about 200 MB once the CompactRowSetsOp completed.
>
> I also attached several SVG files generated by TCMalloc's pprof from the memory profile snapshots output by {{kudu-master}} when configured to dump allocation stats every 512 MB. I generated the SVG reports for the profiles attributed to the highest memory usage:
> {noformat}
> Dumping heap profile to /opt/tmp/master/nn1/profile.0270.heap (24573 MB currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0283.heap (64594 MB allocated cumulatively, 13221 MB currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0296.heap (77908 MB allocated cumulatively, 12110 MB currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0308.heap (90197 MB allocated cumulatively, 12406 MB currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0332.heap (114775 MB allocated cumulatively, 23884 MB currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0344.heap (127064 MB allocated cumulatively, 12648 MB currently in use)
> {noformat}
>
> The report from the compaction doesn't look like anything extraordinary (except for the duration):
> {noformat}
> I20221012 10:45:49.684247 101750 maintenance_manager.cc:603] P 68dbea0ec022440d9fc282099a8656cb: CompactRowSetsOp(00000000000000000000000000000000) complete. Timing: real 522.617s user 471.783s sys 46.588s Metrics: {"bytes_written":1665145,"cfile_cache_hit":846,"cfile_cache_hit_bytes":14723646,"cfile_cache_miss":1786556,"cfile_cache_miss_bytes":4065589152,"cfile_init":7,"delta_iterators_relevant":1558,"dirs.queue_time_us":220086,"dirs.run_cpu_time_us":89219,"dirs.run_wall_time_us":89163,"drs_written":1,"fdatasync":15,"fdatasync_us":150709,"lbm_read_time_us":11120726,"lbm_reads_1-10_ms":1,"lbm_reads_lt_1ms":1786583,"lbm_write_time_us":14120016,"lbm_writes_1-10_ms":3,"lbm_writes_lt_1ms":894069,"mutex_wait_us":108,"num_input_rowsets":5,"rows_written":4043,"spinlock_wait_cycles":14720,"thread_start_us":741,"threads_started":9,"wal-append.queue_time_us":307}
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)