[ https://issues.apache.org/jira/browse/KUDU-3406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-3406:
--------------------------------
    Description: 
In some scenarios, rowsets can accumulate a lot of data, so the memory usage of 
{{kudu-master}} and {{kudu-tserver}} processes grows far beyond the hard memory 
limit (controlled by the {{\-\-memory_limit_hard_bytes}} flag) while running 
CompactRowSetsOp.  In some cases, a Kudu server process consumes all the 
available memory, and the OS might invoke the OOM killer.
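
For reference, the hard limit in question is the one set on the daemon's 
command line; a minimal launch sketch (the directories and the limit value 
below are placeholders, not the actual setup):
{noformat}
# illustrative only: start kudu-master with a 4 GiB hard memory limit
kudu-master \
  --fs_wal_dir=/path/to/wal \
  --fs_data_dirs=/path/to/data \
  --memory_limit_hard_bytes=4294967296
{noformat}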

At this point I'm not yet sure about the exact set of affected versions or what 
leads to accumulating so much data in flushed rowsets, but I know that 1.13, 
1.14, 1.15 and 1.16 are affected.  It's also not clear whether the actual 
regression is in allowing the flushed rowsets to grow that big.

There is a reproduction scenario for this bug with {{kudu-master}} using real 
data from the field.  With that data, {{kudu fs list}} reveals a rowset with 
many UNDOs: see the attached {{fs_list.before}} file.  When starting 
{{kudu-master}} with that data, the process memory usage peaked at about 
25 GByte of RSS while running CompactRowSetsOp, and then the RSS eventually 
subsided to about 200 MByte once the CompactRowSetsOp completed.
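
The UNDO-heavy rowset can be spotted with something like the following sketch 
(the filesystem paths are placeholders, and the exact column names accepted by 
{{\-\-columns}} may differ between versions):
{noformat}
# list the blocks of the master tablet; UNDO delta blocks show up in the
# block-kind column (paths are placeholders)
kudu fs list \
  --fs_wal_dir=/path/to/master/wal \
  --fs_data_dirs=/path/to/master/data \
  --columns=tablet-id,rowset-id,block-kind,block-id
{noformat}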

I also attached several SVG files generated by TCMalloc's pprof from the memory 
profile snapshots output by {{kudu-master}} when configured to dump allocation 
stats every 512 MBytes.  I generated the SVG reports for the profiles with the 
highest memory usage (see the sketch after the list below for how the reports 
were produced):
{noformat}
Dumping heap profile to /opt/tmp/master/nn1/profile.0270.heap (24573 MB 
currently in use)
Dumping heap profile to /opt/tmp/master/nn1/profile.0283.heap (64594 MB 
allocated cumulatively, 13221 MB currently in use)
Dumping heap profile to /opt/tmp/master/nn1/profile.0296.heap (77908 MB 
allocated cumulatively, 12110 MB currently in use)
Dumping heap profile to /opt/tmp/master/nn1/profile.0308.heap (90197 MB 
allocated cumulatively, 12406 MB currently in use)
Dumping heap profile to /opt/tmp/master/nn1/profile.0332.heap (114775 MB 
allocated cumulatively, 23884 MB currently in use)
Dumping heap profile to /opt/tmp/master/nn1/profile.0344.heap (127064 MB 
allocated cumulatively, 12648 MB currently in use)
{noformat}
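
The SVG reports were produced with the standard gperftools workflow, roughly as 
sketched below; the environment variables and binary path are assumptions based 
on the profiler output above, and the actual run may have used Kudu-specific 
flags instead:
{noformat}
# assumed setup: ask the gperftools heap profiler to dump a snapshot every
# 512 MiB of cumulative allocations, using the prefix seen in the log above
export HEAPPROFILE=/opt/tmp/master/nn1/profile
export HEAP_PROFILE_ALLOCATION_INTERVAL=536870912
kudu-master --fs_wal_dir=... --fs_data_dirs=... &

# render an SVG call graph from one of the dumped snapshots
pprof --svg /path/to/kudu-master /opt/tmp/master/nn1/profile.0270.heap > 270.svg
{noformat}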

The report from the compaction doesn't show anything extraordinary (except for 
the duration):
{noformat}
I20221012 10:45:49.684247 101750 maintenance_manager.cc:603] P 
68dbea0ec022440d9fc282099a8656cb: 
CompactRowSetsOp(00000000000000000000000000000000) complete. Timing: real 
522.617s     user 471.783s   sys 46.588s Metrics: 
{"bytes_written":1665145,"cfile_cache_hit":846,"cfile_cache_hit_bytes":14723646,"cfile_cache_miss":1786556,"cfile_cache_miss_bytes":4065589152,"cfile_init":7,"delta_iterators_relevant":1558,"dirs.queue_time_us":220086,"dirs.run_cpu_time_us":89219,"dirs.run_wall_time_us":89163,"drs_written":1,"fdatasync":15,"fdatasync_us":150709,"lbm_read_time_us":11120726,"lbm_reads_1-10_ms":1,"lbm_reads_lt_1ms":1786583,"lbm_write_time_us":14120016,"lbm_writes_1-10_ms":3,"lbm_writes_lt_1ms":894069,"mutex_wait_us":108,"num_input_rowsets":5,"rows_written":4043,"spinlock_wait_cycles":14720,"thread_start_us":741,"threads_started":9,"wal-append.queue_time_us":307}
{noformat}

  was:
In some scenarios, rowsets can accumulate a lot of data, so the memory usage of 
{{kudu-master}} and {{kudu-tserver}} processes grows far beyond the hard memory 
limit (controlled by the {{\-\-memory_limit_hard_bytes}} flag) while running 
CompactRowSetsOp.  In some cases, a Kudu server process consumes all the 
available memory, and the OS might invoke the OOM killer.

At this point I'm not yet sure about the exact set of affected versions or what 
leads to accumulating so much data in flushed rowsets, but I know that 1.13, 
1.14, 1.15 and 1.16 are affected.  It's also not clear whether the actual 
regression is in allowing the flushed rowsets to grow that big.

There is a reproduction scenario for this bug with {{kudu-master}} using real 
data from the field.  With that data, {{kudu fs list}} reveals a rowset with 
many UNDOs: see the attached {{fs_list.before}} file.  When starting 
{{kudu-master}} with that data, the process memory usage peaked at about 
25 GByte of RSS while running CompactRowSetsOp.

I also attached several SVG files generated by TCMalloc's pprof from the memory 
profile snapshots output by {{kudu-master}} when configured to dump allocation 
stats every 512 MBytes.  I generated the SVG reports for the profiles with the 
highest memory usage:
{noformat}
Dumping heap profile to /opt/tmp/master/nn1/profile.0270.heap (24573 MB 
currently in use)
Dumping heap profile to /opt/tmp/master/nn1/profile.0283.heap (64594 MB 
allocated cumulatively, 13221 MB currently in use)
Dumping heap profile to /opt/tmp/master/nn1/profile.0296.heap (77908 MB 
allocated cumulatively, 12110 MB currently in use)
Dumping heap profile to /opt/tmp/master/nn1/profile.0308.heap (90197 MB 
allocated cumulatively, 12406 MB currently in use)
Dumping heap profile to /opt/tmp/master/nn1/profile.0332.heap (114775 MB 
allocated cumulatively, 23884 MB currently in use)
Dumping heap profile to /opt/tmp/master/nn1/profile.0344.heap (127064 MB 
allocated cumulatively, 12648 MB currently in use)
{noformat}

The report from the compaction doesn't show anything extraordinary (except for 
the duration):
{noformat}
I20221012 10:45:49.684247 101750 maintenance_manager.cc:603] P 
68dbea0ec022440d9fc282099a8656cb: 
CompactRowSetsOp(00000000000000000000000000000000) complete. Timing: real 
522.617s     user 471.783s   sys 46.588s Metrics: 
{"bytes_written":1665145,"cfile_cache_hit":846,"cfile_cache_hit_bytes":14723646,"cfile_cache_miss":1786556,"cfile_cache_miss_bytes":4065589152,"cfile_init":7,"delta_iterators_relevant":1558,"dirs.queue_time_us":220086,"dirs.run_cpu_time_us":89219,"dirs.run_wall_time_us":89163,"drs_written":1,"fdatasync":15,"fdatasync_us":150709,"lbm_read_time_us":11120726,"lbm_reads_1-10_ms":1,"lbm_reads_lt_1ms":1786583,"lbm_write_time_us":14120016,"lbm_writes_1-10_ms":3,"lbm_writes_lt_1ms":894069,"mutex_wait_us":108,"num_input_rowsets":5,"rows_written":4043,"spinlock_wait_cycles":14720,"thread_start_us":741,"threads_started":9,"wal-append.queue_time_us":307}
{noformat}


> CompactRowSetsOp can allocate much more memory than specified by the hard 
> memory limit
> --------------------------------------------------------------------------------------
>
>                 Key: KUDU-3406
>                 URL: https://issues.apache.org/jira/browse/KUDU-3406
>             Project: Kudu
>          Issue Type: Bug
>          Components: master, tserver
>    Affects Versions: 1.13.0, 1.14.0, 1.15.0, 1.16.0
>            Reporter: Alexey Serbin
>            Assignee: Ashwani Raina
>            Priority: Critical
>              Labels: compaction, stability
>         Attachments: 270.svg, 283.svg, 296.svg, 308.svg, 332.svg, 344.svg, 
> fs_list.before
>
>
> In some scenarios, rowsets can accumulate a lot of data, so the memory usage 
> of {{kudu-master}} and {{kudu-tserver}} processes grows far beyond the hard 
> memory limit (controlled by the {{\-\-memory_limit_hard_bytes}} flag) while 
> running CompactRowSetsOp.  In some cases, a Kudu server process consumes all 
> the available memory, and the OS might invoke the OOM killer.
> At this point I'm not yet sure about the exact set of affected versions or 
> what leads to accumulating so much data in flushed rowsets, but I know that 
> 1.13, 1.14, 1.15 and 1.16 are affected.  It's also not clear whether the 
> actual regression is in allowing the flushed rowsets to grow that big.
> There is a reproduction scenario for this bug with {{kudu-master}} using real 
> data from the field.  With that data, {{kudu fs list}} reveals a rowset with 
> many UNDOs: see the attached {{fs_list.before}} file.  When starting 
> {{kudu-master}} with that data, the process memory usage peaked at about 
> 25 GByte of RSS while running CompactRowSetsOp, and then the RSS eventually 
> subsided to about 200 MByte once the CompactRowSetsOp completed.
> I also attached several SVG files generated by TCMalloc's pprof from the 
> memory profile snapshots output by {{kudu-master}} when configured to dump 
> allocation stats every 512 MBytes.  I generated the SVG reports for the 
> profiles with the highest memory usage:
> {noformat}
> Dumping heap profile to /opt/tmp/master/nn1/profile.0270.heap (24573 MB 
> currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0283.heap (64594 MB 
> allocated cumulatively, 13221 MB currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0296.heap (77908 MB 
> allocated cumulatively, 12110 MB currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0308.heap (90197 MB 
> allocated cumulatively, 12406 MB currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0332.heap (114775 MB 
> allocated cumulatively, 23884 MB currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0344.heap (127064 MB 
> allocated cumulatively, 12648 MB currently in use)
> {noformat}
> The report from the compaction doesn't show anything extraordinary (except 
> for the duration):
> {noformat}
> I20221012 10:45:49.684247 101750 maintenance_manager.cc:603] P 
> 68dbea0ec022440d9fc282099a8656cb: 
> CompactRowSetsOp(00000000000000000000000000000000) complete. Timing: real 
> 522.617s     user 471.783s   sys 46.588s Metrics: 
> {"bytes_written":1665145,"cfile_cache_hit":846,"cfile_cache_hit_bytes":14723646,"cfile_cache_miss":1786556,"cfile_cache_miss_bytes":4065589152,"cfile_init":7,"delta_iterators_relevant":1558,"dirs.queue_time_us":220086,"dirs.run_cpu_time_us":89219,"dirs.run_wall_time_us":89163,"drs_written":1,"fdatasync":15,"fdatasync_us":150709,"lbm_read_time_us":11120726,"lbm_reads_1-10_ms":1,"lbm_reads_lt_1ms":1786583,"lbm_write_time_us":14120016,"lbm_writes_1-10_ms":3,"lbm_writes_lt_1ms":894069,"mutex_wait_us":108,"num_input_rowsets":5,"rows_written":4043,"spinlock_wait_cycles":14720,"thread_start_us":741,"threads_started":9,"wal-append.queue_time_us":307}
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
