Re:Re: Re: Why RowSet size is much smaller than flush_threshold_mb

Quanlong Huang Wed, 01 Aug 2018 16:52:47 -0700

In my experience, when performance falls below my expectations, I tune the 
flags listed in 
https://kudu.apache.org/docs/configuration_reference.html , which requires a 
clear understanding of Kudu internals. Maybe we can add the link there?



At 2018-08-02 01:06:40, "Todd Lipcon" <t...@cloudera.com> wrote:

On Wed, Aug 1, 2018 at 6:28 AM, Quanlong Huang <huang_quanl...@126.com> wrote:

Hi Todd and William,


I really appreciate your help, and I'm sorry for my late reply. I was going to 
reply with some follow-up questions but was assigned to some other work... Now 
I'm back on this.


The design docs are really helpful. Now I understand flushes and compactions. 
I think we can add a link to these design docs in the Kudu documentation page, 
so users who want to dig deeper can learn more about Kudu internals.


Personally, since starting the project, I have had the philosophy that the 
user-facing documentation should remain simple and not discuss internals too 
much. I found in some other open source projects that there isn't a clear 
difference between user documentation and developer documentation, and users 
can easily get confused by all of the internal details. Or, users may start to 
believe that Kudu is very complex and they need to understand knapsack problem 
approximation algorithms in order to operate it. So, normally we try to avoid 
exposing too much of the details.


That said, I think it is a good idea to add a small note in the documentation 
somewhere that links to the design docs, maybe with some sentence explaining 
that understanding internals is not necessary to operate Kudu, but that expert 
users may find the internal design useful as a reference? I would be curious to 
hear what other users think about how best to make this trade-off.


-Todd
 
At 2018-06-15 23:41:17, "Todd Lipcon" <t...@cloudera.com> wrote:

Also, keep in mind that when the MRS flushes, it flushes into a bunch of 
separate RowSets, not 1:1. It "rolls" to a new RowSet every N MB (N=32 by 
default). This is set by --budgeted_compaction_target_rowset_size


However, increasing this size isn't likely to decrease the number of 
compactions, because each of these 32MB rowsets is non-overlapping. In other 
words, if your MRS contains rows A-Z, the output RowSets will include [A-C], 
[D-G], [H-P], [Q-Z]. Since these ranges do not overlap, they will never need to 
be compacted with each other. The net result, here, is that compaction becomes 
more fine-grained and only needs to operate on sub-ranges of the tablet where 
there is a lot of overlap.
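
To make the rolling behaviour concrete, here is a minimal Python sketch. It is 
not Kudu's code: roll_into_rowsets, key_ranges, and the 8 MB row size are 
hypothetical, and only the 32 MB target comes from the text above. It splits 
rows that are already sorted by key into consecutive chunks, so the resulting 
key ranges cannot overlap.

```python
from typing import List, Tuple

Row = Tuple[str, int]  # (primary key, approximate size in bytes)


def roll_into_rowsets(sorted_rows: List[Row], target_bytes: int) -> List[List[Row]]:
    """Split rows that are already sorted by key into consecutive chunks.

    A new chunk is started once the current one has reached target_bytes,
    loosely mimicking how a flush "rolls" to a new RowSet.
    """
    rowsets: List[List[Row]] = []
    current: List[Row] = []
    current_bytes = 0
    for key, size in sorted_rows:
        current.append((key, size))
        current_bytes += size
        if current_bytes >= target_bytes:
            rowsets.append(current)
            current, current_bytes = [], 0
    if current:
        rowsets.append(current)
    return rowsets


def key_ranges(rowsets: List[List[Row]]) -> List[Tuple[str, str]]:
    """Return the (first key, last key) pair of every chunk."""
    return [(rs[0][0], rs[-1][0]) for rs in rowsets]


MB = 1024 * 1024
# Hypothetical MRS: 26 rows keyed A..Z, 8 MB each, flushed with a 32 MB target.
mrs = [(chr(ord("A") + i), 8 * MB) for i in range(26)]
ranges = key_ranges(roll_into_rowsets(mrs, 32 * MB))
print(ranges)
```

Because the input is sorted, every chunk ends before the next one begins, which 
is why these RowSets never need to be compacted with each other.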


You can read more about this in docs/design-docs/compaction-policy.md, in 
particular the section "Limiting RowSet Sizes"


Hope that helps
-Todd


On Fri, Jun 15, 2018 at 8:26 AM, William Berkeley <wdberke...@gmail.com> wrote:

The op seen in the logs is a rowset compaction, which takes existing 
diskrowsets and rewrites them. It's not a flush, which writes data in memory to 
disk, so I don't think the flush_threshold_mb is relevant. Rowset compaction is 
done to reduce the amount of overlap of rowsets in primary key space, i.e. 
reduce the number of rowsets that might need to be checked to enforce the 
primary key constraint or find a row. Having lots of rowset compaction 
indicates that rows are being written in a somewhat random order w.r.t. the 
primary key order. Kudu will perform much better as writes scale when rows are 
inserted roughly in increasing order per tablet.
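
The effect of overlap can be illustrated with a small Python sketch. The key 
ranges below are made up for illustration and rowsets_to_check is a 
hypothetical helper, not a Kudu function. It counts how many rowsets have a key 
range that could contain a given primary key, which is the number that would 
have to be consulted for that key.

```python
from typing import List, Tuple

KeyRange = Tuple[int, int]  # inclusive (min_key, max_key) of one rowset


def rowsets_to_check(rowsets: List[KeyRange], key: int) -> int:
    """Count rowsets whose key range could contain `key`."""
    return sum(1 for lo, hi in rowsets if lo <= key <= hi)


# Hypothetical tablet written in increasing key order: disjoint ranges.
ordered = [(0, 99), (100, 199), (200, 299), (300, 399)]
# Hypothetical tablet written in random key order: each rowset spans
# most of the key space.
scattered = [(3, 390), (0, 397), (8, 399), (1, 385)]

print(rowsets_to_check(ordered, 150))    # 1
print(rowsets_to_check(scattered, 150))  # 4
```

In the scattered case every lookup touches all four rowsets, which is the kind 
of overlap that rowset compaction works to reduce.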


Also, because you are using the log block manager (the default and only one 
suitable for production deployments), there isn't a 1-1 relationship between 
cfiles or diskrowsets and files on the filesystem. Many cfiles and diskrowsets 
will be put together in a container file.


Config parameters that might be relevant here:
--maintenance_manager_num_threads
--fs_data_dirs (how many)
--fs_wal_dir (is it shared on a device with the data dir?)
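
One rough way to answer the last question is to compare device IDs with 
os.stat. The Python sketch below is only an illustration: dirs_sharing_device 
is a hypothetical helper, and the temporary directories stand in for the real 
--fs_wal_dir and --fs_data_dirs values. Note that st_dev identifies the mounted 
filesystem, so on setups with LVM, RAID, or several partitions on one disk it 
may not match the physical device.

```python
import os
import tempfile
from typing import List


def dirs_sharing_device(wal_dir: str, data_dirs: List[str]) -> List[str]:
    """Return the data dirs that report the same st_dev as the WAL dir."""
    wal_dev = os.stat(wal_dir).st_dev
    return [d for d in data_dirs if os.stat(d).st_dev == wal_dev]


# Stand-in directories so the sketch runs anywhere; on a real tablet server
# these would be the values passed to --fs_wal_dir and --fs_data_dirs.
with tempfile.TemporaryDirectory() as root:
    wal = os.path.join(root, "wal")
    data = [os.path.join(root, "data1"), os.path.join(root, "data2")]
    for path in [wal, *data]:
        os.makedirs(path)
    shared = dirs_sharing_device(wal, data)
    print(f"{len(shared)} of {len(data)} data dirs share a device with the WAL dir")
```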


The metrics from the compact row sets op indicate that the time is spent in 
fdatasync and in reading (likely reading the original rowsets). The overall 
compaction time is kinda long but not crazy long. What's the performance you 
are seeing and what is the performance you would like to see?
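
The breakdown can be tallied from the numbers in the quoted log. The Python 
sketch below copies four counters and the "real" time from that log; 
share_of_wall_time is a hypothetical helper, and the percentages assume the 
counters are not summed across concurrently running threads, which the log 
does not state.

```python
import json

# Subset of the CompactRowSetsOp metrics quoted in the original post.
metrics = json.loads(
    '{"fdatasync": 315, "fdatasync_us": 9617174, '
    '"lbm_read_time_us": 1288971, "lbm_write_time_us": 122520}'
)
real_seconds = 12.628  # "real" time reported for the same op


def share_of_wall_time(micros: int, wall_seconds: float) -> float:
    """Fraction of the op's wall-clock time taken by a microsecond counter."""
    return micros / 1e6 / wall_seconds


for name in ("fdatasync_us", "lbm_read_time_us", "lbm_write_time_us"):
    seconds = metrics[name] / 1e6
    share = share_of_wall_time(metrics[name], real_seconds)
    print(f"{name}: {seconds:.2f}s ({share:.0%} of wall time)")

mean_sync_ms = metrics["fdatasync_us"] / metrics["fdatasync"] / 1000
print(f"mean fdatasync: {mean_sync_ms:.1f} ms")
```

Under that assumption, fdatasync accounts for roughly three quarters of the 
12.6 seconds, at about 30 ms per call, and block reads for about a tenth.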


-Will


On Fri, Jun 15, 2018 at 7:52 AM, Quanlong Huang <huang_quanl...@126.com> wrote:

Hi all,


I'm running Kudu 1.6.0-cdh5.14.2. Looking into the tablet server logs, I find 
that most of the compactions operate on small files (~40MB each). For 
example:


I0615 07:22:42.637351 30614 tablet.cc:1661] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: stage 1 complete, picked 4 rowsets to compact
I0615 07:22:42.637385 30614 compaction.cc:903] Selected 4 rowsets to compact:
I0615 07:22:42.637393 30614 compaction.cc:906] RowSet(343)(current size on disk: ~40666600 bytes)
I0615 07:22:42.637401 30614 compaction.cc:906] RowSet(1563)(current size on disk: ~34720852 bytes)
I0615 07:22:42.637408 30614 compaction.cc:906] RowSet(1645)(current size on disk: ~29914833 bytes)
I0615 07:22:42.637415 30614 compaction.cc:906] RowSet(1870)(current size on disk: ~29007249 bytes)
I0615 07:22:42.637428 30614 tablet.cc:1447] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: entering phase 1 (flushing snapshot). Phase 1 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T in {6263071556616208384})}]
I0615 07:22:42.641582 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:43.875396 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:44.418421 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:45.114389 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:54.762563 30614 tablet.cc:1532] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: entering phase 2 (starting to duplicate updates in new rowsets)
I0615 07:22:54.773572 30614 tablet.cc:1587] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction Phase 2: carrying over any updates which arrived during Phase 1
I0615 07:22:54.773599 30614 tablet.cc:1589] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Phase 2 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T in {6263071556616208384})}]
I0615 07:22:55.189757 30614 tablet.cc:1631] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction successful on 82987 rows (123387929 bytes)
I0615 07:22:55.191426 30614 maintenance_manager.cc:491] Time spent running CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba): real 12.628s user 1.460s sys 0.410s
I0615 07:22:55.191484 30614 maintenance_manager.cc:497] P 70f3e54fe0f3490cbf0371a6830a33a7: CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba) metrics: {"cfile_cache_hit":812,"cfile_cache_hit_bytes":16840376,"cfile_cache_miss":2730,"cfile_cache_miss_bytes":251298442,"cfile_init":496,"data dirs.queue_time_us":6646,"data dirs.run_cpu_time_us":2188,"data dirs.run_wall_time_us":101717,"fdatasync":315,"fdatasync_us":9617174,"lbm_read_time_us":1288971,"lbm_reads_1-10_ms":32,"lbm_reads_10-100_ms":41,"lbm_reads_lt_1ms":4641,"lbm_write_time_us":122520,"lbm_writes_lt_1ms":2799,"mutex_wait_us":25,"spinlock_wait_cycles":155264,"tcmalloc_contention_cycles":768,"thread_start_us":677,"threads_started":14,"wal-append.queue_time_us":300}


The flush_threshold_mb flag is set to its default value (1024). Shouldn't the 
flushed file size be ~1GB?


I think increasing the initial RowSet size could reduce compactions and, in 
turn, their impact on other ongoing operations. It may also improve flush 
performance. Is that right? If so, how can I increase the RowSet size?


I'd be grateful if someone could clarify these points for me!


Thanks,
Quanlong







--

Todd Lipcon
Software Engineer, Cloudera





