On Wed, Nov 30, 2016 at 6:26 AM, Weber, Richard <riwe...@akamai.com> wrote:

> Hi All,
>
> I'm trying to figure out the right/best/easiest way to find out how much
> space a given table is taking up on the various tablet servers.  I'm
> really looking to find:
> * Physical space taken on all disks
> * Logical space taken on all disks
> * Sizing of Indices/Bloom Filters, etc.
> * Sizing with and without replication.
>
> I'm trying to run an apples-to-apples comparison of how big data is when
> stored in Kudu compared to storing it in its native format (gzipped CSV)
> as well as in Parquet format on HDFS.  Ultimately, I'd like to be able to
> do reporting on the different tables to say Table X is taking up Y TB,
> where Y consists of A physical size, B index, C bloom, etc.
>
> Looking through the Web UI, I don't really see any good summary of how much
> space the entire table is taking.  It seems like I'd need to walk through
> each tablet server, connect to its metrics page, and generate the summary
> information myself.
>
>
Yeah, unfortunately we don't expose much of this information in a useful
way at the moment. The metrics page is the best source of info for the
various sizes, though even those values are often estimates rather than
exact figures.
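
If you do want to aggregate it yourself today, a small script that walks
each tablet server's /metrics endpoint and sums a per-tablet size metric by
table would get you most of the way. Here's a rough, untested sketch -- the
port, the JSON layout, and the metric name ("on_disk_size") are assumptions
you'd want to verify against your build's /metrics output:

#!/usr/bin/env python3
# Rough sketch: sum a per-tablet size metric by table across tablet servers.
# Assumptions to verify against your build: the tserver webserver is on port
# 8050, /metrics returns a JSON array of entities, tablet entities carry a
# "table_name" attribute, and a size metric named "on_disk_size" exists.
import json
from collections import defaultdict
from urllib.request import urlopen

TSERVERS = ["tserver1.example.com", "tserver2.example.com"]  # your hosts
SIZE_METRIC = "on_disk_size"  # assumed metric name; check /metrics

sizes = defaultdict(int)
for host in TSERVERS:
    with urlopen("http://%s:8050/metrics" % host) as resp:
        entities = json.load(resp)
    for entity in entities:
        if entity.get("type") != "tablet":
            continue
        table = entity.get("attributes", {}).get("table_name", "<unknown>")
        for m in entity.get("metrics", []):
            if m.get("name") == SIZE_METRIC:
                sizes[table] += m.get("value", 0)

for table, nbytes in sorted(sizes.items()):
    # Summing across all replicas, so this is the post-replication size.
    print("%s: %.1f MB" % (table, nbytes / (1024.0 * 1024)))

Note that summing across replicas counts every copy, so divide by your
replication factor if you want a pre-replication number.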

In terms of cross-server metrics aggregation, it's been our philosophy so
far that we should try to avoid doing a poor job of things that other
systems are likely to do better -- metrics aggregation being one such
thing. It's likely we'll add simple aggregation of table sizes, since that
info is very useful for SQL engines to do JOIN ordering, but I don't think
we'd start adding the more granular breakdowns like indexes, blooms, etc.

If your use case is a one-time experiment to understand the data volumes,
it would be pretty straightforward to write a tool to do this kind of
summary against the on-disk metadata of a tablet server. For example, you
could load the tablet metadata, group the blocks by type/column, and then
aggregate however you prefer (a rough sketch of that grouping step is
below). Unfortunately this would give you only the physical size and not
the logical size, since you'd have to scan the actual data to know its
uncompressed sizes.
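
Just to make the shape of that concrete, here's a sketch of the
grouping/aggregation step only. The load_tablet_superblock() helper is
purely hypothetical -- Kudu doesn't ship a Python API for the on-disk
metadata, so you'd have to implement that part yourself:

from collections import defaultdict

def load_tablet_superblock(metadata_dir):
    # Hypothetical placeholder, not a real Kudu API: parse the tablet
    # metadata under metadata_dir and yield one dict per block, e.g.
    #   {"column": "col_name" or None,
    #    "block_type": "data" / "bloom" / "index" / ...,
    #    "size_bytes": 12345}
    raise NotImplementedError("implement against the on-disk metadata")

def summarize(metadata_dir):
    by_type = defaultdict(int)    # physical bytes per block type
    by_column = defaultdict(int)  # physical bytes of data blocks per column
    for blk in load_tablet_superblock(metadata_dir):
        by_type[blk["block_type"]] += blk["size_bytes"]
        if blk["block_type"] == "data" and blk["column"] is not None:
            by_column[blk["column"]] += blk["size_bytes"]
    return by_type, by_column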

If you have any interest in helping to build such a tool, I'd be happy to
point you in the right direction. Otherwise, let's file a JIRA to add this
as a new feature in a future release.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera
