Dimitris Tsirogiannis has posted comments on this change.

Change subject: Major update to Impala + Kudu page
......................................................................
Patch Set 4: (31 comments)

Initial round of comments.

http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_explain.xml
File docs/topics/impala_explain.xml:

PS4, Line 239: The <codeph>EXPLAIN</codeph> statement displays equivalent plan
: information for queries against Kudu tables as for queries
: against HDFS-based tables.
Don't we need to talk about the predicates that get pushed to Kudu?

http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_invalidate_metadata.xml
File docs/topics/impala_invalidate_metadata.xml:

PS4, Line 245: <!-- Anything to say for INVALIDATE METADATA with Kudu? -->
: <!-- <p rev="kudu" conref="../shared/impala_common.xml#common/kudu_blurb"/> -->
Where do we talk about Kudu table metadata? We need to mention that the source of truth for table metadata (e.g. schema, partitions) is Kudu, not HMS. INVALIDATE METADATA and REFRESH cause the table metadata to be reloaded from Kudu, so any external schema changes are reflected in the catalog.

http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_kudu.xml
File docs/topics/impala_kudu.xml:

PS4, Line 149: tablet servers is independent of the number of DataNodes in the cluster.
One question someone might have is whether tablet servers and DataNodes need to be collocated, for functional or performance reasons. It would be nice to comment on that here.

PS4, Line 201: For a
: partitioned Kudu table,
Remove. There are only partitioned Kudu tables.

PS4, Line 221: For example, if an <codeph>INSERT</codeph> operation fails partway through, only some of the
: new rows might be present in the table. You can re-run the same <codeph>INSERT</codeph>, and
: only the missing rows will be added. Or if data in the table is stale, you can run an
: <codeph>UPSERT</codeph> statement that brings the data up to date, without the possibility
: of creating duplicate copies of existing rows.
Why blend the semantics of UPSERT and INSERT into the description of primary keys?
I don't think this section belongs here.

PS4, Line 233: These restrictions on the primary key columns
What about the nullability restrictions on non-primary-key columns? Where are they enforced?

PS4, Line 247: <p>
: Kudu can do extra optimizations for queries that refer to the primary key columns in
: the <codeph>WHERE</codeph> clause. It is not required though to include the primary
: key columns in the <codeph>WHERE</codeph> clause of every query.
: </p>
I am not sure I understand the purpose of this paragraph here. It belongs either in the place where you discuss schema design or, even better, in performance tuning.

PS4, Line 269: PRIMARY KEY
: | [NOT] NULL
: | ENCODING <varname>codec</varname>
: | COMPRESSION <varname>algorithm</varname>
: | DEFAULT <varname>expression</varname>
: | BLOCK_SIZE <varname>number</varname>
Wouldn't it make more sense to first describe the Kudu-specific CREATE TABLE syntax and then analyze its individual components?

PS4, Line 299: You can specify the <codeph>PRIMARY KEY</codeph> attribute either inline in a single
: column definition, or as a separate clause at the end of the column list:
Even though this is the place to formally define the syntax for primary keys, you seem to have more detailed information about them in the paragraph at L238. I would remove that paragraph entirely and move its content here.

PS4, Line 360: <p>
: The notion of primary key only applies to Kudu tables. Every Kudu table requires a
: primary key. The primary key consists of one or more columns. You must specify any
: primary key columns first in the column list.
: </p>
Move to the beginning, or into the general description of primary keys for Kudu tables.

PS4, Line 368: Therefore, only
: include a column in the primary key specification if you are certain of its value at
: the time each row is inserted.
Remove.
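On the primary key syntax discussion above: it might help readers to see the inline and separate-clause forms side by side. A sketch along these lines could accompany the formal syntax (table and column names here are made up, and the PARTITION BY clauses are just placeholders):

```sql
-- Inline form: applies to a single column definition.
CREATE TABLE pk_inline (
  id BIGINT PRIMARY KEY,
  s STRING
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU;

-- Separate clause at the end of the column list; needed when
-- the primary key spans multiple columns. The primary key
-- columns must still come first in the column list.
CREATE TABLE pk_clause (
  id BIGINT,
  ts TIMESTAMP,
  s STRING,
  PRIMARY KEY (id, ts)
)
PARTITION BY HASH (id, ts) PARTITIONS 4
STORED AS KUDU;
```

The later sections on partitioning and column attributes could then refer back to the same example tables.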
PS4, Line 383: Because the primary key columns are typically carefully planned and have known
: characteristics in terms of range and distribution, those columns are usually good
: candidates for using with the <codeph>PARTITION BY</codeph> clause of the
: <codeph>CREATE TABLE</codeph> statement.
Only primary key columns can be used in the PARTITION BY clause. This is a requirement, not a nice-to-have property.

PS4, Line 399: <p>
: For non-Kudu tables, Impala allows any column to contain <codeph>NULL</codeph>
: values, because it is not practical to enforce a <q>not null</q> constraint on HDFS
: data files that could be prepared using external tools and ETL processes.
: </p>
Move below L411.

PS4, Line 407: If a
: missing data field is considered to be a serious problem
I would rephrase this to something like "If an application requires a field to always be specified...".

Line 438: operations.
Maybe we should extend this paragraph to say that, for those reasons, it is recommended that users specify nullability constraints when they are known.

PS4, Line 449: -- This query can return a count of 0 very quickly because
: -- column C3 by definition cannot contain any null values.
: SELECT COUNT(*) FROM c3_has_no_nulls WHERE c3 IS NULL;
Are you suggesting that this query is not executed because we know it returns no results? Have you verified this? The profile of a query like that doesn't seem to suggest any such optimization, but I could be wrong.

PS4, Line 531: By default, each
: column uses the <q>plain</q> encoding where the data is stored unchanged
I would mention the available encodings and maybe some guidelines on which one to use for different data types. Maybe check with Todd on recommendations.

PS4, Line 567: You can specify a compression algorithm to use for each column in a Kudu table. This
: attribute imposes more CPU overhead when retrieving the values than the
: <codeph>ENCODING</codeph> attribute does.
: Therefore, use it primarily for columns
: that are rarely included in result sets, or rarely used in <codeph>WHERE</codeph>
: clauses, or for very long strings that do not benefit much from the less-expensive
: <codeph>ENCODING</codeph> attribute.
Every compression algorithm involves a tradeoff between compression ratio and CPU overhead, and choosing the right algorithm is always tricky and data dependent. Have you discussed these recommendations with Mostafa?

PS4, Line 690: PARTITIONED
Should be PARTITION.

PS4, Line 691: range specification clauses rather than a simple <codeph>PARTITIONED BY</codeph>
: clause that specifies only a column name and creates a new partition for each
: different value.
Why not put the formal syntax here instead of trying to describe the differences between Kudu and HDFS partitioning?

PS4, Line 697: With Kudu tables, all of the columns involved in these clauses must be primary key
: columns. These clauses let you specify different ways to divide the data for each
: column, or even for different value ranges within a column. This flexibility lets you
: avoid problems with uneven distribution of data, where the partitioning scheme for
: HDFS tables might result in some partitions being much larger than others. By setting
: up an effective partitioning scheme for a Kudu table, you can ensure that the work for
: a query can be parallelized evenly across the hosts in a cluster.
Remove this paragraph. You can use Kudu's white paper (section 3.2) for a better description of table partitioning.

PS4, Line 706: <note>
: <p>
: The Impala DDL syntax for Kudu tables is different than in early Kudu versions,
: which used an experimental fork of the Impala code.
: For example, the
: <codeph>DISTRIBUTE BY</codeph> clause is now <codeph>PARTITION BY</codeph>, the
: <codeph>INTO <varname>n</varname> BUCKETS</codeph> clause is now
: <codeph>PARTITIONS <varname>n</varname></codeph>, and the range partitioning syntax
: is reworked to replace the <codeph>SPLIT ROWS</codeph> clause with more expressive
: syntax involving comparison operators.
: </p>
: </note>
I don't understand the point of this paragraph before the formal syntax for partitioning Kudu tables has even been presented. If you want to mention that the old syntax is not supported, add a note at the end.

PS4, Line 727: simplest
Remove.

PS4, Line 727: default type
Why "default"? Both hash and range need to be explicitly specified; neither is implied.

PS4, Line 728: inserted rows are divided up between a fixed number
: of <q>buckets</q> in an essentially random way. Applying the hash function to the
: column values ensures that rows with similar values are evenly distributed, instead of
: clumping together all in the same bucket.
Why not just say that rows are assigned to buckets by applying a hash function to the values of the columns used to define the hash partitioning?

PS4, Line 731: Spreading new rows across the buckets this
: way lets insertion operations work in parallel across all the tablet servers.
You need to mention the drawback of hash partitioning as well, i.e. queries with range predicates may end up scanning multiple tablets.

PS4, Line 743: -- No matter how the values are distributed in the source table, the data will be
: -- evenly distributed between the buckets in the destination table.
That claim is not universally true. I would just remove it or make it case-specific, e.g. since the ids are unique, you expect the rows to be evenly distributed across partitions.

PS4, Line 765: for when you know the
: expected distribution of values in a column and can make informed decisions about
: how to divide them.
That's not necessarily the primary reason for using range partitioning. You may want to stick to the formal description and syntax here, and add a section about use cases (schema design) later.

PS4, Line 767: more <codeph>RANGE</codeph> clauses to the
: <codeph>CREATE TABLE</codeph> statement, following the <codeph>PARTITION BY</codeph>
Can we add the formal syntax here?

PS4, Line 774: <codeph>VALUE</codeph> or <codeph>VALUES</codeph>
You need to explain when VALUE vs. VALUES is used (single- vs. multi-column range) and give proper examples for both cases.

http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_tables.xml
File docs/topics/impala_tables.xml:

PS4, Line 293: Tables using the Apache Kudu storage system are treated specially, because Kudu manages its data independently of HDFS files.
: Some information about the table is stored in the metastore database for use by Impala. Other table metadata is
: managed internally by Kudu.
Where do we talk about external vs. managed Kudu tables?

--
To view, visit http://gerrit.cloudera.org:8080/5649
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I76dcb948dab08532fe41326b22ef78d73282db2c
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell <jruss...@cloudera.com>
Gerrit-Reviewer: Ambreen Kazi <ambreen.k...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jdcry...@apache.org>
Gerrit-Reviewer: John Russell <jruss...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes