Dimitris Tsirogiannis has posted comments on this change.

Change subject: Major update to Impala + Kudu page
......................................................................
Patch Set 4: (31 comments)

Initial round of comments.

http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_explain.xml
File docs/topics/impala_explain.xml:

PS4, Line 239: The <codeph>EXPLAIN</codeph> statement displays equivalent plan
: information for queries against Kudu tables as for queries
: against HDFS-based tables.
Don't we need to talk about the predicates that get pushed to Kudu?

http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_invalidate_metadata.xml
File docs/topics/impala_invalidate_metadata.xml:

PS4, Line 245: <!-- Anything to say for INVALIDATE METADATA with Kudu? -->
: <!-- <p rev="kudu" conref="../shared/impala_common.xml#common/kudu_blurb"/> -->
Where do we talk about Kudu table metadata? We need to mention that the source of truth for table metadata (e.g. schema, partitions) is Kudu, not HMS. INVALIDATE METADATA and REFRESH cause the table metadata to be reloaded from Kudu, so any external schema changes are reflected in the catalog.

http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_kudu.xml
File docs/topics/impala_kudu.xml:

PS4, Line 149: tablet servers is independent of the number of DataNodes in the cluster.
One question someone might have is whether tablet servers and DataNodes need to be collocated, for functional or performance reasons. It would be nice to comment on that here.

PS4, Line 201: For a
: partitioned Kudu table,
Remove. There are only partitioned Kudu tables.

PS4, Line 221: For example, if an <codeph>INSERT</codeph> operation fails partway through, only some of the
: new rows might be present in the table. You can re-run the same <codeph>INSERT</codeph>, and
: only the missing rows will be added. Or if data in the table is stale, you can run an
: <codeph>UPSERT</codeph> statement that brings the data up to date, without the possibility
: of creating duplicate copies of existing rows.
Why blend the semantics of UPSERT and INSERT into the description of primary keys?
I don't think this section belongs here.

PS4, Line 233: These restrictions on the primary key columns
What about the nullability restrictions on non-primary-key columns? Where are they enforced?

PS4, Line 247: <p>
: Kudu can do extra optimizations for queries that refer to the primary key columns in
: the <codeph>WHERE</codeph> clause. It is not required though to include the primary
: key columns in the <codeph>WHERE</codeph> clause of every query.
: </p>
I am not sure I understand the purpose of this paragraph here. It belongs either in the place where you discuss schema design or, even better, in performance tuning.

PS4, Line 269: PRIMARY KEY
: | [NOT] NULL
: | ENCODING <varname>codec</varname>
: | COMPRESSION <varname>algorithm</varname>
: | DEFAULT <varname>expression</varname>
: | BLOCK_SIZE <varname>number</varname>
Wouldn't it make more sense to first describe the Kudu-specific CREATE TABLE syntax and then analyze its individual components?

PS4, Line 299: You can specify the <codeph>PRIMARY KEY</codeph> attribute either inline in a single
: column definition, or as a separate clause at the end of the column list:
Even though this is the place to formally define the syntax for primary keys, you seem to have more detailed information about them in the paragraph at L238. I would remove that paragraph entirely and move its content here.

PS4, Line 360: <p>
: The notion of primary key only applies to Kudu tables. Every Kudu table requires a
: primary key. The primary key consists of one or more columns. You must specify any
: primary key columns first in the column list.
: </p>
Move to the beginning, or into the general description of primary keys for Kudu tables.

PS4, Line 368: Therefore, only
: include a column in the primary key specification if you are certain of its value at
: the time each row is inserted.
Remove.
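On the primary key syntax discussion above: it might help readers to see the inline and separate-clause forms side by side. A sketch along these lines could accompany the formal syntax (table and column names here are made up, and the PARTITION BY clauses are just placeholders):

```sql
-- Inline form: applies to a single column definition.
CREATE TABLE pk_inline (
  id BIGINT PRIMARY KEY,
  s STRING
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU;

-- Separate clause at the end of the column list; needed when
-- the primary key spans multiple columns. The primary key
-- columns must still come first in the column list.
CREATE TABLE pk_clause (
  id BIGINT,
  ts TIMESTAMP,
  s STRING,
  PRIMARY KEY (id, ts)
)
PARTITION BY HASH (id, ts) PARTITIONS 4
STORED AS KUDU;
```

The later sections on partitioning and column attributes could then refer back to the same example tables.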
PS4, Line 383: Because the primary key columns are typically carefully planned and have known
: characteristics in terms of range and distribution, those columns are usually good
: candidates for using with the <codeph>PARTITION BY</codeph> clause of the
: <codeph>CREATE TABLE</codeph> statement.
Only primary key columns can be used in the PARTITION BY clause. This is a requirement, not a nice-to-have property.

PS4, Line 399: <p>
: For non-Kudu tables, Impala allows any column to contain <codeph>NULL</codeph>
: values, because it is not practical to enforce a <q>not null</q> constraint on HDFS
: data files that could be prepared using external tools and ETL processes.
: </p>
Move below L411.

PS4, Line 407: If a
: missing data field is considered to be a serious problem
I would rephrase this to something like "If an application requires a field to always be specified...".

Line 438: operations.
Maybe we should extend this paragraph to say that, for those reasons, it is recommended that users specify nullability constraints when they are known.

PS4, Line 449: -- This query can return a count of 0 very quickly because
: -- column C3 by definition cannot contain any null values.
: SELECT COUNT(*) FROM c3_has_no_nulls WHERE c3 IS NULL;
Are you suggesting that this query is not executed because we know it returns no results? Have you verified this? The profile of a query like that doesn't seem to suggest any such optimization, but I could be wrong.

PS4, Line 531: By default, each
: column uses the <q>plain</q> encoding where the data is stored unchanged
I would mention the available encodings and maybe some guidelines on which one to use for different data types. Maybe check with Todd on recommendations.

PS4, Line 567: You can specify a compression algorithm to use for each column in a Kudu table. This
: attribute imposes more CPU overhead when retrieving the values than the
: <codeph>ENCODING</codeph> attribute does.
: Therefore, use it primarily for columns
: that are rarely included in result sets, or rarely used in <codeph>WHERE</codeph>
: clauses, or for very long strings that do not benefit much from the less-expensive
: <codeph>ENCODING</codeph> attribute.
Every compression algorithm involves a tradeoff between compression ratio and CPU overhead, and choosing the right algorithm is always tricky and data dependent. Have you discussed these recommendations with Mostafa?

PS4, Line 690: PARTITIONED
Should be PARTITION.

PS4, Line 691: range specification clauses rather than a simple <codeph>PARTITIONED BY</codeph>
: clause that specifies only a column name and creates a new partition for each
: different value.
Why not put the formal syntax here instead of trying to describe the differences between Kudu and HDFS partitioning?

PS4, Line 697: With Kudu tables, all of the columns involved in these clauses must be primary key
: columns. These clauses let you specify different ways to divide the data for each
: column, or even for different value ranges within a column. This flexibility lets you
: avoid problems with uneven distribution of data, where the partitioning scheme for
: HDFS tables might result in some partitions being much larger than others. By setting
: up an effective partitioning scheme for a Kudu table, you can ensure that the work for
: a query can be parallelized evenly across the hosts in a cluster.
Remove this paragraph. You can use Kudu's white paper (section 3.2) for a better description of table partitioning.

PS4, Line 706: <note>
: <p>
: The Impala DDL syntax for Kudu tables is different than in early Kudu versions,
: which used an experimental fork of the Impala code.
: For example, the
: <codeph>DISTRIBUTE BY</codeph> clause is now <codeph>PARTITION BY</codeph>, the
: <codeph>INTO <varname>n</varname> BUCKETS</codeph> clause is now
: <codeph>PARTITIONS <varname>n</varname></codeph>, and the range partitioning syntax
: is reworked to replace the <codeph>SPLIT ROWS</codeph> clause with more expressive
: syntax involving comparison operators.
: </p>
: </note>
I don't understand the point of this paragraph before the formal syntax for partitioning Kudu tables has even been presented. If you want to mention that the old syntax is not supported, add a note at the end.

PS4, Line 727: simplest
Remove.

PS4, Line 727: default type
Why "default"? Both hash and range need to be explicitly specified; neither is implied.

PS4, Line 728: inserted rows are divided up between a fixed number
: of <q>buckets</q> in an essentially random way. Applying the hash function to the
: column values ensures that rows with similar values are evenly distributed, instead of
: clumping together all in the same bucket.
Why not just say that rows are assigned to buckets by applying a hash function to the values of the columns used to define the hash partitioning?

PS4, Line 731: Spreading new rows across the buckets this
: way lets insertion operations work in parallel across all the tablet servers.
You need to mention the drawback of hash partitioning as well, i.e. queries with range predicates may end up scanning multiple tablets.

PS4, Line 743: -- No matter how the values are distributed in the source table, the data will be
: -- evenly distributed between the buckets in the destination table.
That claim is not universally true. I would just remove it or make it case-specific, e.g. since the ids are unique, you expect the rows to be evenly distributed across partitions.

PS4, Line 765: for when you know the
: expected distribution of values in a column and can make informed decisions about
: how to divide them.
That's not necessarily the primary reason for using range partitioning. You may want to stick to the formal description and syntax here, and add a section about use cases (schema design) later.

PS4, Line 767: more <codeph>RANGE</codeph> clauses to the
: <codeph>CREATE TABLE</codeph> statement, following the <codeph>PARTITION BY</codeph>
Can we add the formal syntax here?

PS4, Line 774: <codeph>VALUE</codeph> or <codeph>VALUES</codeph>
You need to explain when VALUE vs. VALUES is used (single- vs. multi-column range) and give proper examples for both cases.

http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_tables.xml
File docs/topics/impala_tables.xml:

PS4, Line 293: Tables using the Apache Kudu storage system are treated specially, because Kudu manages its data independently of HDFS files.
: Some information about the table is stored in the metastore database for use by Impala. Other table metadata is
: managed internally by Kudu.
Where do we talk about external vs. managed Kudu tables?

--
To view, visit http://gerrit.cloudera.org:8080/5649
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I76dcb948dab08532fe41326b22ef78d73282db2c
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell <jruss...@cloudera.com>
Gerrit-Reviewer: Ambreen Kazi <ambreen.k...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jdcry...@apache.org>
Gerrit-Reviewer: John Russell <jruss...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes