Dimitris Tsirogiannis has posted comments on this change.

Change subject: [DOCS] Major update to Impala + Kudu page
......................................................................

Patch Set 6:

(15 comments)

Another round of comments. I've seen that not all previous comments have been addressed, so I'll wait for a new patch before continuing this review.

http://gerrit.cloudera.org:8080/#/c/5649/6/docs/topics/impala_kudu.xml
File docs/topics/impala_kudu.xml:

PS6, Line 39: the Apache Kudu component
That still sounds weird. I'd switch to what Todd suggested.

PS6, Line 45: The default Impala tables use data files stored on HDFS, which are ideal for bulk loads
: and queries using full-table scans. In contrast, Kudu can do efficient queries for data
: organized either in data warehouse style (with full table scans) or for OLTP-style
: workloads (with key-based and range-based lookups for single rows or groups of rows). Kudu
: tables are suitable for frequent small additions or changes.
Suggested rewording: By default, Impala tables are stored in HDFS using various file formats. HDFS files allow for fast bulk loads (appends) and full-table scans but cannot support in-place modifications (updates, deletes). Kudu is an alternative storage engine that can be used in Impala and supports both in-place modifications (for OLTP-style operations) and fast scans (for data-warehouse/analytic operations).

PS6, Line 55: work
work only

PS6, Line 73: In these scenarios (such as for streaming data), it
: might be impractical to use Parquet tables because Parquet works best with
: multi-megabyte data files, requiring substantial overhead to replace or reorganize data
: files to accomodate frequent additions or changes to data.
I don't think we should emphasize Parquet here. It is a limitation of the storage engine, not the file format. You can mention Parquet as an example of a commonly used file format.

PS6, Line 78: without replacing the entire table contents
Remove. Just say "efficiently".

PS6, Line 79: API
Maybe mention the supported languages (Python, Java, etc.).

PS6, Line 138: Data is physically divided automatically by Kudu. You do not deal with explicit
: partitions, as in typical large Impala tables. New data that arrives is organized
: based on the data values of each row, not kept together in partitions that must be
: created and managed individually.
I don't agree with this description. For each table, you have to decide on the partitioning scheme and all of its details (number of partitions, actual range partitions, etc.). What you don't control is the mapping of rows to physical nodes.

PS6, Line 147: Data is physically divided, and work is parallelized, based on units called
: <term>tablets</term> and <term>tablet servers</term>.
This is pretty vague. You need to make the distinction between tablets and tablet servers clearer.

PS6, Line 169: CREATE TABLE and ALTER TABLE
How about DROP TABLE?

PS6, Line 181: Because Kudu
Incomplete sentence.

PS6, Line 184: tables have features and properties that do not apply to other kinds of Impala tables,
: familiarize yourself with Kudu-related concepts and syntax first.
Incomplete sentence.

PS6, Line 214: arrange
What does "arrange" mean? If you are referring to the mapping of rows to tablets, say so; otherwise remove it.

PS6, Line 215: The primary key columns are typically ones that are frequently used in <codeph>WHERE</codeph>
: clauses and are highly selective.
That is not necessarily true.

PS6, Line 234: These restrictions
You mean the uniqueness and nullability constraints? These are indeed enforced in Kudu, but I wouldn't call them restrictions. Allowing PRIMARY KEY and NOT NULL only on Kudu tables is a restriction enforced by Impala during analysis.

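
To make the points on Line 138 and Line 234 concrete, here is a minimal sketch of a Kudu table definition in Impala SQL. The table name, column names, and partition boundaries are hypothetical and chosen only for illustration; the exact syntax should be checked against the Impala release being documented. It shows that the partitioning scheme (hash bucket count, range boundaries) is chosen explicitly per table, and that PRIMARY KEY and NOT NULL appear in the column definitions of a Kudu table.

```sql
-- Hypothetical example; names and boundary values are illustrative only.
CREATE TABLE metrics_kudu (
  host  STRING,
  ts    BIGINT,
  value DOUBLE NOT NULL,        -- nullability constraint, enforced by Kudu
  PRIMARY KEY (host, ts)        -- uniqueness constraint, enforced by Kudu
)
-- The partitioning scheme is decided by the table designer:
-- 4 hash buckets on host, combined with 3 explicit ranges on ts.
PARTITION BY HASH (host) PARTITIONS 4,
  RANGE (ts) (
    PARTITION VALUES < 1000,
    PARTITION 1000 <= VALUES < 2000,
    PARTITION 2000 <= VALUES
  )
STORED AS KUDU;
```

What the statement does not specify is which tablet server ends up hosting each resulting tablet; that placement is left to Kudu.
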
PS6, Line 714: evenly
remove

-- 
To view, visit http://gerrit.cloudera.org:8080/5649
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I76dcb948dab08532fe41326b22ef78d73282db2c
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell <jruss...@cloudera.com>
Gerrit-Reviewer: Ambreen Kazi <ambreen.k...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jdcry...@apache.org>
Gerrit-Reviewer: John Russell <jruss...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes