[Impala-ASF-CR] [DOCS] Major update to Impala + Kudu page

Dimitris Tsirogiannis (Code Review) Fri, 20 Jan 2017 13:21:56 -0800

Dimitris Tsirogiannis has posted comments on this change.

Change subject: [DOCS] Major update to Impala + Kudu page
......................................................................



Patch Set 6:

(15 comments)

Another round of comments. I've seen that not all previous comments have been 
addressed, so I'll wait for a new patch before continuing this review.

http://gerrit.cloudera.org:8080/#/c/5649/6/docs/topics/impala_kudu.xml
File docs/topics/impala_kudu.xml:

PS6, Line 39: the Apache Kudu component
That still sounds weird. I'd switch to what Todd suggested.


PS6, Line 45: The default Impala tables use data files stored on HDFS, which 
are ideal for bulk loads
            :       and queries using full-table scans. In contrast, Kudu can 
do efficient queries for data
            :       organized either in data warehouse style (with full table 
scans) or for OLTP-style
            :       workloads (with key-based and range-based lookups for 
single rows or groups of rows). Kudu
            :       tables are suitable for frequent small additions or changes.
By default, Impala tables are stored in HDFS using various file formats. HDFS 
files allow for fast bulk loads (appends) and full-table scans but cannot 
support in-place modifications (updates, deletes). Kudu is an alternative 
storage engine that can be used in Impala and supports both in-place updates 
(for OLTP-style operations) and fast scans (for data-warehouse/analytic 
operations).


PS6, Line 55: work 
work only


PS6, Line 73: In these scenarios (such as for streaming data), it
            :         might be impractical to use Parquet tables because 
Parquet works best with
            :         multi-megabyte data files, requiring substantial overhead 
to replace or reorganize data
            :         files to accomodate frequent additions or changes to 
data. 
I don't think we should emphasize Parquet here. It is a limitation of the 
storage engine, not the file format. You can mention Parquet as an example of a 
commonly used file format.


PS6, Line 78: without replacing the entire table contents
Remove. Just say "efficiently".


PS6, Line 79: API
Maybe mention the supported languages (Python, Java, etc.).


PS6, Line 138: Data is physically divided automatically by Kudu. You do not 
deal with explicit
             :               partitions, as in typical large Impala tables. New 
data that arrives is organized
             :               based on the data values of each row, not kept 
together in partitions that must be
             :               created and managed individually.
I don't agree with this description. You have to decide for each table the 
partitioning scheme and all its details (number of partitions, actual range 
partitions, etc). What you don't control is the mapping of rows to physical 
nodes.
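
To make the distinction concrete, here is a toy Python sketch. It is not Kudu 
code: the function name, the CRC32 hash, and the example bounds are all made up 
for illustration. The point is that the user supplies the number of hash 
buckets and the range bounds, and that alone determines which partition a row 
falls into. Which node ends up hosting that partition is decided elsewhere.

```python
import bisect
import zlib


def partition_for_row(key, ts, num_hash_buckets, range_bounds):
    """Return (hash_bucket, range_index) for a row.

    Toy model only: CRC32 stands in for whatever hash the storage
    engine really uses. range_bounds is a sorted list of split points
    chosen by the user, giving len(range_bounds) + 1 range partitions.
    """
    hash_bucket = zlib.crc32(str(key).encode("utf-8")) % num_hash_buckets
    # bisect_right puts ts == bound into the upper range, so each
    # range is [lower, upper).
    range_index = bisect.bisect_right(range_bounds, ts)
    return hash_bucket, range_index


# Hypothetical user-chosen scheme: 4 hash buckets x 3 ranges = 12 partitions.
bounds = [1000, 2000]
print(partition_for_row("user42", 1500, 4, bounds))
```

Nothing in this sketch says where a partition lives, which is exactly the part 
the user does not control.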


PS6, Line 147: Data is physically divided, and work is parallelized, based on 
units called
             :               <term>tablets</term> and <term>tablet 
servers</term>.
This is pretty vague. You need to make the distinction between tablets and 
tablet servers clearer.
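
Roughly: a tablet is a piece of one table's data (one partition), while a 
tablet server is a process that hosts replicas of many tablets. A toy Python 
sketch of that relationship follows. Round-robin placement and the replica 
count of 3 are assumed purely for illustration, all names are hypothetical, and 
the real placement policy is up to Kudu.

```python
def place_tablets(tablet_ids, server_names, replicas=3):
    """Map each tablet to `replicas` distinct servers, round-robin.

    Toy model: tablets are units of data and servers are their hosts.
    This does not reproduce Kudu's actual placement logic.
    """
    if replicas > len(server_names):
        raise ValueError("need at least as many servers as replicas")
    placement = {}
    for i, tablet in enumerate(tablet_ids):
        placement[tablet] = [
            server_names[(i + r) % len(server_names)] for r in range(replicas)
        ]
    return placement


tablets = ["t0", "t1", "t2", "t3"]          # hypothetical tablet ids
servers = ["ts-a", "ts-b", "ts-c", "ts-d"]  # hypothetical tablet servers
for tablet, hosts in place_tablets(tablets, servers).items():
    print(tablet, "->", hosts)
```

Each server shows up under several tablets and each tablet under several 
servers, which is the many-to-many relationship the paragraph should spell out.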


PS6, Line 169: CREATE TABLE and ALTER TABLE
How about DROP TABLE?


PS6, Line 181: Because Kudu
incomplete sentence


PS6, Line 184: tables have features and properties that do not apply to other 
kinds of Impala tables,
             :         familiarize yourself with Kudu-related concepts and 
syntax first.
incomplete sentence


PS6, Line 214: arrange
What does "arrange" mean? If you are referring to the mapping of rows to 
tablets, say so; otherwise remove it.


PS6, Line 215: The primary key columns are typically ones that are frequently 
used in <codeph>WHERE</codeph>
             :               clauses and are highly selective.
That is not necessarily true.


PS6, Line 234: These restrictions
You mean the uniqueness and nullability constraints? These are indeed enforced 
in Kudu, but I wouldn't call them restrictions. Allowing PRIMARY KEY and NOT 
NULL only on Kudu tables is a restriction enforced by Impala during 
analysis.


PS6, Line 714: evenly
remove


-- 
To view, visit http://gerrit.cloudera.org:8080/5649
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I76dcb948dab08532fe41326b22ef78d73282db2c
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell <jruss...@cloudera.com>
Gerrit-Reviewer: Ambreen Kazi <ambreen.k...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jdcry...@apache.org>
Gerrit-Reviewer: John Russell <jruss...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes
