[Impala-ASF-CR] [DOCS] Major update to Impala + Kudu page

John Russell (Code Review) Tue, 24 Jan 2017 10:40:55 -0800

John Russell has posted comments on this change.

Change subject: [DOCS] Major update to Impala + Kudu page
......................................................................



Patch Set 4:

(32 comments)

Did all of the recent comments. I see some earlier ones still to do.

http://gerrit.cloudera.org:8080/#/c/5649/6/docs/shared/impala_common.xml
File docs/shared/impala_common.xml:

PS6, Line 3741: 
              : 
              : 
              : 
              : 
> We do need it when the changes happen externally.
Done. Per Dimitris's comment, I'm leaving this text unchanged.


http://gerrit.cloudera.org:8080/#/c/5649/6/docs/topics/impala_kudu.xml
File docs/topics/impala_kudu.xml:

PS6, Line 39: he Apache Kudu component.
> That still sounds weird. I'd switch to what Todd suggested.
Done. It is the nature though of language involving trademarks to always sound 
weird.


PS6, Line 45: The default Impala tables use data files stored on HDFS, which 
are ideal for bulk loads
            :       and queries using full-table scans. In contrast, Kudu can 
do efficient queries for data
            :       organized either in data warehouse style (with full table 
scans) or for OLTP-style
            :       workloads (with key-based and range-based lookups for 
single rows or groups of rows). Kudu
            :       tables are suitable for frequent small additions or changes.
> By default, Impala tables are stored in HDFS using various file formats. HD
Done


PS6, Line 55: work 
> work only
Done


PS6, Line 73: In these scenarios (such as for streaming data), it
            :         might be impractical to use Parquet tables because 
Parquet works best with
            :         multi-megabyte data files, requiring substantial overhead 
to replace or reorganize data
            :         files to accomodate frequent additions or changes to 
data. 
> I don't think we should emphasize Parquet here. It is a limitation of the s
Done. I'm going to give a nod to "simplifying the ETL pipeline" in the reworded 
2nd paragraph on this page.


PS6, Line 78: without replacing the entire table contents
> remove. Just say "efficiently".
Done


PS6, Line 79: API
> Maybe mention supported languages (Python, Java, etc).
Done


PS6, Line 138: Data is physically divided automatically by Kudu. You do not 
deal with explicit
             :               partitions, as in typical large Impala tables. New 
data that arrives is organized
             :               based on the data values of each row, not kept 
together in partitions that must be
             :               created and managed individually.
> I don't agree with this description. You have to decide for each table the 
Let me reword to say you get a combination of control and flexibility, since 
you can still make as many narrow range partitions as you like, or specify wide 
range partitions or hashing on top of range partitions.


PS6, Line 147: Data is logically divided, and work is parallelized, based on 
units called
             :               <term>tablets</term> and <term>tablet 
servers</term>.
> This is pretty vague. You need to make the distinction between tablets and 
I'll elaborate a little more, than link to the Kudu docs for full definitions.


PS6, Line 169: 
> How about DROP TABLE?
I'm primarily covering new syntax here. Why don't I say Impala DDL 
"Enhancements" in the title since DROP TABLE is the same syntax as always.


PS6, Line 181: TABLE</codep
> incomplete sentence
Done


PS6, Line 184: familiarize yourself with Kudu-related concepts and syntax first.
             :       </p>
> incomplete sentence
Done. A sentence got inserted in the middle of another sentence, so the 2 
incomplete ones are part of the same original sentence.


PS6, Line 214: y ones 
> What does "arrange" mean? If you refer to mapping of rows to tablets say so
Done. I'll say it maps the rows to tablets.


PS6, Line 215: clauses and are highly selective.
             :             </p>
> That is not necessarily true.
Done. I'll take out the mention of WHERE clauses. I think by definition the 
combination of primary key columns is highly selective because of the 
uniqueness aspect. So having a repetitive column as part of the primary key 
would be wasteful and therefore rare.


PS6, Line 234: 
> You mean the uniqueness and nullability constraints? These are indeed enfor
Done. OK, reworded as "these constraints".


PS6, Line 266: 
> constant expression
Done


PS6, Line 490: 
> colloquial phrasing, how about among rel. db mgmt systems
Done. I'll leave out the word "relational" because I always worry about people 
getting the wrong idea w.r.t. transactions, foreign keys, etc.


PS6, Line 561: 
             :         <title>COMPRESSION Attribute</title>
             : 
             :         <conbody>
> ?
That's a note to myself in an XML comment, it doesn't appear in the output. If 
I specify a bogus keyword with the ENCODING clause, the Impala error message 
tells me that it's expecting keywords including UNKNOWN and GROUP_VARINT, but 
in experiments I couldn't get Impala to accept a CREATE TABLE statement with 
ENCODING UNKNOWN or ENCODING GROUP_VARINT. Maybe there are some obsolete 
keywords left in our parser from earlier revisions of Kudu?


PS6, Line 677: 
> internally
Done.


PS6, Line 714: s.
> remove
Done


Line 740:   PARTITION BY HASH(id) PARTITIONS 50
> missing space
Done


PS6, Line 755: 
> there is no default
Done


PS6, Line 760: 
> multiple
Done


PS6, Line 778: 
             : <codeblock><![CDATA[
> I don't think this is a limitation
Ah right, I think it depends on the number of tablet servers and the cluster 
where I tried it happened to state an upper limit of 60 in the error message. 
But a bigger cluster would have stated a higher limit.  I'll reword the upper 
limit without trying to be too specific.


PS6, Line 856: p>
             : 
> I see what this is saying but I think this sentence will be  confusing. It 
Done. I'll clarify this particular sentence on this pass. Perhaps go into more 
detail about the gap aspect in a subsequent iteration.


PS6, Line 903: <codeph>CREATE TABLE</codeph> syntax displayed by this statement 
includes all the
             :             hash, range, or both clauses that reflect the 
original table structure pl
> this makes it sound like you can't drop a range unless it's empty which is 
Done


PS6, Line 993:   <concept id="kudu_etl">
             : 
> one of these says:
Already discussed in impala_common.xml. I'll leave as-is.


PS6, Line 1099: e. For example, you cannot do a sequence of
              :         <codeph>UPDATE</codeph> statements and only make the 
change visible after all the
              :         statements are finished. Also, if a DML statement fails 
partway through, any rows that
              :         were already inserted, deleted, or changed remain in 
the table; there is no rollback
              :         mechanism to undo the changes.
              :    
> looks like something is missing
I'm going to use some discretion about which assertions to back up and 
illustrate with examples. I'll remove most of the empty <codeblock> elements on 
this page for now, and fill in additional examples in a subsequent iteration.


PS6, Line 1207: 
              :           <p>
              :             Sentry authorization.
              :           </p>
              :         </li>
              : 
> empty code block?
This one in particular I'll delete for now rather than fill in. Because the 
success messages for DELETE / UPDATE / UPSERT don't report number of rows 
affected.


PS6, Line 1250: y using a clause such
> list the limitations?
OK, I'll reuse the same verbiage as under the GRANT statement. (There's a 
#include-like mechanism in this XML dialect so it'll show up as a 1-liner here, 
referencing an ID with the main text in impala_common.xml.)


PS6, Line 1308: 
              : 
              : 
> I don't see any other content
I've been toning down the compare-and-contrast with Parquet in other areas. I 
don't have enough concrete info right now to make a compelling section. I'm 
going to suppress this topic in the output by applying audience="hidden" in the 
<concept> tag.


PS6, Line 1323: 
              : 
              : 
> not sure if this section is useful as-is
Exactly, I'll hide but leave the skeleton here in case it makes sense to 
revisit later.


-- 
To view, visit http://gerrit.cloudera.org:8080/5649
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I76dcb948dab08532fe41326b22ef78d73282db2c
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell <jruss...@cloudera.com>
Gerrit-Reviewer: Ambreen Kazi <ambreen.k...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jdcry...@apache.org>
Gerrit-Reviewer: John Russell <jruss...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes

[Impala-ASF-CR] [DOCS] Major update to Impala + Kudu page

Reply via email to