John Russell has posted comments on this change. Change subject: [DOCS] Major update to Impala + Kudu page ......................................................................
Patch Set 4: (32 comments) Did all of the recent comments. I see some earlier ones still to do. http://gerrit.cloudera.org:8080/#/c/5649/6/docs/shared/impala_common.xml File docs/shared/impala_common.xml: PS6, Line 3741: : : : : > We do need it when the changes happen externally. Done. Per Dimitris's comment, I'm leaving this text unchanged. http://gerrit.cloudera.org:8080/#/c/5649/6/docs/topics/impala_kudu.xml File docs/topics/impala_kudu.xml: PS6, Line 39: he Apache Kudu component. > That still sounds weird. I'd switch to what Todd suggested. Done. It is the nature though of language involving trademarks to always sound weird. PS6, Line 45: The default Impala tables use data files stored on HDFS, which are ideal for bulk loads : and queries using full-table scans. In contrast, Kudu can do efficient queries for data : organized either in data warehouse style (with full table scans) or for OLTP-style : workloads (with key-based and range-based lookups for single rows or groups of rows). Kudu : tables are suitable for frequent small additions or changes. > By default, Impala tables are stored in HDFS using various file formats. HD Done PS6, Line 55: work > work only Done PS6, Line 73: In these scenarios (such as for streaming data), it : might be impractical to use Parquet tables because Parquet works best with : multi-megabyte data files, requiring substantial overhead to replace or reorganize data : files to accomodate frequent additions or changes to data. > I don't think we should emphasize Parquet here. It is a limitation of the s Done. I'm going to give a nod to "simplifying the ETL pipeline" in the reworded 2nd paragraph on this page. PS6, Line 78: without replacing the entire table contents > remove. Just say "efficiently". Done PS6, Line 79: API > Maybe mention supported languages (Python, Java, etc). Done PS6, Line 138: Data is physically divided automatically by Kudu. You do not deal with explicit : partitions, as in typical large Impala tables. New data that arrives is organized : based on the data values of each row, not kept together in partitions that must be : created and managed individually. > I don't agree with this description. You have to decide for each table the Let me reword to say you get a combination of control and flexibility, since you can still make as many narrow range partitions as you like, or specify wide range partitions or hashing on top of range partitions. PS6, Line 147: Data is logically divided, and work is parallelized, based on units called : <term>tablets</term> and <term>tablet servers</term>. > This is pretty vague. You need to make the distinction between tablets and I'll elaborate a little more, than link to the Kudu docs for full definitions. PS6, Line 169: > How about DROP TABLE? I'm primarily covering new syntax here. Why don't I say Impala DDL "Enhancements" in the title since DROP TABLE is the same syntax as always. PS6, Line 181: TABLE</codep > incomplete sentence Done PS6, Line 184: familiarize yourself with Kudu-related concepts and syntax first. : </p> > incomplete sentence Done. A sentence got inserted in the middle of another sentence, so the 2 incomplete ones are part of the same original sentence. PS6, Line 214: y ones > What does "arrange" mean? If you refer to mapping of rows to tablets say so Done. I'll say it maps the rows to tablets. PS6, Line 215: clauses and are highly selective. : </p> > That is not necessarily true. Done. I'll take out the mention of WHERE clauses. I think by definition the combination of primary key columns is highly selective because of the uniqueness aspect. So having a repetitive column as part of the primary key would be wasteful and therefore rare. PS6, Line 234: > You mean the uniqueness and nullability constraints? These are indeed enfor Done. OK, reworded as "these constraints". PS6, Line 266: > constant expression Done PS6, Line 490: > colloquial phrasing, how about among rel. db mgmt systems Done. I'll leave out the word "relational" because I always worry about people getting the wrong idea w.r.t. transactions, foreign keys, etc. PS6, Line 561: : <title>COMPRESSION Attribute</title> : : <conbody> > ? That's a note to myself in an XML comment, it doesn't appear in the output. If I specify a bogus keyword with the ENCODING clause, the Impala error message tells me that it's expecting keywords including UNKNOWN and GROUP_VARINT, but in experiments I couldn't get Impala to accept a CREATE TABLE statement with ENCODING UNKNOWN or ENCODING GROUP_VARINT. Maybe there are some obsolete keywords left in our parser from earlier revisions of Kudu? PS6, Line 677: > internally Done. PS6, Line 714: s. > remove Done Line 740: PARTITION BY HASH(id) PARTITIONS 50 > missing space Done PS6, Line 755: > there is no default Done PS6, Line 760: > multiple Done PS6, Line 778: : <codeblock><![CDATA[ > I don't think this is a limitation Ah right, I think it depends on the number of tablet servers and the cluster where I tried it happened to state an upper limit of 60 in the error message. But a bigger cluster would have stated a higher limit. I'll reword the upper limit without trying to be too specific. PS6, Line 856: p> : > I see what this is saying but I think this sentence will be confusing. It Done. I'll clarify this particular sentence on this pass. Perhaps go into more detail about the gap aspect in a subsequent iteration. PS6, Line 903: <codeph>CREATE TABLE</codeph> syntax displayed by this statement includes all the : hash, range, or both clauses that reflect the original table structure pl > this makes it sound like you can't drop a range unless it's empty which is Done PS6, Line 993: <concept id="kudu_etl"> : > one of these says: Already discussed in impala_common.xml. I'll leave as-is. PS6, Line 1099: e. For example, you cannot do a sequence of : <codeph>UPDATE</codeph> statements and only make the change visible after all the : statements are finished. Also, if a DML statement fails partway through, any rows that : were already inserted, deleted, or changed remain in the table; there is no rollback : mechanism to undo the changes. : > looks like something is missing I'm going to use some discretion about which assertions to back up and illustrate with examples. I'll remove most of the empty <codeblock> elements on this page for now, and fill in additional examples in a subsequent iteration. PS6, Line 1207: : <p> : Sentry authorization. : </p> : </li> : > empty code block? This one in particular I'll delete for now rather than fill in. Because the success messages for DELETE / UPDATE / UPSERT don't report number of rows affected. PS6, Line 1250: y using a clause such > list the limitations? OK, I'll reuse the same verbiage as under the GRANT statement. (There's a #include-like mechanism in this XML dialect so it'll show up as a 1-liner here, referencing an ID with the main text in impala_common.xml.) PS6, Line 1308: : : > I don't see any other content I've been toning down the compare-and-contrast with Parquet in other areas. I don't have enough concrete info right now to make a compelling section. I'm going to suppress this topic in the output by applying audience="hidden" in the <concept> tag. PS6, Line 1323: : : > not sure if this section is useful as-is Exactly, I'll hide but leave the skeleton here in case it makes sense to revisit later. -- To view, visit http://gerrit.cloudera.org:8080/5649 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I76dcb948dab08532fe41326b22ef78d73282db2c Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: John Russell <jruss...@cloudera.com> Gerrit-Reviewer: Ambreen Kazi <ambreen.k...@cloudera.com> Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com> Gerrit-Reviewer: Jean-Daniel Cryans <jdcry...@apache.org> Gerrit-Reviewer: John Russell <jruss...@cloudera.com> Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com> Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-HasComments: Yes