Dimitris Tsirogiannis has posted comments on this change.

Change subject: [DOCS] Major update to Impala + Kudu page
......................................................................


Patch Set 7:

(25 comments)

http://gerrit.cloudera.org:8080/#/c/5649/7/docs/shared/impala_common.xml
File docs/shared/impala_common.xml:

PS7, Line 3736: The <codeph>REFRESH</codeph> and <codeph>INVALIDATE 
METADATA</codeph>
              :         statements are needed less frequently for Kudu tables 
than for
              :         HDFS-backed tables. Neither statement is needed when 
data is
              :         added to, removed, or updated in a Kudu table, even if 
the changes
              :         are made directly to Kudu through a client program 
using the Kudu API.
              :         Run <codeph>REFRESH 
<varname>table_name</varname></codeph> or
              :         <codeph>INVALIDATE METADATA 
<varname>table_name</varname></codeph>
              :         for a Kudu table only after making a change to the Kudu 
table schema,
              :         such as adding or dropping a column, by a mechanism 
other than
              :         Impala.
I would start by saying when to use REFRESH and INVALIDATE METADATA for Kudu 
table and then talk about the cases where it is not required.


PS7, Line 3751: internal
Don't we also call them "managed"? Not sure.


PS7, Line 3749: The distinction between internal and external tables has some 
special
              :         details for Kudu tables. Tables created entirely 
through Impala are
              :         internal tables.
It also has "special details" for non-Kudu tables. I would simply say "Impala 
supports both internal (managed) and external Kudu tables. Internal tables.... "


http://gerrit.cloudera.org:8080/#/c/5649/7/docs/topics/impala_alter_table.xml
File docs/topics/impala_alter_table.xml:

PS7, Line 833: can specify
"can only specify"


http://gerrit.cloudera.org:8080/#/c/5649/7/docs/topics/impala_create_table.xml
File docs/topics/impala_create_table.xml:

PS7, Line 338: clauses
I think you forgot caching which is also not supported for Kudu.


PS7, Line 352: <codeph>CREATE TABLE AS SELECT</codeph>
This is wrong. We do support CTAS for internal Kudu tables. See 
AnalyzeDDLTest.java for some examples.


http://gerrit.cloudera.org:8080/#/c/5649/7/docs/topics/impala_drop_table.xml
File docs/topics/impala_drop_table.xml:

PS7, Line 160: managed
It's good to be consistent. In some places we use "managed" and in others 
"internal".


http://gerrit.cloudera.org:8080/#/c/5649/7/docs/topics/impala_explain.xml
File docs/topics/impala_explain.xml:

PS7, Line 250:  
"in a SCAN KUDU node"


PS7, Line 251: , and might involve transmitting
             :       non-matching rows that are filtered out on the Impala side.
remove


http://gerrit.cloudera.org:8080/#/c/5649/7/docs/topics/impala_kudu.xml
File docs/topics/impala_kudu.xml:

PS7, Line 76: scan performance close to that of Parquet
Make sure you check with Mostafa about this claim.


PS7, Line 147: The work is parallelized
             :               across units of computing called <term>tablet 
servers</term>.
I believe the unit of computing is the tablet not the tablet server, unless I 
am misinterpreting the term. Please check with Todd.


PS7, Line 149: between the tablets and tablet servers.
I think it would be nice to mention that the user is not responsible for 
managing this mapping of tablet to tablet server.


PS7, Line 168: (CREATE TABLE and ALTER TABLE)
It is weird to mention CREATE without DROP. Please remove this and mention in 
the paragraph below that you can also drop Kudu tables through Impala.


PS7, Line 397: <p>
             :             For non-Kudu tables, Impala allows any column to 
contain <codeph>NULL</codeph>
             :             values, because it is not practical to enforce a 
<q>not null</q> constraint on HDFS
             :             data files that could be prepared using external 
tools and ETL processes.
             :           </p>
             : 
             :           <p 
conref="../shared/impala_common.xml#common/pk_implies_not_null"/>
             : 
             :           <p>
             :             For example, a table containing geographic 
information might require the latitude
             :             and longitude coordinates to always be specified. 
Other attributes might be allowed
             :             to be <codeph>NULL</codeph>. For example, a location 
might not have a designated
             :             place name, its altitude might be unimportant, and 
its population might be initially
             :             unknown, to be filled in later.
             :           </p>
You may want to swap these two paragraphs.


PS7, Line 717: PARTITION BY
Impala still uses "PARTITIONED BY" for HDFS tables.


PS7, Line 727: . By setting
             :           up an effective partitioning scheme for a Kudu table, 
you can ensure that the work for
             :           a query can be parallelized evenly across the hosts in 
a cluster.
Remove. Sometimes the goal is to scan as little as possible. You can say that 
more flexible schema designs allows you to perform more elaborate tuning based 
on the characteristics of the workload or something along these lines.


PS7, Line 936: To see the distribution of data in a Kudu table across the 
underlying buckets and
             :             partitions, use the <codeph>SHOW TABLE 
STATS</codeph> statement.
This is unfortunately not accurate. SHOW TABLE STATS will only show the 
distribution of replicas to tablet servers. Also, it will always show -1 in the 
#rows column and the key ranges in hex, so the output is confusing. I would 
just remove this paragraph.


PS7, Line 1122: change
"changes"


PS7, Line 1159: strong consistency for order of operations
I am not sure I know what that means.


PS7, Line 1159: total
              :         success or total failure of a multi-row statement
This is "atomicity". Maybe just mention atomic multi-row statements.


PS7, Line 1160: or data that is read while a write
              :         operation is in progress
Isolation.


PS7, Line 1288: <title>Memory Usage for Operations on Kudu Tables</title>
              : 
              :       <conbody>
              : 
              :         <p>
              :           The Apache Kudu architecture, topology, and data 
storage techniques result in
              :           different patterns of memory usage for Impala 
statements than with HDFS-backed tables.
              :         </p>
I don't find this particularly informative and suggest we remove it unless 
someone objects.


http://gerrit.cloudera.org:8080/#/c/5649/7/docs/topics/impala_literals.xml
File docs/topics/impala_literals.xml:

PS7, Line 408: most Impala tables
Impala tables backed by HDFS or S3? "most" is kind of vague


http://gerrit.cloudera.org:8080/#/c/5649/7/docs/topics/impala_revoke.xml
File docs/topics/impala_revoke.xml:

PS7, Line 115: access to a Kudu table is <q>all or nothing</q>.
"only table-level permissions are enforced in Kudu tables. Column-level 
permissions as well as permissions on certain operations, such as INSERT, are 
not supported. "


http://gerrit.cloudera.org:8080/#/c/5649/7/docs/topics/impala_tables.xml
File docs/topics/impala_tables.xml:

PS7, Line 293: using the Apache Kudu storage system
"stored in Apache Kudu"


-- 
To view, visit http://gerrit.cloudera.org:8080/5649
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I76dcb948dab08532fe41326b22ef78d73282db2c
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell <jruss...@cloudera.com>
Gerrit-Reviewer: Ambreen Kazi <ambreen.k...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jdcry...@apache.org>
Gerrit-Reviewer: John Russell <jruss...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes

Reply via email to