[Impala-ASF-CR] IMPALA-4029: Reduce memory requirements for storing file metadata

2017-04-10 Thread Dimitris Tsirogiannis (Code Review)
Hello Bharath Vissapragada,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6406

to look at the new patch set (#4).

Change subject: IMPALA-4029: Reduce memory requirements for storing file 
metadata
..

IMPALA-4029: Reduce memory requirements for storing file metadata

This commit reduces the memory required for storing file and block
metadata in the catalog and the impalad nodes by using the FlatBuffers
serialization library.

Testing:
Passed an exhaustive test run.

Benchmark:
The memory required for storing the metadata of an HDFS table with 250K
files is reduced by 2.5X.
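
As a rough, hypothetical sketch of the idea behind the change (not Impala's
actual code, and not the FlatBuffers API itself): keep per-file metadata in
one packed byte buffer and read fields through a thin accessor, instead of
materializing a separate Java object per file. A FlatBuffers-generated class
plays a similar role over its backing buffer, with a schema-driven layout
instead of the fixed offsets used here.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical packed layout: one 16-byte record per file:
//   bytes 0..7 = file length, bytes 8..15 = modification time.
final class PackedFileDescs {
  private static final int RECORD_SIZE = 16;
  private final ByteBuffer buf;

  PackedFileDescs(ByteBuffer buf) { this.buf = buf.order(ByteOrder.LITTLE_ENDIAN); }

  int numFiles() { return buf.limit() / RECORD_SIZE; }
  long lengthOf(int i) { return buf.getLong(i * RECORD_SIZE); }
  long mtimeOf(int i) { return buf.getLong(i * RECORD_SIZE + 8); }

  // Writer side: append all records into a single contiguous buffer.
  static ByteBuffer pack(long[] lengths, long[] mtimes) {
    ByteBuffer out = ByteBuffer.allocate(lengths.length * RECORD_SIZE)
        .order(ByteOrder.LITTLE_ENDIAN);
    for (int i = 0; i < lengths.length; ++i) {
      out.putLong(lengths[i]).putLong(mtimes[i]);
    }
    out.flip();
    return out;
  }
}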

Change-Id: I483d3cadc9d459f71a310c35a130d073597b0983
---
M CMakeLists.txt
A common/fbs/CMakeLists.txt
A common/fbs/CatalogObjects.fbs
M common/thrift/CatalogObjects.thrift
M fe/CMakeLists.txt
M fe/pom.xml
M fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java
M fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/catalog/Table.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/catalog/CatalogObjectToFromThriftTest.java
13 files changed, 556 insertions(+), 313 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/06/6406/4
-- 
To view, visit http://gerrit.cloudera.org:8080/6406
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I483d3cadc9d459f71a310c35a130d073597b0983
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4029: Reduce memory requirements for storing file metadata

2017-04-10 Thread Dimitris Tsirogiannis (Code Review)
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-4029: Reduce memory requirements for storing file 
metadata
..


Patch Set 3:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/6406/3/common/fbs/CatalogObjects.fbs
File common/fbs/CatalogObjects.fbs:

Line 36:   offset: long = 0 (id: 0);
> Why a default value? Seems potentially dangerous.
Done


Line 40:   length: long = -1 (id: 1);
> Seems redundant, why keep it?
You mean why not compute it on the fly by iterating over the file blocks?


Line 65:   compression: FbCompression (id: 3);
> Will FlatBuffers add padding to align members? Ideally, we'd optimize for s
Hm, I'm not sure I understand the ask here. Should I try to merge this with
some other field, or do you mean something else?


http://gerrit.cloudera.org:8080/#/c/6406/3/common/thrift/CatalogObjects.thrift
File common/thrift/CatalogObjects.thrift:

Line 218:   1: required binary file_desc_data
> Leave a todo to put this in a KRPC sidecar :) No need to pay serialization 
Done


Line 218:   1: required binary file_desc_data
> Nice
Done


Line 296:   10: optional list file_name_prefixes
> Is this required for the move to flat buffers? I think we should consider a
No, this is definitely not needed now. Actually, I feel the current 
implementation is too flaky and not easy to explain. I'll remove it for now and 
leave a TODO. We can revisit the file name compression later.


http://gerrit.cloudera.org:8080/#/c/6406/3/fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java
File fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java:

Line 78:   public byte toFbCompression() {
> toFlatBuffer()?
Done


http://gerrit.cloudera.org:8080/#/c/6406/3/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

Line 126: locations = fileSystem.getFileBlockLocations(fileStatus, 0, 
fileStatus.getLen());
> I think it would be better for now to keep all the code that fetches inform
Done


Line 227: loc.getNames().length);
> Another case where we might be calling out to the NN.
I don't think any of these statements is making a call to NN. Which one are you 
talking about?


Line 321: private static int REPLICA_HOST_CACHE_MASK = 0x8000;
> Less code and more readable to have one var with (1 << 15). The places that
Done
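
As a rough illustration of the pattern being discussed (not Impala's actual
fields or constants), a small Java sketch that packs a replica's host index
and an "is HDFS-cached" flag into one 16-bit value, with the flag kept in
bit 15, i.e. (1 << 15):

// Hypothetical encoding: low 15 bits = host index, bit 15 = "replica is cached".
final class ReplicaEncoding {
  private static final int CACHE_FLAG = 1 << 15;              // 0x8000
  private static final int HOST_INDEX_MASK = CACHE_FLAG - 1;  // 0x7fff

  static short encode(int hostIdx, boolean isCached) {
    if (hostIdx > HOST_INDEX_MASK) throw new IllegalArgumentException("host index too large");
    return (short) (hostIdx | (isCached ? CACHE_FLAG : 0));
  }

  static int hostIndex(short encoded) { return encoded & HOST_INDEX_MASK; }
  static boolean isCached(short encoded) { return (encoded & CACHE_FLAG) != 0; }
}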


http://gerrit.cloudera.org:8080/#/c/6406/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 189:   // List of file name prefixes.
> Sorted list of file name prefixes?
This was removed.


Line 309: Pair compressedFileName =
> How predictable is the compression behavior? Do we iterate over the files i
Removed.


Line 340:* the suffix is equal to 'fileName'.
> Should mention that this function expects the fileNames to be sorted.
Removed.


Line 346: String commonPrefix = Strings.commonPrefix(prevFileName, 
fileName);
> I'm wondering if it's better to first check the last common prefix instead 
Removed.


-- 
To view, visit http://gerrit.cloudera.org:8080/6406
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I483d3cadc9d459f71a310c35a130d073597b0983
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Henry Robinson 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4883: Union Codegen

2017-04-10 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has uploaded a new patch set (#10).

Change subject: IMPALA-4883: Union Codegen
..

IMPALA-4883: Union Codegen

For each non-passthrough child of the Union node, codegen the loop that
does per row tuple materialization.

Testing:
Ran test_queries.py locally in exhaustive mode.

Benchmark:
Ran a benchmark on a local 10 GB TPCDS dataset with an unpartitioned
store_sales table.

SELECT
  COUNT(c),
  COUNT(ss_customer_sk),
  COUNT(ss_cdemo_sk),
  COUNT(ss_hdemo_sk),
  COUNT(ss_addr_sk),
  COUNT(ss_store_sk),
  COUNT(ss_promo_sk),
  COUNT(ss_ticket_number),
  COUNT(ss_quantity),
  COUNT(ss_wholesale_cost),
  COUNT(ss_list_price),
  COUNT(ss_sales_price),
  COUNT(ss_ext_discount_amt),
  COUNT(ss_ext_sales_price),
  COUNT(ss_ext_wholesale_cost),
  COUNT(ss_ext_list_price),
  COUNT(ss_ext_tax),
  COUNT(ss_coupon_amt),
  COUNT(ss_net_paid),
  COUNT(ss_net_paid_inc_tax),
  COUNT(ss_net_profit),
  COUNT(ss_sold_date_sk)
FROM (
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
) t

Before: 39s704ms
Operator  #Hosts   Avg Time   Max Time#Rows  Est. #Rows   Peak Mem  
Est. Peak Mem  Detail
--
13:AGGREGATE   1  194.504us  194.504us1   1   28.00 KB  
  -1.00 B  FINALIZE
12:EXCHANGE1   17.284us   17.284us3   1  0  
  -1.00 B  UNPARTITIONED
11:AGGREGATE   32s202ms2s934ms3   1  115.00 KB  
 10.00 MB
00:UNION   3   32s514ms   34s926ms  288.01M 288.01M3.08 MB  
0
|--02:SCAN HDFS3  158.373ms  216.085ms   28.80M  28.80M  489.71 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--03:SCAN HDFS3  167.002ms  171.738ms   28.80M  28.80M  489.74 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--04:SCAN HDFS3  125.331ms  145.496ms   28.80M  28.80M  489.57 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--05:SCAN HDFS3  148.478ms  194.311ms   28.80M  28.80M  489.69 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--06:SCAN HDFS3  143.995ms  162.781ms   28.80M  28.80M  489.57 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--07:SCAN HDFS3  169.731ms  250.201ms   28.80M  28.80M  489.58 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--08:SCAN HDFS3  164.110ms  254.374ms   28.80M  28.80M  489.61 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--09:SCAN HDFS3  135.631ms  162.117ms   28.80M  28.80M  489.63 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--10:SCAN HDFS3  138.736ms  167.778ms   28.80M  28.80M  489.67 MB  
  1.88 GB  tpcds_10_parquet.store_sales
01:SCAN HDFS   3  202.015ms  248.728ms   28.80M  28.80M  489.68 MB  
  1.88 GB  tpcds_10_parquet.store_sales

After: 20s177ms
Operator  #Hosts   Avg Time   Max Time#Rows  Est. #Rows   Peak Mem  
Est. Peak Mem  Detail
--
13:AGGREGATE   1  174.617us  174.617us1   1   28.00 KB  
  -1.00 B  FINALIZE
12:EXCHANGE1   16.693us   16.693us3   1  0  
  -1.00 B  UNPARTITIONED
11:AGGREGATE   32s830ms3s615ms3   1  115.00 KB  
 10.00 MB
00:UNION   34s296ms5s258ms  288.01M 288.01M3.08 MB  
0
|--02:SCAN HDFS31s212ms1s340ms   28.80M  28.80M  488.82 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--03:SCAN HDFS31s387ms1s570ms   28.80M  28.80M  489.37 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--04:SCAN HDFS31s224ms1s347ms   28.80M  28.80M  487.22 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--05:SCAN HDFS31s245ms1s

[Impala-ASF-CR] IMPALA-4883: Union Codegen

2017-04-10 Thread Taras Bobrovytsky (Code Review)
Hello Dan Hecht, Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6459

to look at the new patch set (#10).

Change subject: IMPALA-4883: Union Codegen
..

IMPALA-4883: Union Codegen

For each non-passthrough child of the Union node, codegen the loop that
does per row tuple materialization.

Testing:
Ran test_queries.py locally in exhaustive mode.

Benchmark:
Ran a benchmark on a local 10 GB TPCDS dataset with an unpartitioned
store_sales table.

SELECT
  COUNT(c),
  COUNT(ss_customer_sk),
  COUNT(ss_cdemo_sk),
  COUNT(ss_hdemo_sk),
  COUNT(ss_addr_sk),
  COUNT(ss_store_sk),
  COUNT(ss_promo_sk),
  COUNT(ss_ticket_number),
  COUNT(ss_quantity),
  COUNT(ss_wholesale_cost),
  COUNT(ss_list_price),
  COUNT(ss_sales_price),
  COUNT(ss_ext_discount_amt),
  COUNT(ss_ext_sales_price),
  COUNT(ss_ext_wholesale_cost),
  COUNT(ss_ext_list_price),
  COUNT(ss_ext_tax),
  COUNT(ss_coupon_amt),
  COUNT(ss_net_paid),
  COUNT(ss_net_paid_inc_tax),
  COUNT(ss_net_profit),
  COUNT(ss_sold_date_sk)
FROM (
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
) t

Before: 39s704ms
Operator  #Hosts   Avg Time   Max Time#Rows  Est. #Rows   Peak Mem  
Est. Peak Mem  Detail
--
13:AGGREGATE   1  194.504us  194.504us1   1   28.00 KB  
  -1.00 B  FINALIZE
12:EXCHANGE1   17.284us   17.284us3   1  0  
  -1.00 B  UNPARTITIONED
11:AGGREGATE   32s202ms2s934ms3   1  115.00 KB  
 10.00 MB
00:UNION   3   32s514ms   34s926ms  288.01M 288.01M3.08 MB  
0
|--02:SCAN HDFS3  158.373ms  216.085ms   28.80M  28.80M  489.71 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--03:SCAN HDFS3  167.002ms  171.738ms   28.80M  28.80M  489.74 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--04:SCAN HDFS3  125.331ms  145.496ms   28.80M  28.80M  489.57 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--05:SCAN HDFS3  148.478ms  194.311ms   28.80M  28.80M  489.69 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--06:SCAN HDFS3  143.995ms  162.781ms   28.80M  28.80M  489.57 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--07:SCAN HDFS3  169.731ms  250.201ms   28.80M  28.80M  489.58 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--08:SCAN HDFS3  164.110ms  254.374ms   28.80M  28.80M  489.61 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--09:SCAN HDFS3  135.631ms  162.117ms   28.80M  28.80M  489.63 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--10:SCAN HDFS3  138.736ms  167.778ms   28.80M  28.80M  489.67 MB  
  1.88 GB  tpcds_10_parquet.store_sales
01:SCAN HDFS   3  202.015ms  248.728ms   28.80M  28.80M  489.68 MB  
  1.88 GB  tpcds_10_parquet.store_sales

After: 20s177ms
Operator  #Hosts   Avg Time   Max Time#Rows  Est. #Rows   Peak Mem  
Est. Peak Mem  Detail
--
13:AGGREGATE   1  174.617us  174.617us1   1   28.00 KB  
  -1.00 B  FINALIZE
12:EXCHANGE1   16.693us   16.693us3   1  0  
  -1.00 B  UNPARTITIONED
11:AGGREGATE   32s830ms3s615ms3   1  115.00 KB  
 10.00 MB
00:UNION   34s296ms5s258ms  288.01M 288.01M3.08 MB  
0
|--02:SCAN HDFS31s212ms1s340ms   28.80M  28.80M  488.82 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--03:SCAN HDFS31s387ms1s570ms   28.80M  28.80M  489.37 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--04:SCAN HDFS31s224ms1s347ms   28.80M   

[Impala-ASF-CR] IMPALA-3040: Fix test caching ddl test

2017-04-10 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change.

Change subject: IMPALA-3040: Fix test_caching_ddl test
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6603/1/tests/query_test/test_hdfs_caching.py
File tests/query_test/test_hdfs_caching.py:

Line 210: while time.time() - begin_time < hdfs_caching_timeout_in_sec:
As discussed, might be more general to apply a fix to get_num_cache_requests() 
so that we keep retrying until the number stabilizes.
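
For illustration only, a rough sketch (in Java, though the test itself is
Python; the helper name and timeouts are hypothetical) of the
retry-until-stable idea suggested above: keep polling until two consecutive
reads agree, or give up after a deadline.

import java.util.function.LongSupplier;

final class PollUtil {
  // Polls 'reader' until it returns the same value twice in a row, or the
  // timeout expires; returns the last observed value either way.
  static long waitUntilStable(LongSupplier reader, long timeoutMs, long intervalMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    long prev = reader.getAsLong();
    while (System.currentTimeMillis() < deadline) {
      Thread.sleep(intervalMs);
      long cur = reader.getAsLong();
      if (cur == prev) return cur;
      prev = cur;
    }
    return prev;
  }
}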


-- 
To view, visit http://gerrit.cloudera.org:8080/6603
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3ec4ba5dfae6e90a2bb76e22c93909b05bd78fa4
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis 
Gerrit-Reviewer: Alex Behm 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4623: Enable file handle cache

2017-04-10 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4623: Enable file handle cache
..


Patch Set 3:

(9 comments)

i don't understand the need for unique_ptrs for the file handles, please see 
comments inline.

http://gerrit.cloudera.org:8080/#/c/6478/3/be/src/runtime/disk-io-mgr-scan-range.cc
File be/src/runtime/disk-io-mgr-scan-range.cc:

Line 286:   if (fs_ != NULL) {
replace w/ nullptr throughout (and in the other files you touch).


http://gerrit.cloudera.org:8080/#/c/6478/3/be/src/runtime/disk-io-mgr.h
File be/src/runtime/disk-io-mgr.h:

Line 205:   /// additionally encapsulates the last modified time of the 
associated file when it was
fix this sentence.


Line 207:   class HdfsCachedFileHandle {
we use this class exclusively to access hdfs data, even if the file handle 
isn't cached. remove 'cached' from the name, it's distracting.


Line 476: /// Opens the file for this range. This function only modifies 
state in this range.
explain param


Line 489: void GetHdfsStatistics(hdfsFile fh);
is this okay to pass by value?

what diskiorequestcontext is this referring to? the comment should make sense 
on its own.


Line 521: /// File handle for local fs (FILE*)
remove (file*)


Line 526: std::unique_ptr hdfs_file_;
requestrange already contains an fs_. what of  the functionality provided by 
hdfscachedfilehandle do you actually need here? this class also contains an 
mtime_, so lots of confusing and error-prone duplication.

this encapsulates a file handle out of the cache. but if the cache is used to 
manage all file handles, why do you need to store this with a unique_ptr 
(instead of a raw ptr, and then simply hand that back to the cache when you're 
done)? this looks like it contradicts the responsibility of the cache for 
managing file handles.


Line 774:   /// and return an instance of HdfsCachedFileHandle. In case of an 
error returns NULL.
explain all parameters


Line 784:   std::unique_ptr& fid);
no ref parameters


-- 
To view, visit http://gerrit.cloudera.org:8080/6478
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3748: minimum buffer requirements in planner

2017-04-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3748: minimum buffer requirements in planner
..


Patch Set 13:

(24 comments)

Addressed most of the comments. I need to investigate why we're getting 
incorrect values for some of the parallel plans.

http://gerrit.cloudera.org:8080/#/c/5847/13/be/src/scheduling/query-schedule.cc
File be/src/scheduling/query-schedule.cc:

Line 212
> do we still need this?
It wasn't reachable as a result of other changes since we now compute estimates 
for all queries, rather than omitting estimates in a fairly arbitrary way.


http://gerrit.cloudera.org:8080/#/c/5847/13/be/src/service/query-exec-state.cc
File be/src/service/query-exec-state.cc:

Line 396:   // Add info strings consumed by CM: Estimated mem and tables 
missing stats.
> update
It wasn't clear that it made sense to include the reservation in this comment,
since we don't have anything consuming it.


Line 402:   if (query_exec_request.__isset.per_host_min_reservation_bytes) {
> can we ditch the _bytes everywhere and declare it's in bytes?
Done. It's already documented in the thrift file that it's bytes.


http://gerrit.cloudera.org:8080/#/c/5847/13/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
File fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java:

Line 252: long perInstanceMinBufferBytes = 2 * SPILLABLE_BUFFER_BYTES;
> is this covered by a test case?
We have a planner test in resource-requirements.test.


http://gerrit.cloudera.org:8080/#/c/5847/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 103:   private final static int MAX_THREAD_TOKENS_PER_CORE = 3;
> no idea where that's from, but it seems wrong.
It's the default value of the --num_threads_per_core command-line option, which 
is multiplied by the number of cores to get the number of thread tokens. Agree 
there are some problems with this approach so thought it was best to leave 
as-is and document it.

Added to the comment to document the relationship between this and the flag.


http://gerrit.cloudera.org:8080/#/c/5847/13/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
File fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java:

Line 80: long numPartitions =
> rename to perinstance-
Done


http://gerrit.cloudera.org:8080/#/c/5847/13/fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
File fe/src/main/java/org/apache/impala/planner/KuduScanNode.java:

Line 268: // TODO: add an estimate
> remove todo (we already have a todo to remove the estimates)
Done


http://gerrit.cloudera.org:8080/#/c/5847/13/fe/src/main/java/org/apache/impala/planner/KuduTableSink.java
File fe/src/main/java/org/apache/impala/planner/KuduTableSink.java:

Line 60: if (explainLevel.ordinal() >= TExplainLevel.EXTENDED.ordinal()) {
> this if block is in every data sink's explainstring. move it into a functio
Refactored to use the same pattern as PlanNode, where 
DataSink.getExplainString() calls a protected method to get the 
subclass-specific explain string.

We also don't need the number of hosts/instances, since it's already reported 
in the fragment header, so I removed that.
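
Roughly the pattern described above, sketched with hypothetical names (this
is not Impala's actual DataSink API): the base class owns the shared
explain-level check and delegates the sink-specific detail to a protected
hook, so the ordinal comparison lives in exactly one place.

// Hedged sketch of a template-method style explain string.
abstract class Sink {
  enum ExplainLevel { STANDARD, EXTENDED }

  // Shared entry point: prefix handling and the level check live here.
  final String getExplainString(String prefix, ExplainLevel level) {
    StringBuilder sb = new StringBuilder(prefix).append(getName()).append("\n");
    if (level.ordinal() >= ExplainLevel.EXTENDED.ordinal()) {
      sb.append(prefix).append(getNodeExplainDetail()).append("\n");
    }
    return sb.toString();
  }

  protected abstract String getName();
  // Subclass-specific detail, shown only at EXTENDED or higher.
  protected abstract String getNodeExplainDetail();
}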


http://gerrit.cloudera.org:8080/#/c/5847/13/fe/src/main/java/org/apache/impala/planner/PipelinedPlanNodeSet.java
File fe/src/main/java/org/apache/impala/planner/PipelinedPlanNodeSet.java:

Line 85: long minBufferBytes = 0L;
> minReservationBytes
Done


Line 93:   node.computeResourceProfile(queryOptions);
> rather than calling this here, call it after creating plan fragments.
Moved to PlanFragment.finalize() - that seemed like a logical place.


Line 126:   sink.computeResourceProfile(queryOptions);
> same here
Done


http://gerrit.cloudera.org:8080/#/c/5847/13/fe/src/main/java/org/apache/impala/planner/PlanFragment.java
File fe/src/main/java/org/apache/impala/planner/PlanFragment.java:

Line 202:   public int getNumNodes() {
> do we still need this?
It's still used in a couple of places - e.g. costing shuffle vs broadcast.


Line 228: * Estimates the per-fragment-instance number of distinct values 
of exprs based on the
> not super-important, but: should it be "per-fragment instance number"?
Reworded the sentence to avoid weird hyphenation.


Line 297:   str.append(String.format("%s:PLAN FRAGMENT [%s]", 
fragmentId_.toString(),
> also call getfragmentheaderstring() here
Done. I guess the format below wasn't consistent with the pre-existing format, 
so I fixed that too.


http://gerrit.cloudera.org:8080/#/c/5847/13/fe/src/main/java/org/apache/impala/planner/PlanNode.java
File fe/src/main/java/org/apache/impala/planner/PlanNode.java:

Line 276:   TQueryOptions queryOptions, TExplainLevel detailLevel) {
> tqueryoptions already contains the explain level. either pass in mt_dop exp
Passing in detailLevel seems to be a delibe

[Impala-ASF-CR] IMPALA-3748: minimum buffer requirements in planner

2017-04-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new patch set (#14).

Change subject: IMPALA-3748: minimum buffer requirements in planner
..

IMPALA-3748: minimum buffer requirements in planner

Compute the minimum buffer requirement for spilling nodes and
per-host estimates for the entire plan tree.

This builds on top of the existing resource estimation code, which
computes the sets of plan nodes that can execute concurrently. This is
cleaned up so that the process of producing resource requirements is
clearer. It also removes the unused VCore estimates.

Fixes various bugs and other issues:
* computeCosts() was not called for unpartitioned fragments, so
  the per-operator memory estimate was not visible.
* Nested loop join was not treated as a blocking join.
* The TODO comment about union was misleading.
* Fix the computation for mt_dop > 1 by distinguishing per-instance and
  per-host estimates.
* Always generate an estimate instead of unpredictably returning
  -1/"unavailable" in many circumstances - there was little rhyme or
  reason to when this happened.
* Remove the special "trivial plan" estimates. With the rest of the
  cleanup we generate estimates <= 10MB for those trivial plans through
  the normal code path.

I left one bug (IMPALA-4862) unfixed because it is subtle, will affect
estimates for many plans and will be easier to review once we have the
test infra in place.

Testing:
Added basic planner tests for resource requirements in both the MT and
non-MT cases.

Re-enabled the explain_level tests, which appear to be the only
coverage for many of these estimates. Removed the complex and
brittle test cases and replaced them with a couple of much simpler
end-to-end tests.

Change-Id: I1e358182bcf2bc5fe5c73883eb97878735b12d37
---
M be/src/exec/analytic-eval-node.cc
M be/src/exec/partitioned-aggregation-node.h
M be/src/exec/partitioned-hash-join-builder.h
M be/src/runtime/sorter.cc
M be/src/scheduling/query-schedule.cc
M be/src/service/query-exec-state.cc
M common/thrift/Frontend.thrift
M fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/common/PrintUtils.java
M fe/src/main/java/org/apache/impala/planner/AggregationNode.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DataSink.java
M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java
M fe/src/main/java/org/apache/impala/planner/DataStreamSink.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/EmptySetNode.java
M fe/src/main/java/org/apache/impala/planner/ExchangeNode.java
M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java
M fe/src/main/java/org/apache/impala/planner/HBaseTableSink.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/JoinBuildSink.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduTableSink.java
M fe/src/main/java/org/apache/impala/planner/NestedLoopJoinNode.java
M fe/src/main/java/org/apache/impala/planner/PipelinedPlanNodeSet.java
M fe/src/main/java/org/apache/impala/planner/PlanFragment.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/main/java/org/apache/impala/planner/PlanRootSink.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
A fe/src/main/java/org/apache/impala/planner/ResourceProfile.java
M fe/src/main/java/org/apache/impala/planner/SelectNode.java
M fe/src/main/java/org/apache/impala/planner/SingularRowSrcNode.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M fe/src/main/java/org/apache/impala/planner/SubplanNode.java
M fe/src/main/java/org/apache/impala/planner/UnionNode.java
M fe/src/main/java/org/apache/impala/planner/UnnestNode.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/kudu-selectivity.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
A 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level0.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level1.test
M testdata/workloads/functional-query/queries/QueryTest/ex

[Impala-ASF-CR] IMPALA-5073: Use mmap() instead of malloc() for buffer pool

2017-04-10 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5073: Use mmap() instead of malloc() for buffer pool
..


Patch Set 7:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6474/7/be/src/runtime/bufferpool/system-allocator.cc
File be/src/runtime/bufferpool/system-allocator.cc:

Line 81:   munmap(mem, fixup);
does this path occur normally? how expensive are these 2 extra munmap()s? hmm, 
I guess linux doesn't support something like MAP_ALIGNED_SUPER.


http://gerrit.cloudera.org:8080/#/c/6474/7/be/src/util/memory-metrics.h
File be/src/util/memory-metrics.h:

PS7, Line 39: in use
what does that mean? i.e. does it include buffers in the free lists and
TCMalloc free lists?


PS7, Line 158:  Total amount of buffer memory currently allocated.
it's hard to tell from this comment (and naming) if this is the total allocated 
by clients, amount of buffer reservation allocated (in reservation tracker 
terms), or the total system memory allocated (even if in free list).


PS7, Line 159: buffer memory reserved for buffers
not sure what that means. isn't all "buffer memory" reserved for buffers?


-- 
To view, visit http://gerrit.cloudera.org:8080/6474
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifbc748f74adcbbdcfa45f3ec7df98284925acbd6
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5124: add tests for scratch read errors

2017-04-10 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5124: add tests for scratch read errors
..


Patch Set 3: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6562
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f2b93588dd47f70a4863ecad3b5556c3634ccb4
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Dan Hecht 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4883: Union Codegen

2017-04-10 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-4883: Union Codegen
..


Patch Set 8:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6459/8/be/src/codegen/gen_ir_descriptions.py
File be/src/codegen/gen_ir_descriptions.py:

PS8, Line 183:   ["UNION_MATERIALIZE_BATCH",
 :   "_ZN6impala9UnionNode16MaterializeBatchEPNS_8RowBatchEPPh"],
> nit: please consider putting it in a different entry so as not to break up 
Done


http://gerrit.cloudera.org:8080/#/c/6459/8/be/src/exec/union-node.cc
File be/src/exec/union-node.cc:

PS8, Line 223: codegend_union_materialize_batch_fns_.data()
> Why not codegend_union_materialize_batch_fns_[child_idx_] ? Same below.
Done


Line 290:   int num_rows_before = row_batch->num_rows();
> So, in that case, is it true that the rows in the non-empty batch are alrea
Yes, if a batch is not empty, the rows in it should already be counted towards 
num_rows_returned_ at this point. If, on line 295, some rows get added to the
batch, those are not added to num_rows_returned_ yet. num_rows_returned gets
updated on line 306.

In other words, num_rows_returned gets updated only in one place: line 306.


PS8, Line 304: num_rows_added
> I probably misunderstood something here. Can you please explain why we need
Turns out this formula was wrong. I added a test case (which fails on Patch
Set 8 and passes on the latest patch set).


-- 
To view, visit http://gerrit.cloudera.org:8080/6459
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib4107d27582ff5416172810364a6e76d3d93c439
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4883: Union Codegen

2017-04-10 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has uploaded a new patch set (#9).

Change subject: IMPALA-4883: Union Codegen
..

IMPALA-4883: Union Codegen

For each non-passthrough child of the Union node, codegen the loop that
does per row tuple materialization.

Testing:
Ran test_queries.py locally in exhaustive mode.

Benchmark:
Ran a benchmark on a local 10 GB TPCDS dataset with an unpartitioned
store_sales table.

SELECT
  COUNT(c),
  COUNT(ss_customer_sk),
  COUNT(ss_cdemo_sk),
  COUNT(ss_hdemo_sk),
  COUNT(ss_addr_sk),
  COUNT(ss_store_sk),
  COUNT(ss_promo_sk),
  COUNT(ss_ticket_number),
  COUNT(ss_quantity),
  COUNT(ss_wholesale_cost),
  COUNT(ss_list_price),
  COUNT(ss_sales_price),
  COUNT(ss_ext_discount_amt),
  COUNT(ss_ext_sales_price),
  COUNT(ss_ext_wholesale_cost),
  COUNT(ss_ext_list_price),
  COUNT(ss_ext_tax),
  COUNT(ss_coupon_amt),
  COUNT(ss_net_paid),
  COUNT(ss_net_paid_inc_tax),
  COUNT(ss_net_profit),
  COUNT(ss_sold_date_sk)
FROM (
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
) t

Before: 39s704ms
Operator  #Hosts   Avg Time   Max Time#Rows  Est. #Rows   Peak Mem  
Est. Peak Mem  Detail
--
13:AGGREGATE   1  194.504us  194.504us1   1   28.00 KB  
  -1.00 B  FINALIZE
12:EXCHANGE1   17.284us   17.284us3   1  0  
  -1.00 B  UNPARTITIONED
11:AGGREGATE   32s202ms2s934ms3   1  115.00 KB  
 10.00 MB
00:UNION   3   32s514ms   34s926ms  288.01M 288.01M3.08 MB  
0
|--02:SCAN HDFS3  158.373ms  216.085ms   28.80M  28.80M  489.71 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--03:SCAN HDFS3  167.002ms  171.738ms   28.80M  28.80M  489.74 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--04:SCAN HDFS3  125.331ms  145.496ms   28.80M  28.80M  489.57 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--05:SCAN HDFS3  148.478ms  194.311ms   28.80M  28.80M  489.69 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--06:SCAN HDFS3  143.995ms  162.781ms   28.80M  28.80M  489.57 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--07:SCAN HDFS3  169.731ms  250.201ms   28.80M  28.80M  489.58 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--08:SCAN HDFS3  164.110ms  254.374ms   28.80M  28.80M  489.61 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--09:SCAN HDFS3  135.631ms  162.117ms   28.80M  28.80M  489.63 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--10:SCAN HDFS3  138.736ms  167.778ms   28.80M  28.80M  489.67 MB  
  1.88 GB  tpcds_10_parquet.store_sales
01:SCAN HDFS   3  202.015ms  248.728ms   28.80M  28.80M  489.68 MB  
  1.88 GB  tpcds_10_parquet.store_sales

After: 20s177ms
Operator  #Hosts   Avg Time   Max Time#Rows  Est. #Rows   Peak Mem  
Est. Peak Mem  Detail
--
13:AGGREGATE   1  174.617us  174.617us1   1   28.00 KB  
  -1.00 B  FINALIZE
12:EXCHANGE1   16.693us   16.693us3   1  0  
  -1.00 B  UNPARTITIONED
11:AGGREGATE   32s830ms3s615ms3   1  115.00 KB  
 10.00 MB
00:UNION   34s296ms5s258ms  288.01M 288.01M3.08 MB  
0
|--02:SCAN HDFS31s212ms1s340ms   28.80M  28.80M  488.82 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--03:SCAN HDFS31s387ms1s570ms   28.80M  28.80M  489.37 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--04:SCAN HDFS31s224ms1s347ms   28.80M  28.80M  487.22 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--05:SCAN HDFS31s245ms1s3

[Impala-ASF-CR] IMPALA-4883: Union Codegen

2017-04-10 Thread Taras Bobrovytsky (Code Review)
Hello Dan Hecht, Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6459

to look at the new patch set (#9).

Change subject: IMPALA-4883: Union Codegen
..

IMPALA-4883: Union Codegen

For each non-passthrough child of the Union node, codegen the loop that
does per row tuple materialization.

Testing:
Ran test_queries.py locally in exhaustive mode.

Benchmark:
Ran a benchmark on a local 10 GB TPCDS dataset with an unpartitioned
store_sales table.

SELECT
  COUNT(c),
  COUNT(ss_customer_sk),
  COUNT(ss_cdemo_sk),
  COUNT(ss_hdemo_sk),
  COUNT(ss_addr_sk),
  COUNT(ss_store_sk),
  COUNT(ss_promo_sk),
  COUNT(ss_ticket_number),
  COUNT(ss_quantity),
  COUNT(ss_wholesale_cost),
  COUNT(ss_list_price),
  COUNT(ss_sales_price),
  COUNT(ss_ext_discount_amt),
  COUNT(ss_ext_sales_price),
  COUNT(ss_ext_wholesale_cost),
  COUNT(ss_ext_list_price),
  COUNT(ss_ext_tax),
  COUNT(ss_coupon_amt),
  COUNT(ss_net_paid),
  COUNT(ss_net_paid_inc_tax),
  COUNT(ss_net_profit),
  COUNT(ss_sold_date_sk)
FROM (
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
) t

Before: 39s704ms
Operator  #Hosts   Avg Time   Max Time#Rows  Est. #Rows   Peak Mem  
Est. Peak Mem  Detail
--
13:AGGREGATE   1  194.504us  194.504us1   1   28.00 KB  
  -1.00 B  FINALIZE
12:EXCHANGE1   17.284us   17.284us3   1  0  
  -1.00 B  UNPARTITIONED
11:AGGREGATE   32s202ms2s934ms3   1  115.00 KB  
 10.00 MB
00:UNION   3   32s514ms   34s926ms  288.01M 288.01M3.08 MB  
0
|--02:SCAN HDFS3  158.373ms  216.085ms   28.80M  28.80M  489.71 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--03:SCAN HDFS3  167.002ms  171.738ms   28.80M  28.80M  489.74 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--04:SCAN HDFS3  125.331ms  145.496ms   28.80M  28.80M  489.57 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--05:SCAN HDFS3  148.478ms  194.311ms   28.80M  28.80M  489.69 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--06:SCAN HDFS3  143.995ms  162.781ms   28.80M  28.80M  489.57 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--07:SCAN HDFS3  169.731ms  250.201ms   28.80M  28.80M  489.58 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--08:SCAN HDFS3  164.110ms  254.374ms   28.80M  28.80M  489.61 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--09:SCAN HDFS3  135.631ms  162.117ms   28.80M  28.80M  489.63 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--10:SCAN HDFS3  138.736ms  167.778ms   28.80M  28.80M  489.67 MB  
  1.88 GB  tpcds_10_parquet.store_sales
01:SCAN HDFS   3  202.015ms  248.728ms   28.80M  28.80M  489.68 MB  
  1.88 GB  tpcds_10_parquet.store_sales

After: 20s177ms
Operator  #Hosts   Avg Time   Max Time#Rows  Est. #Rows   Peak Mem  
Est. Peak Mem  Detail
--
13:AGGREGATE   1  174.617us  174.617us1   1   28.00 KB  
  -1.00 B  FINALIZE
12:EXCHANGE1   16.693us   16.693us3   1  0  
  -1.00 B  UNPARTITIONED
11:AGGREGATE   32s830ms3s615ms3   1  115.00 KB  
 10.00 MB
00:UNION   34s296ms5s258ms  288.01M 288.01M3.08 MB  
0
|--02:SCAN HDFS31s212ms1s340ms   28.80M  28.80M  488.82 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--03:SCAN HDFS31s387ms1s570ms   28.80M  28.80M  489.37 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--04:SCAN HDFS31s224ms1s347ms   28.80M

[Impala-ASF-CR] IMPALA-4883: Union Codegen

2017-04-10 Thread Taras Bobrovytsky (Code Review)
Hello Dan Hecht, Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6459

to look at the new patch set (#9).

Change subject: IMPALA-4883: Union Codegen
..

IMPALA-4883: Union Codegen

For each non-passthrough child of the Union node, codegen the loop that
does per row tuple materialization.

Testing:
Ran test_queries.py locally in exhaustive mode.

Benchmark:
Ran a benchmark on a local 10 GB TPCDS dataset with an unpartitioned
store_sales table.

SELECT
  COUNT(c),
  COUNT(ss_customer_sk),
  COUNT(ss_cdemo_sk),
  COUNT(ss_hdemo_sk),
  COUNT(ss_addr_sk),
  COUNT(ss_store_sk),
  COUNT(ss_promo_sk),
  COUNT(ss_ticket_number),
  COUNT(ss_quantity),
  COUNT(ss_wholesale_cost),
  COUNT(ss_list_price),
  COUNT(ss_sales_price),
  COUNT(ss_ext_discount_amt),
  COUNT(ss_ext_sales_price),
  COUNT(ss_ext_wholesale_cost),
  COUNT(ss_ext_list_price),
  COUNT(ss_ext_tax),
  COUNT(ss_coupon_amt),
  COUNT(ss_net_paid),
  COUNT(ss_net_paid_inc_tax),
  COUNT(ss_net_profit),
  COUNT(ss_sold_date_sk)
FROM (
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
) t

Before: 39s704ms
Operator  #Hosts   Avg Time   Max Time#Rows  Est. #Rows   Peak Mem  
Est. Peak Mem  Detail
--
13:AGGREGATE   1  194.504us  194.504us1   1   28.00 KB  
  -1.00 B  FINALIZE
12:EXCHANGE1   17.284us   17.284us3   1  0  
  -1.00 B  UNPARTITIONED
11:AGGREGATE   32s202ms2s934ms3   1  115.00 KB  
 10.00 MB
00:UNION   3   32s514ms   34s926ms  288.01M 288.01M3.08 MB  
0
|--02:SCAN HDFS3  158.373ms  216.085ms   28.80M  28.80M  489.71 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--03:SCAN HDFS3  167.002ms  171.738ms   28.80M  28.80M  489.74 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--04:SCAN HDFS3  125.331ms  145.496ms   28.80M  28.80M  489.57 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--05:SCAN HDFS3  148.478ms  194.311ms   28.80M  28.80M  489.69 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--06:SCAN HDFS3  143.995ms  162.781ms   28.80M  28.80M  489.57 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--07:SCAN HDFS3  169.731ms  250.201ms   28.80M  28.80M  489.58 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--08:SCAN HDFS3  164.110ms  254.374ms   28.80M  28.80M  489.61 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--09:SCAN HDFS3  135.631ms  162.117ms   28.80M  28.80M  489.63 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--10:SCAN HDFS3  138.736ms  167.778ms   28.80M  28.80M  489.67 MB  
  1.88 GB  tpcds_10_parquet.store_sales
01:SCAN HDFS   3  202.015ms  248.728ms   28.80M  28.80M  489.68 MB  
  1.88 GB  tpcds_10_parquet.store_sales

After: 20s177ms
Operator  #Hosts   Avg Time   Max Time#Rows  Est. #Rows   Peak Mem  
Est. Peak Mem  Detail
--
13:AGGREGATE   1  174.617us  174.617us1   1   28.00 KB  
  -1.00 B  FINALIZE
12:EXCHANGE1   16.693us   16.693us3   1  0  
  -1.00 B  UNPARTITIONED
11:AGGREGATE   32s830ms3s615ms3   1  115.00 KB  
 10.00 MB
00:UNION   34s296ms5s258ms  288.01M 288.01M3.08 MB  
0
|--02:SCAN HDFS31s212ms1s340ms   28.80M  28.80M  488.82 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--03:SCAN HDFS31s387ms1s570ms   28.80M  28.80M  489.37 MB  
  1.88 GB  tpcds_10_parquet.store_sales
|--04:SCAN HDFS31s224ms1s347ms   28.80M

[Impala-ASF-CR](asf-site) Add Impala docs from branch master, commit hash 68f32e52bc42bef578330a4fe0edc5b292891eea. This is the last commit made by JRussell in the cleanup project.

2017-04-10 Thread Laurel Hale (Code Review)
Laurel Hale has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6604

Change subject: Add Impala docs from branch master, commit hash 
68f32e52bc42bef578330a4fe0edc5b292891eea. This is the last commit made by 
JRussell in the cleanup project.
..

Add Impala docs from branch master,
commit hash 68f32e52bc42bef578330a4fe0edc5b292891eea.
This is the last commit made by JRussell in the cleanup
project.

Change-Id: Ibbf0818c4a7fe1e251e2f36da75cc7c3dd16dead
---
A docs/build/html/commonltr.css
A docs/build/html/commonrtl.css
A docs/build/html/images/impala_arch.jpeg
A docs/build/html/index.html
A docs/build/html/shared/ImpalaVariables.html
A docs/build/html/shared/impala_common.html
A docs/build/html/topics/impala_abort_on_default_limit_exceeded.html
A docs/build/html/topics/impala_abort_on_error.html
A docs/build/html/topics/impala_admin.html
A docs/build/html/topics/impala_admission.html
A docs/build/html/topics/impala_aggregate_functions.html
A docs/build/html/topics/impala_aliases.html
A docs/build/html/topics/impala_allow_unsupported_formats.html
A docs/build/html/topics/impala_alter_table.html
A docs/build/html/topics/impala_alter_view.html
A docs/build/html/topics/impala_analytic_functions.html
A docs/build/html/topics/impala_appx_count_distinct.html
A docs/build/html/topics/impala_appx_median.html
A docs/build/html/topics/impala_array.html
A docs/build/html/topics/impala_auditing.html
A docs/build/html/topics/impala_authentication.html
A docs/build/html/topics/impala_authorization.html
A docs/build/html/topics/impala_avg.html
A docs/build/html/topics/impala_avro.html
A docs/build/html/topics/impala_batch_size.html
A docs/build/html/topics/impala_bigint.html
A docs/build/html/topics/impala_bit_functions.html
A docs/build/html/topics/impala_boolean.html
A docs/build/html/topics/impala_breakpad.html
A docs/build/html/topics/impala_char.html
A docs/build/html/topics/impala_cluster_sizing.html
A docs/build/html/topics/impala_comments.html
A docs/build/html/topics/impala_complex_types.html
A docs/build/html/topics/impala_components.html
A docs/build/html/topics/impala_compression_codec.html
A docs/build/html/topics/impala_compute_stats.html
A docs/build/html/topics/impala_concepts.html
A docs/build/html/topics/impala_conditional_functions.html
A docs/build/html/topics/impala_config.html
A docs/build/html/topics/impala_config_options.html
A docs/build/html/topics/impala_config_performance.html
A docs/build/html/topics/impala_connecting.html
A docs/build/html/topics/impala_conversion_functions.html
A docs/build/html/topics/impala_count.html
A docs/build/html/topics/impala_create_database.html
A docs/build/html/topics/impala_create_function.html
A docs/build/html/topics/impala_create_role.html
A docs/build/html/topics/impala_create_table.html
A docs/build/html/topics/impala_create_view.html
A docs/build/html/topics/impala_databases.html
A docs/build/html/topics/impala_datatypes.html
A docs/build/html/topics/impala_datetime_functions.html
A docs/build/html/topics/impala_ddl.html
A docs/build/html/topics/impala_debug_action.html
A docs/build/html/topics/impala_decimal.html
A docs/build/html/topics/impala_default_order_by_limit.html
A docs/build/html/topics/impala_delegation.html
A docs/build/html/topics/impala_delete.html
A docs/build/html/topics/impala_describe.html
A docs/build/html/topics/impala_development.html
A docs/build/html/topics/impala_disable_codegen.html
A docs/build/html/topics/impala_disable_row_runtime_filtering.html
A docs/build/html/topics/impala_disable_streaming_preaggregations.html
A docs/build/html/topics/impala_disable_unsafe_spills.html
A docs/build/html/topics/impala_disk_space.html
A docs/build/html/topics/impala_distinct.html
A docs/build/html/topics/impala_dml.html
A docs/build/html/topics/impala_double.html
A docs/build/html/topics/impala_drop_database.html
A docs/build/html/topics/impala_drop_function.html
A docs/build/html/topics/impala_drop_role.html
A docs/build/html/topics/impala_drop_stats.html
A docs/build/html/topics/impala_drop_table.html
A docs/build/html/topics/impala_drop_view.html
A docs/build/html/topics/impala_exec_single_node_rows_threshold.html
A docs/build/html/topics/impala_explain.html
A docs/build/html/topics/impala_explain_level.html
A docs/build/html/topics/impala_explain_plan.html
A docs/build/html/topics/impala_faq.html
A docs/build/html/topics/impala_file_formats.html
A docs/build/html/topics/impala_fixed_issues.html
A docs/build/html/topics/impala_float.html
A docs/build/html/topics/impala_functions.html
A docs/build/html/topics/impala_functions_overview.html
A docs/build/html/topics/impala_grant.html
A docs/build/html/topics/impala_group_by.html
A docs/build/html/topics/impala_group_concat.html
A docs/build/html/topics/impala_hadoop.html
A docs/build/html/topics/impala_having.html
A docs/build/html/topics/impala_hbase.html
A docs/build/html/topics/impala_hbase_cache_blocks

[Impala-ASF-CR] IMPALA-4623: Enable file handle cache

2017-04-10 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4623: Enable file handle cache
..


Patch Set 3:

(11 comments)

initial thoughts on the partitioned map.

http://gerrit.cloudera.org:8080/#/c/6478/3/be/src/util/lru-cache.h
File be/src/util/lru-cache.h:

Line 67:   typedef void (*DeleterFn)(Value&);
why now a ref?


Line 82:   /// Variant of Put with move semantics.
is this needed?


Line 91:   bool Pop(const Key& k, Value& out);
we typically don't use refs for this purpose. why the switch?


Line 102:   static void DummyDeleter(Value& v) {}
make private


Line 115:   typedef std::pair ValueType;
value of what?


Line 139: /// Wrapper around FifoMultimap that hashes the keys into a fixed 
number of partitions
which hash fn?


Line 146: template
you can make the number of partitions another template parameter; that way you 
can use a std::array for cache_partitions_. i also don't see a real need for 
the unique_ptr in there. less indirection should be faster.

it would also be good to put the individual Fifomultimap objects on cache line 
boundaries. alignas might help here.


Line 151:   /// See FifoMultimap
this doesn't explain how you deal with the remainder.

you could also simplify this and stipulate that capacity is per-partition.


http://gerrit.cloudera.org:8080/#/c/6478/3/be/src/util/lru-cache.inline.h
File be/src/util/lru-cache.inline.h:

Line 36:   const typename MapType::value_type& val =
i think if you use 'using' for MapType instead of typedef you can get rid of 
the typename.


Line 37: std::make_pair(k, lru_list_.emplace(lru_list_.end(), k, 
std::move(v)));
is the explicit move needed for the rvalue v?


Line 51: void FifoMultimap::Put(const Key& k, const Value& v) {
there should be a way not to require basically a verbatim copy of this function


-- 
To view, visit http://gerrit.cloudera.org:8080/6478
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3040: Fix test caching ddl test

2017-04-10 Thread Dimitris Tsirogiannis (Code Review)
Dimitris Tsirogiannis has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6603

Change subject: IMPALA-3040: Fix test_caching_ddl test
..

IMPALA-3040: Fix test_caching_ddl test

This commit adds a 30sec timeout to the validation step of the
test_caching_ddl test. This test has been flaky and we suspect a race
between the submission of a cache directive removal and the cache
directives reported by the 'hdfs cacheadmin' utility command.

Change-Id: I3ec4ba5dfae6e90a2bb76e22c93909b05bd78fa4
---
M tests/query_test/test_hdfs_caching.py
1 file changed, 7 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/6603/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6603
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I3ec4ba5dfae6e90a2bb76e22c93909b05bd78fa4
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis 


[Impala-ASF-CR] IMPALA-5184: build fe against both Hive 1 & 2 APIs

2017-04-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5184: build fe against both Hive 1 & 2 APIs
..


Patch Set 7:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/5538/7//COMMIT_MSG
Commit Message:

Line 9: This adds a compatibility shim layer with Hive 1 and Hive 2
> Is there a Hive doc summarizing the compatibility changes. Did a quick web 
I don't believe so. I looked at the history of some of these changes and 
couldn't find an overall list.


PS7, Line 16: the
> incomplete
Done


http://gerrit.cloudera.org:8080/#/c/5538/7/bin/impala-config.sh
File bin/impala-config.sh:

Line 134: export IMPALA_HIVE_MAJOR_VERSION=1
> This can be inferred from IMPALA_HIVE_VERSION given the naming scheme is fo
Yeah we probably don't need the flexibility.


Line 377: export HIVE_SRC_DIR=${HIVE_SRC_DIR:-"${HIVE_HOME}/src"}
I ran into the dreaded sticky config variable problem with some of these 
variables switching between versions. This seems like a good time to fix it 
before more Impala devs end up switching between versions.


http://gerrit.cloudera.org:8080/#/c/5538/7/common/thrift/.gitignore
File common/thrift/.gitignore:

Line 4: hive-2-api/TCLIService.thrift
> May be add hive-*-api/TCLIService.thrift (Not sure the exact wildcard synta
The other one is checked in, so any changes to it shouldn't be ignored.

Hive's TCLIService has had various additions to it, so it may make sense at 
some point to refresh this.


http://gerrit.cloudera.org:8080/#/c/5538/7/common/thrift/CMakeLists.txt
File common/thrift/CMakeLists.txt:

Line 189: # The thrift-generated java classes defined in TCLIService are also 
downloaded via maven.
> Wondering why this logic is required instead of adding a copy of hive-2-api
TCLIService.thrift from Hive 2 adds additional RPC endpoints that we don't 
implement.


http://gerrit.cloudera.org:8080/#/c/5538/7/fe/src/compat-hive-1/java/org/apache/impala/compat/MetaStoreCompatShim.java
File 
fe/src/compat-hive-1/java/org/apache/impala/compat/MetaStoreCompatShim.java:

PS7, Line 46: MetaStoreCompatShim
> MetaStoreCompatShim sounds kinda redundant, maybe just MetaStoreShim? I'm n
Done


PS7, Line 68:   
> 4 space indentation, here and multiple places in this file.
Done


http://gerrit.cloudera.org:8080/#/c/5538/7/fe/src/compat-hive-2/java/org/apache/impala/compat/MetaStoreCompatShim.java
File 
fe/src/compat-hive-2/java/org/apache/impala/compat/MetaStoreCompatShim.java:

Line 43:  * A wrapper around some of Hive's metastore API's to abstract away 
differences
> May be add the versions this shim is compatible for? We can get it from the
Done


Line 100:   public static TResultSet execGetFunctions(Frontend frontend,
> Do these belong here? They look similar across both the Shim classes.
The thrift classes moved between packages. The code is identical, but because 
the classes moved from the 'cli' to the 'rpc' package, the imports at the top 
of the file are different. There isn't really a good way to do conditional 
imports in Java.
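
To illustrate the arrangement described here (the package name and method are
simplified stand-ins, not Impala's actual shim), each compat source tree
provides a class with the same fully-qualified name, differing essentially
only in which Hive package it imports; the build compiles exactly one of the
two trees.

// Sketch of fe/src/compat-hive-1/.../MetaStoreShim.java; the copy under
// compat-hive-2 would be identical except that the import below points at
// org.apache.hive.service.rpc.thrift instead.
package org.example.compat;

import org.apache.hive.service.cli.thrift.TGetTablesReq;

public final class MetaStoreShim {
  // Callers compile against this single class name regardless of the Hive
  // version on the build path.
  public static Class<?> getTablesRequestType() { return TGetTablesReq.class; }
}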


http://gerrit.cloudera.org:8080/#/c/5538/3/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java:

Line 3092: }
Long line.


-- 
To view, visit http://gerrit.cloudera.org:8080/5538
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifbc265281c04fe3136bc3c920dbac966742ce09a
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5184: build fe against both Hive 1 & 2 APIs

2017-04-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new patch set (#8).

Change subject: IMPALA-5184: build fe against both Hive 1 & 2 APIs
..

IMPALA-5184: build fe against both Hive 1 & 2 APIs

This adds a compatibility shim layer with Hive 1 and Hive 2
implementations. The version-specific code lives in
fe/src/compat-hive-$IMPALA_HIVE_MAJOR_VERSION and
common/thrift/hive-$IMPALA_HIVE_MAJOR_VERSION-api/

The shim adds wrapper methods to handle differing method signatures
and config variables that changed slightly.

Some thrift classes were also moved from the 'cli' to 'rpc' package.
We work around these by implementing subclasses with the same name
in a different package for compatibility or by implementing shim
methods that operate on the classes. We also need to change the
package in the TCLIService.thrift, which is done with a
search-and-replace.

Also avoid the sticky config variable problem with some of the source
paths by requiring an _OVERRIDE suffix on the variable to override it
from the environment.

Testing:
Made sure that I could build Impala on master as normal, and also
with the following config overrides in bin/impala-config-local.sh:

  export IMPALA_HADOOP_VERSION=3.0.0-alpha1-cdh6.x-SNAPSHOT
  export IMPALA_HBASE_VERSION=2.0.0-cdh6.x-SNAPSHOT
  export IMPALA_HIVE_VERSION=2.1.0-cdh6.x-SNAPSHOT
  export IMPALA_SENTRY_VERSION=1.5.1-cdh6.x-SNAPSHOT
  export IMPALA_PARQUET_VERSION=1.5.0-cdh6.x-SNAPSHOT

I manually assembled the dependencies by copying the following files
from the Hive 2 source and from a Hadoop 3 build:
$CDH_COMPONENTS_HOME/hive-2.1.0-cdh6.x-SNAPSHOT/src/metastore/if/hive_metastore.thrift
$CDH_COMPONENTS_HOME/hadoop-3.0.0-alpha1-cdh6.x-SNAPSHOT/lib/native/*
$CDH_COMPONENTS_HOME/hadoop-3.0.0-alpha1-cdh6.x-SNAPSHOT/include/hdfs.h

Change-Id: Ifbc265281c04fe3136bc3c920dbac966742ce09a
---
M bin/impala-config.sh
M common/thrift/.gitignore
M common/thrift/CMakeLists.txt
R common/thrift/hive-1-api/TCLIService.thrift
M fe/pom.xml
A 
fe/src/compat-hive-1/java/org/apache/hive/service/rpc/thrift/TGetCatalogsReq.java
A 
fe/src/compat-hive-1/java/org/apache/hive/service/rpc/thrift/TGetColumnsReq.java
A 
fe/src/compat-hive-1/java/org/apache/hive/service/rpc/thrift/TGetFunctionsReq.java
A fe/src/compat-hive-1/java/org/apache/hive/service/rpc/thrift/TGetInfoReq.java
A 
fe/src/compat-hive-1/java/org/apache/hive/service/rpc/thrift/TGetSchemasReq.java
A 
fe/src/compat-hive-1/java/org/apache/hive/service/rpc/thrift/TGetTablesReq.java
A fe/src/compat-hive-1/java/org/apache/impala/compat/MetaStoreShim.java
A fe/src/compat-hive-2/java/org/apache/impala/compat/MetaStoreShim.java
M fe/src/main/java/org/apache/impala/analysis/ColumnDef.java
M fe/src/main/java/org/apache/impala/analysis/CreateDataSrcStmt.java
M fe/src/main/java/org/apache/impala/analysis/CreateDbStmt.java
M fe/src/main/java/org/apache/impala/analysis/DropDataSrcStmt.java
M fe/src/main/java/org/apache/impala/analysis/Subquery.java
M fe/src/main/java/org/apache/impala/analysis/TableName.java
M fe/src/main/java/org/apache/impala/analysis/TypeDef.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
M fe/src/test/java/org/apache/impala/analysis/AuthorizationTest.java
M fe/src/test/java/org/apache/impala/analysis/ParserTest.java
M fe/src/test/java/org/apache/impala/service/FrontendTest.java
27 files changed, 382 insertions(+), 82 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/38/5538/8
-- 
To view, visit http://gerrit.cloudera.org:8080/5538
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ifbc265281c04fe3136bc3c920dbac966742ce09a
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Bharath Vissapragada 


[Impala-ASF-CR] IMPALA-4883: Union Codegen

2017-04-10 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change.

Change subject: IMPALA-4883: Union Codegen
..


Patch Set 8:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6459/8/be/src/codegen/gen_ir_descriptions.py
File be/src/codegen/gen_ir_descriptions.py:

PS8, Line 183:   ["UNION_MATERIALIZE_BATCH",
 :   "_ZN6impala9UnionNode16MaterializeBatchEPNS_8RowBatchEPPh"],
nit: please consider putting it in a different entry so as not to break up 
READ_AVRO* sequence.


http://gerrit.cloudera.org:8080/#/c/6459/8/be/src/exec/union-node.cc
File be/src/exec/union-node.cc:

PS8, Line 223: codegend_union_materialize_batch_fns_.data()
Why not codegend_union_materialize_batch_fns_[child_idx_] ? Same below.


Line 290:   int num_rows_before = row_batch->num_rows();
So, in that case, is it true that the rows in the non-empty batch are already 
counted towards num_rows_returned_ ? Line 303 seems to imply so but I could be 
wrong.


PS8, Line 304: num_rows_added
I probably misunderstood something here. Can you please explain why we need to 
subtract num_rows_added too ?


-- 
To view, visit http://gerrit.cloudera.org:8080/6459
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib4107d27582ff5416172810364a6e76d3d93c439
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3748: minimum buffer requirements in planner

2017-04-10 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-3748: minimum buffer requirements in planner
..


Patch Set 13:

(23 comments)

http://gerrit.cloudera.org:8080/#/c/5847/13/be/src/scheduling/query-schedule.cc
File be/src/scheduling/query-schedule.cc:

Line 212
do we still need this?


http://gerrit.cloudera.org:8080/#/c/5847/13/be/src/service/query-exec-state.cc
File be/src/service/query-exec-state.cc:

Line 396:   // Add info strings consumed by CM: Estimated mem and tables 
missing stats.
update


Line 402:   if (query_exec_request.__isset.per_host_min_reservation_bytes) {
can we ditch the _bytes everywhere and declare it's in bytes?


http://gerrit.cloudera.org:8080/#/c/5847/13/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
File fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java:

Line 252: long perInstanceMinBufferBytes = 2 * SPILLABLE_BUFFER_BYTES;
is this covered by a test case?


http://gerrit.cloudera.org:8080/#/c/5847/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 103:   private final static int MAX_THREAD_TOKENS_PER_CORE = 3;
no idea where that's from, but it seems wrong.


http://gerrit.cloudera.org:8080/#/c/5847/13/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
File fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java:

Line 80: long numPartitions =
rename to perinstance-


http://gerrit.cloudera.org:8080/#/c/5847/13/fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
File fe/src/main/java/org/apache/impala/planner/KuduScanNode.java:

Line 268: // TODO: add an estimate
remove todo (we already have a todo to remove the estimates)


http://gerrit.cloudera.org:8080/#/c/5847/13/fe/src/main/java/org/apache/impala/planner/KuduTableSink.java
File fe/src/main/java/org/apache/impala/planner/KuduTableSink.java:

Line 60: if (explainLevel.ordinal() >= TExplainLevel.EXTENDED.ordinal()) {
this if block is in every data sink's explainstring. move it into a function.
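
A minimal sketch of the suggested refactor (names are made up for 
illustration; this is not the actual patch):

  // Hypothetical shared helper on the sink base class, so that each data
  // sink's explain string no longer repeats the verbosity check.
  protected boolean isExtendedExplain(TExplainLevel explainLevel) {
    return explainLevel.ordinal() >= TExplainLevel.EXTENDED.ordinal();
  }

Each sink would then guard its extra detail with a single 
isExtendedExplain(explainLevel) call.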


http://gerrit.cloudera.org:8080/#/c/5847/13/fe/src/main/java/org/apache/impala/planner/PipelinedPlanNodeSet.java
File fe/src/main/java/org/apache/impala/planner/PipelinedPlanNodeSet.java:

Line 85: long minBufferBytes = 0L;
minReservationBytes


Line 93:   node.computeResourceProfile(queryOptions);
rather than calling this here, call it after creating plan fragments.


Line 126:   sink.computeResourceProfile(queryOptions);
same here


http://gerrit.cloudera.org:8080/#/c/5847/13/fe/src/main/java/org/apache/impala/planner/PlanFragment.java
File fe/src/main/java/org/apache/impala/planner/PlanFragment.java:

Line 202:   public int getNumNodes() {
do we still need this?


Line 228: * Estimates the per-fragment-instance number of distinct values 
of exprs based on the
not super-important, but: should it be "per-fragment instance number"?


Line 297:   str.append(String.format("%s:PLAN FRAGMENT [%s]", 
fragmentId_.toString(),
also call getfragmentheaderstring() here


http://gerrit.cloudera.org:8080/#/c/5847/13/fe/src/main/java/org/apache/impala/planner/PlanNode.java
File fe/src/main/java/org/apache/impala/planner/PlanNode.java:

Line 276:   TQueryOptions queryOptions, TExplainLevel detailLevel) {
tqueryoptions already contains the explain level. either pass in mt_dop 
explicitly or remove detaillevel.


Line 621:* Compute resources consumed when executing this PlanNode.
also mention that you're populating resourceProfile_


http://gerrit.cloudera.org:8080/#/c/5847/13/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

Line 344
this disappeared (but the comment for it didn't)


Line 361: long perHostMinReservationBytes = -1;
remove -bytes


http://gerrit.cloudera.org:8080/#/c/5847/13/fe/src/main/java/org/apache/impala/planner/ResourceProfile.java
File fe/src/main/java/org/apache/impala/planner/ResourceProfile.java:

Line 33:   private final long memBytesEstimate_;
memEstimateBytes_


http://gerrit.cloudera.org:8080/#/c/5847/13/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

Line 241: long perInstanceMinBufferBytes = 3 * SPILLABLE_BUFFER_BYTES;
-reservation instead of bufferbytes


http://gerrit.cloudera.org:8080/#/c/5847/13/testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test:

Line 21:  PARALLELPLANS
you should set mt_dop explicitly if you want to test this.

this doesn't look any different than the distributed plan, what are you testing 
for? would be good to point that out at the top of this file.

also, what about single-node plans?


Line 504: Per-Host Resource Reservation: Memory=0B
wha

[Impala-ASF-CR] IMPALA-3973: optional 3rd and 4th arguments for instr().

2017-04-10 Thread Jim Apple (Code Review)
Jim Apple has posted comments on this change.

Change subject: IMPALA-3973: optional 3rd and 4th arguments for instr().
..


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/5589/2/docs/topics/impala_string_functions.xml
File docs/topics/impala_string_functions.xml:

PS2, Line 335: 1
> Done
What line? I don't see it.


Line 406: select instr('hello world','o',-1);
> Done
What line is the negative position argument and an occurrence argument on? 444 
and 447 are invalid input.


http://gerrit.cloudera.org:8080/#/c/5589/4/docs/topics/impala_string_functions.xml
File docs/topics/impala_string_functions.xml:

PS4, Line 352: ++
not exactly: that position does not exist.


PS4, Line 415: 'o', 
fourth


-- 
To view, visit http://gerrit.cloudera.org:8080/5589
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I17268bdb480230938f94559fe1eabe34ac2448b7
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: John Russell 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Reviewer: zi+z...@cloudera.com
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3973: optional 3rd and 4th arguments for instr().

2017-04-10 Thread John Russell (Code Review)
John Russell has posted comments on this change.

Change subject: IMPALA-3973: optional 3rd and 4th arguments for instr().
..


Patch Set 2:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/5589/2/docs/topics/impala_string_functions.xml
File docs/topics/impala_string_functions.xml:

PS2, Line 335: 1
> Same as the "off the end of the string" possibility below. The 'occurrence'
Done


PS2, Line 335: 1
> Can the occurrence argument be 0?
Done


PS2, Line 335: 1
> Can you add a note to say that in this doc?
Done


PS2, Line 349: 7
> What if occurrence is off the end of the string?
Done


PS2, Line 349: 7
> That one's covered by
Done


PS2, Line 395: negative position argument
> That particular circumstance causes an error instead of a zero return value
Done


PS2, Line 395: negative position argument
> Can the occurrence argument be zero or negative?
Done


PS2, Line 395: negative position argument
> That's intentional. In fact, initially I returned 0 in this case as well, b
OK. The "UDF error" string in the error message gave me a different impression 
than when I see other errors from built-in functions. Usually the messages 
downplay the fact that built-in functions are just UDFs behind the scenes.


PS2, Line 395: negative position argument
> Can the behavior be explicitly called out in this patch's contents?
Done


Line 406: select instr('hello world','o',-1);
> I do not see a patch set 3 here yet. Are you planning on sending that in a 
Done


Line 406: select instr('hello world','o',-1);
> Sure. In general, we try to be more detailed than Hive when it comes to pro
Done


Line 406: select instr('hello world','o',-1);
> Can you show an example with a negative position argument and an occurrence
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5589
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I17268bdb480230938f94559fe1eabe34ac2448b7
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: John Russell 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Reviewer: zi+z...@cloudera.com
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3973: optional 3rd and 4th arguments for instr().

2017-04-10 Thread John Russell (Code Review)
Hello Zoltan Ivanfi,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5589

to look at the new patch set (#4).

Change subject: IMPALA-3973: optional 3rd and 4th arguments for instr().
..

IMPALA-3973: optional 3rd and 4th arguments for instr().

Change-Id: I17268bdb480230938f94559fe1eabe34ac2448b7
---
M docs/topics/impala_string_functions.xml
1 file changed, 165 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/89/5589/4
-- 
To view, visit http://gerrit.cloudera.org:8080/5589
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I17268bdb480230938f94559fe1eabe34ac2448b7
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: John Russell 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Reviewer: zi+z...@cloudera.com


[Impala-ASF-CR] IMPALA-4883: Union Codegen

2017-04-10 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4883: Union Codegen
..


Patch Set 8: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/6459
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib4107d27582ff5416172810364a6e76d3d93c439
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3973: optional 3rd and 4th arguments for instr().

2017-04-10 Thread John Russell (Code Review)
Hello Zoltan Ivanfi,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5589

to look at the new patch set (#3).

Change subject: IMPALA-3973: optional 3rd and 4th arguments for instr().
..

IMPALA-3973: optional 3rd and 4th arguments for instr().

Change-Id: I17268bdb480230938f94559fe1eabe34ac2448b7
---
M docs/topics/impala_string_functions.xml
1 file changed, 162 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/89/5589/3
-- 
To view, visit http://gerrit.cloudera.org:8080/5589
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I17268bdb480230938f94559fe1eabe34ac2448b7
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: John Russell 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-Reviewer: zi+z...@cloudera.com


[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

2017-04-10 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
..


Patch Set 6:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/6107/6//COMMIT_MSG
Commit Message:

PS6, Line 11: Impala
wouldn't this bug mean that no one could read these files? i.e. they were 
completely invalid.


http://gerrit.cloudera.org:8080/#/c/6107/6/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

PS6, Line 223: i
what is 'i'? is this talking about 'val'?


PS6, Line 248: buff
buf


PS6, Line 259: buff
buf


PS6, Line 260: buff
buf


http://gerrit.cloudera.org:8080/#/c/6107/6/be/src/util/compress.cc
File be/src/util/compress.cc:

PS6, Line 209: its
what is this referring to? size of what?
and I think this comment would be clearer if reworded to explain the block 
encoding in order (rather than "preceded with").


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5031: Remove undefined behavior: left shift of large signed

2017-04-10 Thread Jim Apple (Code Review)
Jim Apple has posted comments on this change.

Change subject: IMPALA-5031: Remove undefined behavior: left shift of large 
signed
..


Patch Set 2:

There are 16 open blocker bugs right now, so I'm going to hold off on 
submitting this for a few days.



-- 
To view, visit http://gerrit.cloudera.org:8080/6528
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I748697cf503e9e717a6e95250c2cbbf031c6352d
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Jim Apple 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR](asf-site) First publishing of Apache Impala (incubating) documentation to the Apache web site.

2017-04-10 Thread Laurel Hale (Code Review)
Laurel Hale has abandoned this change.

Change subject: First publishing of Apache Impala (incubating) documentation to 
the Apache web site.
..


Abandoned

There were a couple of problems with the publish procedure I followed, so I will 
abandon this review and re-do it.

-- 
To view, visit http://gerrit.cloudera.org:8080/6601
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: abandon
Gerrit-Change-Id: If33c03c9d6e64f3974756df50b0777eeb1333ed8
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: asf-site
Gerrit-Owner: Laurel Hale 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: John Russell 
Gerrit-Reviewer: Laurel Hale 
Gerrit-Reviewer: Michael Brown 


[Impala-ASF-CR] IMPALA-4883: Union Codegen

2017-04-10 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has uploaded a new patch set (#8).

Change subject: IMPALA-4883: Union Codegen
..

IMPALA-4883: Union Codegen

For each non-passthrough child of the Union node, codegen the loop that
does per row tuple materialization.

Testing:
Ran test_queries.py test locally in exhaustive mode.

Benchmark:
Ran a local benchmark on a local 10 GB TPCDS dataset on an unpartitioned
store_sales table.

SELECT
  COUNT(c),
  COUNT(ss_customer_sk),
  COUNT(ss_cdemo_sk),
  COUNT(ss_hdemo_sk),
  COUNT(ss_addr_sk),
  COUNT(ss_store_sk),
  COUNT(ss_promo_sk),
  COUNT(ss_ticket_number),
  COUNT(ss_quantity),
  COUNT(ss_wholesale_cost),
  COUNT(ss_list_price),
  COUNT(ss_sales_price),
  COUNT(ss_ext_discount_amt),
  COUNT(ss_ext_sales_price),
  COUNT(ss_ext_wholesale_cost),
  COUNT(ss_ext_list_price),
  COUNT(ss_ext_tax),
  COUNT(ss_coupon_amt),
  COUNT(ss_net_paid),
  COUNT(ss_net_paid_inc_tax),
  COUNT(ss_net_profit),
  COUNT(ss_sold_date_sk)
FROM (
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
) t

Before: 39s704ms
Operator         #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail
-----------------------------------------------------------------------------------------------------
13:AGGREGATE          1  194.504us  194.504us        1           1   28.00 KB        -1.00 B  FINALIZE
12:EXCHANGE           1   17.284us   17.284us        3           1          0        -1.00 B  UNPARTITIONED
11:AGGREGATE          3    2s202ms    2s934ms        3           1  115.00 KB       10.00 MB
00:UNION              3   32s514ms   34s926ms  288.01M     288.01M    3.08 MB              0
|--02:SCAN HDFS       3  158.373ms  216.085ms   28.80M      28.80M  489.71 MB        1.88 GB  tpcds_10_parquet.store_sales
|--03:SCAN HDFS       3  167.002ms  171.738ms   28.80M      28.80M  489.74 MB        1.88 GB  tpcds_10_parquet.store_sales
|--04:SCAN HDFS       3  125.331ms  145.496ms   28.80M      28.80M  489.57 MB        1.88 GB  tpcds_10_parquet.store_sales
|--05:SCAN HDFS       3  148.478ms  194.311ms   28.80M      28.80M  489.69 MB        1.88 GB  tpcds_10_parquet.store_sales
|--06:SCAN HDFS       3  143.995ms  162.781ms   28.80M      28.80M  489.57 MB        1.88 GB  tpcds_10_parquet.store_sales
|--07:SCAN HDFS       3  169.731ms  250.201ms   28.80M      28.80M  489.58 MB        1.88 GB  tpcds_10_parquet.store_sales
|--08:SCAN HDFS       3  164.110ms  254.374ms   28.80M      28.80M  489.61 MB        1.88 GB  tpcds_10_parquet.store_sales
|--09:SCAN HDFS       3  135.631ms  162.117ms   28.80M      28.80M  489.63 MB        1.88 GB  tpcds_10_parquet.store_sales
|--10:SCAN HDFS       3  138.736ms  167.778ms   28.80M      28.80M  489.67 MB        1.88 GB  tpcds_10_parquet.store_sales
01:SCAN HDFS          3  202.015ms  248.728ms   28.80M      28.80M  489.68 MB        1.88 GB  tpcds_10_parquet.store_sales

After: 20s177ms
Operator         #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail
-----------------------------------------------------------------------------------------------------
13:AGGREGATE          1  174.617us  174.617us        1           1   28.00 KB        -1.00 B  FINALIZE
12:EXCHANGE           1   16.693us   16.693us        3           1          0        -1.00 B  UNPARTITIONED
11:AGGREGATE          3    2s830ms    3s615ms        3           1  115.00 KB       10.00 MB
00:UNION              3    4s296ms    5s258ms  288.01M     288.01M    3.08 MB              0
|--02:SCAN HDFS       3    1s212ms    1s340ms   28.80M      28.80M  488.82 MB        1.88 GB  tpcds_10_parquet.store_sales
|--03:SCAN HDFS       3    1s387ms    1s570ms   28.80M      28.80M  489.37 MB        1.88 GB  tpcds_10_parquet.store_sales
|--04:SCAN HDFS       3    1s224ms    1s347ms   28.80M      28.80M  487.22 MB        1.88 GB  tpcds_10_parquet.store_sales
|--05:SCAN HDFS       3    1s245ms    1s3

[Impala-ASF-CR] IMPALA-4883: Union Codegen

2017-04-10 Thread Taras Bobrovytsky (Code Review)
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6459

to look at the new patch set (#8).

Change subject: IMPALA-4883: Union Codegen
..

IMPALA-4883: Union Codegen

For each non-passthrough child of the Union node, codegen the loop that
does per row tuple materialization.

Testing:
Ran test_queries.py test locally in exhaustive mode.

Benchmark:
Ran a local benchmark on a local 10 GB TPCDS dataset on an unpartitioned
store_sales table.

SELECT
  COUNT(c),
  COUNT(ss_customer_sk),
  COUNT(ss_cdemo_sk),
  COUNT(ss_hdemo_sk),
  COUNT(ss_addr_sk),
  COUNT(ss_store_sk),
  COUNT(ss_promo_sk),
  COUNT(ss_ticket_number),
  COUNT(ss_quantity),
  COUNT(ss_wholesale_cost),
  COUNT(ss_list_price),
  COUNT(ss_sales_price),
  COUNT(ss_ext_discount_amt),
  COUNT(ss_ext_sales_price),
  COUNT(ss_ext_wholesale_cost),
  COUNT(ss_ext_list_price),
  COUNT(ss_ext_tax),
  COUNT(ss_coupon_amt),
  COUNT(ss_net_paid),
  COUNT(ss_net_paid_inc_tax),
  COUNT(ss_net_profit),
  COUNT(ss_sold_date_sk)
FROM (
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
  union all
  select fnv_hash(ss_sold_time_sk) c, * from 
tpcds_10_parquet.store_sales_unpartitioned
) t

Before: 39s704ms
Operator         #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail
-----------------------------------------------------------------------------------------------------
13:AGGREGATE          1  194.504us  194.504us        1           1   28.00 KB        -1.00 B  FINALIZE
12:EXCHANGE           1   17.284us   17.284us        3           1          0        -1.00 B  UNPARTITIONED
11:AGGREGATE          3    2s202ms    2s934ms        3           1  115.00 KB       10.00 MB
00:UNION              3   32s514ms   34s926ms  288.01M     288.01M    3.08 MB              0
|--02:SCAN HDFS       3  158.373ms  216.085ms   28.80M      28.80M  489.71 MB        1.88 GB  tpcds_10_parquet.store_sales
|--03:SCAN HDFS       3  167.002ms  171.738ms   28.80M      28.80M  489.74 MB        1.88 GB  tpcds_10_parquet.store_sales
|--04:SCAN HDFS       3  125.331ms  145.496ms   28.80M      28.80M  489.57 MB        1.88 GB  tpcds_10_parquet.store_sales
|--05:SCAN HDFS       3  148.478ms  194.311ms   28.80M      28.80M  489.69 MB        1.88 GB  tpcds_10_parquet.store_sales
|--06:SCAN HDFS       3  143.995ms  162.781ms   28.80M      28.80M  489.57 MB        1.88 GB  tpcds_10_parquet.store_sales
|--07:SCAN HDFS       3  169.731ms  250.201ms   28.80M      28.80M  489.58 MB        1.88 GB  tpcds_10_parquet.store_sales
|--08:SCAN HDFS       3  164.110ms  254.374ms   28.80M      28.80M  489.61 MB        1.88 GB  tpcds_10_parquet.store_sales
|--09:SCAN HDFS       3  135.631ms  162.117ms   28.80M      28.80M  489.63 MB        1.88 GB  tpcds_10_parquet.store_sales
|--10:SCAN HDFS       3  138.736ms  167.778ms   28.80M      28.80M  489.67 MB        1.88 GB  tpcds_10_parquet.store_sales
01:SCAN HDFS          3  202.015ms  248.728ms   28.80M      28.80M  489.68 MB        1.88 GB  tpcds_10_parquet.store_sales

After: 20s177ms
Operator         #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail
-----------------------------------------------------------------------------------------------------
13:AGGREGATE          1  174.617us  174.617us        1           1   28.00 KB        -1.00 B  FINALIZE
12:EXCHANGE           1   16.693us   16.693us        3           1          0        -1.00 B  UNPARTITIONED
11:AGGREGATE          3    2s830ms    3s615ms        3           1  115.00 KB       10.00 MB
00:UNION              3    4s296ms    5s258ms  288.01M     288.01M    3.08 MB              0
|--02:SCAN HDFS       3    1s212ms    1s340ms   28.80M      28.80M  488.82 MB        1.88 GB  tpcds_10_parquet.store_sales
|--03:SCAN HDFS       3    1s387ms    1s570ms   28.80M      28.80M  489.37 MB        1.88 GB  tpcds_10_parquet.store_sales
|--04:SCAN HDFS       3    1s224ms    1s347ms   28.80M      28.80M  4

[Impala-ASF-CR] IMPALA-4883: Union Codegen

2017-04-10 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-4883: Union Codegen
..


Patch Set 7:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6459/7/be/src/exec/union-node-ir.cc
File be/src/exec/union-node-ir.cc:

PS7, Line 34: RuntimeState* state
> Is this not used at all ?
Good catch, removed.


PS7, Line 45: child_row_idx < num_child_batch_rows) {
> There is a macro called FOREACH_ROW() which is useful for iterating through
Done


http://gerrit.cloudera.org:8080/#/c/6459/6/be/src/exec/union-node.h
File be/src/exec/union-node.h:

PS6, Line 71: the pointer to the mem pool is hard coded in the codegen'
> I couldn't find one, but I think Michael knows more about this.
Created IMPALA-5192


-- 
To view, visit http://gerrit.cloudera.org:8080/6459
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib4107d27582ff5416172810364a6e76d3d93c439
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR](asf-site) First publishing of Apache Impala (incubating) documentation to the Apache web site.

2017-04-10 Thread Laurel Hale (Code Review)
Laurel Hale has posted comments on this change.

Change subject: First publishing of Apache Impala (incubating) documentation to 
the Apache web site.
..


Patch Set 1:

> (1 comment)

I'm going to abandon this review because of an issue that Jim identified. The 
next one I do, I'll be sure to include the commit id used for the build.

-- 
To view, visit http://gerrit.cloudera.org:8080/6601
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If33c03c9d6e64f3974756df50b0777eeb1333ed8
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: asf-site
Gerrit-Owner: Laurel Hale 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: John Russell 
Gerrit-Reviewer: Laurel Hale 
Gerrit-Reviewer: Michael Brown 
Gerrit-HasComments: No


[Impala-ASF-CR](asf-site) First publishing of Apache Impala (incubating) documentation to the Apache web site.

2017-04-10 Thread Laurel Hale (Code Review)
Laurel Hale has posted comments on this change.

Change subject: First publishing of Apache Impala (incubating) documentation to 
the Apache web site.
..


Patch Set 1:

> (1 comment)

I'll abandon this review and start over. Will delete the contents of the 
docs/build directory before I build.

-- 
To view, visit http://gerrit.cloudera.org:8080/6601
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If33c03c9d6e64f3974756df50b0777eeb1333ed8
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: asf-site
Gerrit-Owner: Laurel Hale 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: John Russell 
Gerrit-Reviewer: Laurel Hale 
Gerrit-Reviewer: Michael Brown 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5184: build fe against both Hive 1 & 2 APIs

2017-04-10 Thread Bharath Vissapragada (Code Review)
Bharath Vissapragada has posted comments on this change.

Change subject: IMPALA-5184: build fe against both Hive 1 & 2 APIs
..


Patch Set 7:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/5538/7//COMMIT_MSG
Commit Message:

Line 9: This adds a compatibility shim layer with Hive 1 and Hive 2
Is there a Hive doc summarizing the compatibility changes? Did a quick web 
search, couldn't find any.


PS7, Line 16: the
incomplete


http://gerrit.cloudera.org:8080/#/c/5538/7/bin/impala-config.sh
File bin/impala-config.sh:

Line 134: export IMPALA_HIVE_MAJOR_VERSION=1
This can be inferred from IMPALA_HIVE_VERSION given the naming scheme is 
followed? (Unless we want to support a combination of different 
IMPALA_HIVE_MAJOR_VERSION and IMPALA_HIVE_VERSION, which I don't think is the 
case)


http://gerrit.cloudera.org:8080/#/c/5538/7/common/thrift/.gitignore
File common/thrift/.gitignore:

Line 4: hive-2-api/TCLIService.thrift
Maybe add hive-*-api/TCLIService.thrift (not sure the exact wildcard syntax 
works for .gitignore)? I don't think we are expected to change Hive's spec 
anyway.


http://gerrit.cloudera.org:8080/#/c/5538/7/common/thrift/CMakeLists.txt
File common/thrift/CMakeLists.txt:

Line 189: # The thrift-generated java classes defined in TCLIService are also 
downloaded via maven.
Wondering why this logic is required instead of adding a copy of 
hive-2-api/TCLIService.thrift from hive-2's git repo just like how we maintain 
a copy of hive-1-api/thrift. Whoever is compiling can just toggle 
IMPALA_HIVE_MAJOR_VERSION and use the appropriate thrift file. Am I missing 
something?


http://gerrit.cloudera.org:8080/#/c/5538/7/fe/src/compat-hive-1/java/org/apache/impala/compat/MetaStoreCompatShim.java
File 
fe/src/compat-hive-1/java/org/apache/impala/compat/MetaStoreCompatShim.java:

PS7, Line 46: MetaStoreCompatShim
MetaStoreCompatShim sounds kinda redundant, maybe just MetaStoreShim? I'm not 
too strong about this, up to you :).


PS7, Line 68:   
4 space indentation, here and multiple places in this file.


http://gerrit.cloudera.org:8080/#/c/5538/7/fe/src/compat-hive-2/java/org/apache/impala/compat/MetaStoreCompatShim.java
File 
fe/src/compat-hive-2/java/org/apache/impala/compat/MetaStoreCompatShim.java:

Line 43:  * A wrapper around some of Hive's metastore API's to abstract away 
differences
Maybe add the versions this shim is compatible with? We can get it from the 
package name, but I think it is worthy of a class comment (Here and the other 
shim too).


Line 100:   public static TResultSet execGetFunctions(Frontend frontend,
Do these belong here? They look similar across both the Shim classes.


-- 
To view, visit http://gerrit.cloudera.org:8080/5538
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifbc265281c04fe3136bc3c920dbac966742ce09a
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5031: Remove undefined behavior: left shift of large signed

2017-04-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5031: Remove undefined behavior: left shift of large 
signed
..


Patch Set 2: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6528
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I748697cf503e9e717a6e95250c2cbbf031c6352d
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Jim Apple 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] [DOCS] Add placeholder for DECIMAL V2 query option

2017-04-10 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: [DOCS] Add placeholder for DECIMAL_V2 query option
..


Patch Set 2: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7a3bbd0d56da548d8d42e8ef3b71f2c014bc822d
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Greg Rahn 
Gerrit-HasComments: No


[Impala-ASF-CR] [DOCS] Add placeholder for DECIMAL V2 query option

2017-04-10 Thread Greg Rahn (Code Review)
Greg Rahn has posted comments on this change.

Change subject: [DOCS] Add placeholder for DECIMAL_V2 query option
..


Patch Set 2: Code-Review+1

lgtm

-- 
To view, visit http://gerrit.cloudera.org:8080/6596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7a3bbd0d56da548d8d42e8ef3b71f2c014bc822d
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Greg Rahn 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4192: Disentangle Expr and ExprContext

2017-04-10 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change.

Change subject: IMPALA-4192: Disentangle Expr and ExprContext
..


Patch Set 6:

I will post a new patch once the draft has gotten further along. Please feel 
free to wait till then to review.

-- 
To view, visit http://gerrit.cloudera.org:8080/5483
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iefdc9aeeba033355cb9497e3a5d2363627dcf2f3
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5184: build fe against both Hive 1 & 2 APIs

2017-04-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new patch set (#7).

Change subject: IMPALA-5184: build fe against both Hive 1 & 2 APIs
..

IMPALA-5184: build fe against both Hive 1 & 2 APIs

This adds a compatibility shim layer with Hive 1 and Hive 2
implementations. The version-specific code lives in
fe/src/compat-hive-$IMPALA_HIVE_MAJOR_VERSION and
common/thrift/hive-$IMPALA_HIVE_MAJOR_VERSION-api/

The shim adds wrapper methods to handle differing method signatures
and config variables that changed slightly.
For these, MetaStoreCompatShim has methods that abstract away the differences.

Some thrift classes were also moved from the 'cli' to 'rpc' package.
We work around these by implementing subclasses with the same name
in a different package for compatibility or by implementing shim
methods that operate on the classes. We also need to change the
package in the TCLIService.thrift, which is done with a
search-and-replace.

Testing:
Made sure that I could build Impala on master as normal, and also
with the following config overrides in bin/impala-config-local.sh:

  export IMPALA_HADOOP_VERSION=3.0.0-alpha1-cdh6.x-SNAPSHOT
  export IMPALA_HBASE_VERSION=2.0.0-cdh6.x-SNAPSHOT
  export IMPALA_HIVE_MAJOR_VERSION=2
  export IMPALA_HIVE_VERSION=2.1.1-cdh6.x-SNAPSHOT
  export IMPALA_SENTRY_VERSION=1.5.1-cdh6.x-SNAPSHOT
  export IMPALA_PARQUET_VERSION=1.5.0-cdh5.10.1
  export KUDU_JAVA_VERSION=1.4.0-cdh6.x-SNAPSHOT

I manually assembled the dependencies by copying the following files
from the Hive 2 source and from a Hadoop 3 build:
$CDH_COMPONENTS_HOME/hive-2.1.0-cdh6.x-SNAPSHOT/src/metastore/if/hive_metastore.thrift
$CDH_COMPONENTS_HOME/hadoop-3.0.0-alpha1-cdh6.x-SNAPSHOT/lib/native/*
$CDH_COMPONENTS_HOME/hadoop-3.0.0-alpha1-cdh6.x-SNAPSHOT/include/hdfs.h

Change-Id: Ifbc265281c04fe3136bc3c920dbac966742ce09a
---
M bin/impala-config.sh
M common/thrift/.gitignore
M common/thrift/CMakeLists.txt
R common/thrift/hive-1-api/TCLIService.thrift
M fe/pom.xml
A 
fe/src/compat-hive-1/java/org/apache/hive/service/rpc/thrift/TGetCatalogsReq.java
A 
fe/src/compat-hive-1/java/org/apache/hive/service/rpc/thrift/TGetColumnsReq.java
A 
fe/src/compat-hive-1/java/org/apache/hive/service/rpc/thrift/TGetFunctionsReq.java
A fe/src/compat-hive-1/java/org/apache/hive/service/rpc/thrift/TGetInfoReq.java
A 
fe/src/compat-hive-1/java/org/apache/hive/service/rpc/thrift/TGetSchemasReq.java
A 
fe/src/compat-hive-1/java/org/apache/hive/service/rpc/thrift/TGetTablesReq.java
A fe/src/compat-hive-1/java/org/apache/impala/compat/MetaStoreCompatShim.java
A fe/src/compat-hive-2/java/org/apache/impala/compat/MetaStoreCompatShim.java
M fe/src/main/java/org/apache/impala/analysis/ColumnDef.java
M fe/src/main/java/org/apache/impala/analysis/CreateDataSrcStmt.java
M fe/src/main/java/org/apache/impala/analysis/CreateDbStmt.java
M fe/src/main/java/org/apache/impala/analysis/DropDataSrcStmt.java
M fe/src/main/java/org/apache/impala/analysis/Subquery.java
M fe/src/main/java/org/apache/impala/analysis/TableName.java
M fe/src/main/java/org/apache/impala/analysis/TypeDef.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
M fe/src/test/java/org/apache/impala/analysis/AuthorizationTest.java
M fe/src/test/java/org/apache/impala/analysis/ParserTest.java
M fe/src/test/java/org/apache/impala/service/FrontendTest.java
27 files changed, 390 insertions(+), 71 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/38/5538/7
-- 
To view, visit http://gerrit.cloudera.org:8080/5538
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ifbc265281c04fe3136bc3c920dbac966742ce09a
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4817: Populate Parquet Statistics for Strings

2017-04-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4817: Populate Parquet Statistics for Strings
..


Patch Set 2:

(7 comments)

I had a few high-level comments/questions.

http://gerrit.cloudera.org:8080/#/c/6563/2//COMMIT_MSG
Commit Message:

Line 25: TODO: This change still needs tests reading statistics from files which
We'll probably need to check in some files written with the old stats, right? 
Since we'll switch the version of Hive at some point. Might be best to just do 
this now so that we've got the test coverage.


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 559: // TODO: Here we need to pass in the column order (signed vs 
unsigned).
Are you going to fix that in the CR?


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 44: /// value (as opposed to their binary representation).
Add a comment to describe ordering for strings, to be consistent with 
specifying these orders.


Line 150: class ColumnStats : public TypedColumnStatsBase {
I'm confused why we need a three-level hierarchy with additional template 
specialisation. I don't understand what having the TypedColumnStatsBase level 
to the hierarchy buys us.

If we're going to use template specialisation it seems like we should be able 
to collapse ColumnStats and TypedColumnStatsBase into one somehow


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

Line 103:   // TODO: How to do memory management here?
Fixed?


http://gerrit.cloudera.org:8080/#/c/6563/1/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 437: """Test that Impala does not write statistics for char columns."""
Comment out-of-date?


Line 445: ]
Can we add some tests involving "interesting" characters. E.g. '\0' and some 
bytes >= 128. I think these should "just work" in our implementation but it 
would be good to have the coverage, since these things were broken with the old 
Hive stats.


-- 
To view, visit http://gerrit.cloudera.org:8080/6563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR](asf-site) First publishing of Apache Impala (incubating) documentation to the Apache web site.

2017-04-10 Thread Michael Brown (Code Review)
Michael Brown has posted comments on this change.

Change subject: First publishing of Apache Impala (incubating) documentation to 
the Apache web site.
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6601/1//COMMIT_MSG
Commit Message:

PS1, Line 7: First publishing of Apache Impala (incubating) documentation
   : to the Apache web site.
It would be good to know which commit hash these docs come from. I'll update 
the instructions to make sure to include that.


-- 
To view, visit http://gerrit.cloudera.org:8080/6601
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If33c03c9d6e64f3974756df50b0777eeb1333ed8
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: asf-site
Gerrit-Owner: Laurel Hale 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: John Russell 
Gerrit-Reviewer: Michael Brown 
Gerrit-HasComments: Yes


[Impala-ASF-CR](asf-site) First publishing of Apache Impala (incubating) documentation to the Apache web site.

2017-04-10 Thread Jim Apple (Code Review)
Jim Apple has posted comments on this change.

Change subject: First publishing of Apache Impala (incubating) documentation to 
the Apache web site.
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6601/1/docs/build/html/topics/impala_howto_rm.html
File docs/build/html/topics/impala_howto_rm.html:

I don't see the corresponding XML file in the repo, and this talks about CM. I 
also don't see this when I follow the instructions:

https://cwiki.apache.org/confluence/display/IMPALA/Pushing+Impala+docs+to+impala.apache.org

I think this is leftover. I'd suggest you abandon this patch, remove the 
docs/build directory and all of its contents on your local machine, and build 
again. My guess is this is from an old build.

I'll change the directions to include that the user should delete their 
docs/build directory.


-- 
To view, visit http://gerrit.cloudera.org:8080/6601
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If33c03c9d6e64f3974756df50b0777eeb1333ed8
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: asf-site
Gerrit-Owner: Laurel Hale 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: John Russell 
Gerrit-Reviewer: Michael Brown 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4166: Add SORT BY sql clause

2017-04-10 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4166: Add SORT BY sql clause
..


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6495/9/fe/src/main/java/org/apache/impala/analysis/TablePropertyAnalyzer.java
File fe/src/main/java/org/apache/impala/analysis/TablePropertyAnalyzer.java:

Line 114:   if (!foundColumn) {
> Yes, currently that's the case. I think we discussed this in person once, t
why not apply the drop/rename to the sort-by list?

(why is rename an issue? do you not store the sorting columns as integer 
indices into the table column list?)
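
For illustration, a hypothetical sketch of the representation being asked 
about (Table, Column and AnalysisException are assumed frontend classes; this 
is not the actual patch):

  // Resolve the SORT BY column names to ordinal positions in the table's
  // column list at analysis time. A later column rename leaves these
  // positions valid; a column drop would still need to adjust the list.
  private static List<Integer> resolveSortColumns(Table table, List<String> sortCols)
      throws AnalysisException {
    List<Integer> positions = new ArrayList<>();
    for (String colName : sortCols) {
      int pos = -1;
      List<Column> cols = table.getColumns();
      for (int i = 0; i < cols.size(); ++i) {
        if (cols.get(i).getName().equalsIgnoreCase(colName)) {
          pos = i;
          break;
        }
      }
      if (pos == -1) throw new AnalysisException("Unknown SORT BY column: " + colName);
      positions.add(pos);
    }
    return positions;
  }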


-- 
To view, visit http://gerrit.cloudera.org:8080/6495
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I08834f38a941786ab45a4381c2732d929a934f75
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4817: Populate Parquet Statistics for Strings

2017-04-10 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4817: Populate Parquet Statistics for Strings
..


Patch Set 2:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 541: const string* thrift_stats = nullptr;
bad variable name: this is a plain-encoded value. thrift_stats sounds like it's 
a struct (it's definitely not something that requires a plural).


Line 546:   thrift_stats = ParquetMetadataUtils::GetThriftStats(
why did you break this up instead of having ReadFromThrift do the extra work? 
do you need GetThriftStats anywhere else? the old control flow was easier to 
follow.


Line 556: if (!thrift_stats) continue;
explicit comparison


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 31: /// This class, together with its derivatives, is used to track column 
statistics when
track is really not a meaningful term here, it generally just means 'follow'. 
revise description.


Line 36: /// We currently support tracking 'min_value' and 'max_value' values 
for statistics. The
hopefully also min and max, no? tracking means reading/decoding.


Line 46: /// We currently don't write statistics for DECIMAL values and 
TIMESTAMP values due to
why is that still not the case?


Line 66:   const string& thrift_stats, const ColumnType& col_type, void* 
slot);
you changed the type and meaning of a parameter, but you didn't change the name.


Line 72:   /// Creates a copy of the contents of this object. Some data types 
(e.g. StringValue)
unclear what this does, because copies are usually returned or passed into 
something.


Line 146: /// This class contains further type-specific behavior that is common 
only to a subset of
why can't this be collapsed into a 2-level hierarchy?
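
For what it's worth, a rough sketch of the two-level layout the question
seems to be after (hypothetical names, not the code in the patch): a single
non-templated base exposing the virtual interface, and one templated subclass
that carries all type-specific state, with nothing in between.

  #include <algorithm>
  #include <cstdint>

  class ColumnStatsBase {
   public:
    virtual ~ColumnStatsBase() {}
    virtual int64_t BytesNeeded() const = 0;
  };

  template <typename T>
  class TypedColumnStats : public ColumnStatsBase {
   public:
    void Update(const T& v) {
      min_value_ = has_values_ ? std::min(min_value_, v) : v;
      max_value_ = has_values_ ? std::max(max_value_, v) : v;
      has_values_ = true;
    }
    int64_t BytesNeeded() const override { return 2 * sizeof(T); }

   private:
    bool has_values_ = false;
    T min_value_{};
    T max_value_{};
  };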


Line 151:  protected:
protected follows public section


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

Line 34: inline int64_t TypedColumnStatsBase::BytesNeeded() const {
hard to compare, please move the functions back to where they were. feel free 
to reorder at the end of the review cycle.


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

Line 243: int col_idx, const StatsField& stats_field, const ColumnType& 
col_type) {
col_idx is unused.

instead of passing in both the columnchunk and columntype, why not use 
col_chunk.meta_data.type?


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-metadata-utils.h
File be/src/exec/parquet-metadata-utils.h:

Line 60:   static bool ReadOldStats(const ColumnType& col_type);
this sounds like it's doing some reading.

also, 'deprecated' instead of old.


http://gerrit.cloudera.org:8080/#/c/6563/2/common/thrift/parquet.thrift
File common/thrift/parquet.thrift:

Line 344:   BROTLI = 4;
do we return an error when we see that codec?


Line 567: /** Union containing the order used for min, max, and sorting values 
in a column
why isn't this implied by the logical type of that column?

if this is simply to mark "legacy" ordered columns, then why not simply have a 
bool here?


-- 
To view, visit http://gerrit.cloudera.org:8080/6563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4817: Populate Parquet Statistics for Strings

2017-04-10 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4817: Populate Parquet Statistics for Strings
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6563/2/common/thrift/parquet.thrift
File common/thrift/parquet.thrift:

Line 608:* be Signed. In addition, min and max values for INTERVAL or 
DECIMAL stored
> I'm not sure I understand your question here. Thinking about it, it is also
should this also say something about strings?


-- 
To view, visit http://gerrit.cloudera.org:8080/6563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

2017-04-10 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
..


Patch Set 6:

(2 comments)

Thanks for the review.

http://gerrit.cloudera.org:8080/#/c/6107/6/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

PS6, Line 247: DCHECK(val < -112);
> nit: DCHECK_LT(val, -112);
Done


PS6, Line 258: DCHECK(val > 127);
> nit: DCHECK_GT(val, 127);
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-10 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#6).

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..

IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.

After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust them if the property is absent. No adjustment will be
applied to data written by Impala.

New tables created by Impala will set the table property to UTC if the
global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is
set to true.

Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not
set the table property unless the global flag is set to true.

Tables created using CREATE TABLE LIKE <other table> will copy the
property of the table that is copied.

This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.

Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/fe-support.cc
M be/src/service/impala-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
M fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
M fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M tests/common/impala_test_suite.py
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/metadata/test_ddl.py
M tests/metadata/test_ddl_base.py
A tests/query_test/test_parquet_timestamp_compatibility.py
26 files changed, 657 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/5939/6
-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
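
To summarize the precedence described in the commit message above, a minimal
sketch (hypothetical names; the real logic is split between the Java frontend
and the Parquet scanner): data written by Impala is never adjusted, the table
property wins when it is set, and the legacy flag is consulted only when the
property is absent.

  #include <string>

  enum class Adjustment { kNone, kFromTableProperty, kLocalTimeZone };

  // 'write_zone' holds the parquet.mr.int96.write.zone property ("" if unset);
  // 'convert_legacy_flag' mirrors --convert_legacy_hive_parquet_utc_timestamps;
  // 'written_by_impala' is derived from the file's writer metadata.
  Adjustment ChooseAdjustment(const std::string& write_zone,
                              bool convert_legacy_flag,
                              bool written_by_impala) {
    if (written_by_impala) return Adjustment::kNone;
    if (!write_zone.empty()) return Adjustment::kFromTableProperty;
    if (convert_legacy_flag) return Adjustment::kLocalTimeZone;
    return Adjustment::kNone;
  }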


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-10 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5939/5/tests/custom_cluster/test_parquet_timestamp_compatibility.py
File tests/custom_cluster/test_parquet_timestamp_compatibility.py:

Line 71:   @pytest.mark.execute_serially
> Moved tests that use startup options to ./tests/custom_cluster/test_hive_pa
The rest of the tests were moved to 
./tests/query_test/test_parquet_timestamp_compatibility.py


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

2017-04-10 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
..


Patch Set 5:

(26 comments)

Thanks for the review.

http://gerrit.cloudera.org:8080/#/c/5939/5/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

Line 305:   // parquet-mr.
> Comment that the FE guarantees that this is a valid timezone.
Done


http://gerrit.cloudera.org:8080/#/c/5939/5/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 246:   DCHECK((is_timestamp_dependent_timezone_ && timezone_ == NULL) 
||
> simplify condition:
Done


Line 549:   /// Used to cache the timezone object corresponding to 
"parquet.mr.int96.write.zone"
> the "parquet.mr.int96.write.zone" table property
Done


Line 550:   /// table property to avoid repeated calls to 
'TimezoneDatabase::FindTimezone()'.
> no need to single-quote here
Done


http://gerrit.cloudera.org:8080/#/c/5939/5/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

Line 190:   // See if they specified a zone id
> Who is "they"? Suggest rephrasing
Done


Line 202:   return time_zone_ptr();
> return NULL?
Done


http://gerrit.cloudera.org:8080/#/c/5939/5/common/thrift/BackendGflags.thrift
File common/thrift/BackendGflags.thrift:

Line 58:   // If true, all newly created Parquet tables will have the 
parquet.mr.int96.write.zone
> ... all newly created HDFS tables regardless of format ...
Done


http://gerrit.cloudera.org:8080/#/c/5939/5/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 218:   9: optional string parquet_mr_write_zone;
> remove trailing semicolon for consistency
Done


http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
File 
fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java:

Line 169:   private static void analyzeParquetMrWriteZone(Table table,
> These cases need tests in AnalyzeDDL
Done. I also changed the exception message for consistency.


http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
File fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java:

Line 112:   throw new AnalysisException("Invalid Time Zone: " + timezone);
> Mention that the timezone is in the table property 'parquet.mr.int96.write.
Done


http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java:

Line 183: if 
(getTblProperties().containsKey(HdfsTable.TBL_PROP_PARQUET_MR_WRITE_ZONE)) {
> These cases need tests in AnalyzeDDL
Done


Line 186: "Only HDFS tables can use '%s' table property.",
> the '%s' table property
Done. I also changed the exception message for consistency.


http://gerrit.cloudera.org:8080/#/c/5939/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 54: import org.apache.impala.common.InternalException;
> not needed
Done


http://gerrit.cloudera.org:8080/#/c/5939/5/tests/custom_cluster/test_parquet_timestamp_compatibility.py
File tests/custom_cluster/test_parquet_timestamp_compatibility.py:

Line 25: class TestParquetTimestampCompatibility(CustomClusterTestSuite):
> Test needs a comment. In particular, what is covered here and what is cover
I've added a comment that describes tests in this file.

What do you mean by "integration tests not part of Impala tests"? I don't think 
we have any integration tests upstream.


Line 26:   TEST_DB_NAME = 'timezone_test_db'
> Why not use the unique_database fixture? That's the best practice.
Switched to unique_database.


Line 39: self.client = impalad.service.create_beeswax_client()
> I think we already have a self.client from ImpalaTestSuite.
Removed creating a client from here.


Line 49: 
"/test-warehouse/alltypesagg_hive_13_1_parquet/alltypesagg_hive_13_1.parquet"
> We need the appropriate filesystem prefix here. This will break on local fs
Done


Line 56: self.client.execute('USE ' + self.TEST_DB_NAME)
> Why? Better to avoid changing the session state
Removed it.


Line 59: self.client.execute('INVALIDATE METADATA')
> why?
Removed it, it's not needed.


Line 61:   def _set_impala_table_timezone(self, timezone):
> simplify to pass table name as param
Done


Line 71:   @pytest.mark.execute_serially
> How long does this test take? I'm thinking we should only have minimal test
Moved tests that use startup options to 
./tests/custom_cluster/test_hive_parquet_timestamp_conversion.py


Line 81: statement = '''ALTER TABLE hive_table
> Should be an analyzer test in AnalyzeDDL
Removed it from here.


Line 104: select_from_impala_table = '''SELECT timestamp_col FROM 
impala_table WHERE id = 1'''
> Can you th

[Impala-ASF-CR] IMPALA-4817: Populate Parquet Statistics for Strings

2017-04-10 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-4817: Populate Parquet Statistics for Strings
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6563/2/common/thrift/parquet.thrift
File common/thrift/parquet.thrift:

Line 608:* be Signed. In addition, min and max values for INTERVAL or 
DECIMAL stored
> string?
I'm not sure I understand your question here. Thinking about it, it is also 
unclear to me what the second sentence is supposed to say, so I added a comment 
to the upstream PR.
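
For context, a small self-contained example (not from the patch) of why the
spec has to spell this out: the same pair of strings can order differently
when bytes are compared as signed chars versus as unsigned bytes, so min/max
values computed under one order are not usable under the other.

  #include <algorithm>
  #include <iostream>
  #include <string>

  // Byte-wise comparison, treating bytes as either signed or unsigned.
  int Compare(const std::string& a, const std::string& b, bool signed_bytes) {
    size_t n = std::min(a.size(), b.size());
    for (size_t i = 0; i < n; ++i) {
      int ca = signed_bytes ? static_cast<signed char>(a[i])
                            : static_cast<unsigned char>(a[i]);
      int cb = signed_bytes ? static_cast<signed char>(b[i])
                            : static_cast<unsigned char>(b[i]);
      if (ca != cb) return ca < cb ? -1 : 1;
    }
    return a.size() == b.size() ? 0 : (a.size() < b.size() ? -1 : 1);
  }

  int main() {
    std::string x = "a";         // single byte 0x61
    std::string y = "\xC3\xA9";  // UTF-8 e-acute, first byte 0xC3
    // Unsigned order: "a" < "é".  Signed order: "é" < "a".
    std::cout << Compare(x, y, false) << " " << Compare(x, y, true) << "\n";
    return 0;
  }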


-- 
To view, visit http://gerrit.cloudera.org:8080/6563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5137: Support Kudu UNIXTIME MICROS as Impala TIMESTAMP

2017-04-10 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-5137: Support Kudu UNIXTIME_MICROS as Impala TIMESTAMP
..


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6526/3/be/src/exec/kudu-table-sink.cc
File be/src/exec/kudu-table-sink.cc:

Line 302:   DCHECK(success) << "Invalid TimestampValue";
This DCHECK effectively tests tv->HasDateAndTime(). Is there any way this could 
fail? In a release build we would read the uninitialized ts_micros in the line 
below. Would it be an option to use RETURN_IF_FALSE here? It could also be 
inlined then.
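
A rough sketch of the runtime-check alternative being suggested (hypothetical
types; the real TimestampValue and Status APIs differ): the conversion reports
failure through a Status instead of relying on a debug-only assertion, so a
release build never reads the uninitialized microseconds value.

  #include <cstdint>
  #include <string>

  struct Status {
    bool ok;
    std::string message;
    static Status OK() { return {true, ""}; }
    static Status Error(const std::string& msg) { return {false, msg}; }
  };

  struct TimestampValue {
    bool has_date_and_time = false;
    int64_t micros = 0;
    // Returns false (and leaves *out untouched) for an invalid value.
    bool ToUnixTimeMicros(int64_t* out) const {
      if (!has_date_and_time) return false;
      *out = micros;
      return true;
    }
  };

  // Instead of DCHECK(success), reject the value with a Status in all builds.
  Status ConvertTimestampCell(const TimestampValue& tv, int64_t* out_micros) {
    return tv.ToUnixTimeMicros(out_micros)
        ? Status::OK()
        : Status::Error("Invalid TIMESTAMP value cannot be written to Kudu");
  }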


http://gerrit.cloudera.org:8080/#/c/6526/3/fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
File fe/src/main/java/org/apache/impala/planner/KuduScanNode.java:

Line 125: String colName = desc.getColumn().getName();
Nit: Now that I look at this, you could also inline it. But I don't feel 
strongly about it at all, so feel free to keep it as it is if you prefer.


http://gerrit.cloudera.org:8080/#/c/6526/2/fe/src/main/java/org/apache/impala/util/KuduUtil.java
File fe/src/main/java/org/apache/impala/util/KuduUtil.java:

Line 186:   default:
Where did this go?


-- 
To view, visit http://gerrit.cloudera.org:8080/6526
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iae6ccfffb79118a9036fb2227dba3a55356c896d
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-HasComments: Yes


[native-toolchain-CR] Ported native-toolchain to work on ppc64le

2017-04-10 Thread Valencia Edna Serrao (Code Review)
Valencia Edna Serrao has posted comments on this change.

Change subject: Ported native-toolchain to work on ppc64le
..


Patch Set 1:

> > > @Matthew: Would you be needing the ppc64le infra for temporary
 > > > usage to test the changes until Cloudera/Apache arranges its own
 > > > node or is it needed permanently for maintenance of ppc64le code?

 > >
 > > We don't have a way to allocate a ppc64le node in our pre-commit
 > > testing, which uses Jenkins on AWS. We will need a way to test
 > > ppc64le works as long as we want to support it.
 > >
 > > This means a physical node outside of AWS that Jenkins can talk to
 > > or an emulator on an AWS node that Jenkins can talk to.
 > >
 > > And this needs to exist not just for testing your new patchset but
 > > indefinitely.
 > 
 > Also, this discussion should probably happen on dev@, not here.
 > Patch comment discussions are probably read by fewer people. Can
 > you start your discussion on testing on dev@, please?

Thanks, Jim! I'll be sure to start the discussion on dev@ as well.

-- 
To view, visit http://gerrit.cloudera.org:8080/6468
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7931752ca497bc7a5e3cc574bbb54637f382c72a
Gerrit-PatchSet: 1
Gerrit-Project: native-toolchain
Gerrit-Branch: master
Gerrit-Owner: Valencia Edna Serrao 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Valencia Edna Serrao 
Gerrit-HasComments: No