[Impala-ASF-CR] IMPALA-8942: Set file format specific split sizes on non-block stores

2019-09-23 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14247 )

Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
..


Patch Set 6: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Mon, 23 Sep 2019 23:14:50 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8942: Set file format specific split sizes on non-block stores

2019-09-23 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/14247 )

Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
..

IMPALA-8942: Set file format specific split sizes on non-block stores

On non-block based stores (e.g. S3, ADLS, etc.), the planner creates
split sizes based on the value of FileSystem.getDefaultBlockSize(Path).
This does not work well for Parquet, because the scanners will only
process a split if the data range defined by the split overlaps with
the midpoint of the Parquet row group. This is done to ensure that
scanners treat Parquet row groups as the unit of processing. The default
block size for non-block based stores is typically much lower than the
Parquet row group size. This causes a lot of dummy Parquet splits to be
created and processed, most of which end up doing nothing. The major
issue this causes is skew, and each scanner ends up processing a skewed
amount of data (see IMPALA-3453 for details on the skew issue).

This patch adds a new query option PARQUET_OBJECT_STORE_SPLIT_SIZE
(defaults to 256 MB) that controls the size of Parquet splits on
non-block stores.

Impala docs actually recommend setting fs.s3a.block.size to 128 MB
(row group size used by Hive / Spark) or 256 MB (row group size used by
Impala). Setting the block size to the row group size results in ideal
split assignment, but experiments show that using a 256 MB block size
for 128 MB row groups is better than using a 128 MB block size for 256
MB row groups, so the default value of PARQUET_OBJECT_STORE_SPLIT_SIZE is
256 MB. Updated the docs accordingly.

Testing:
* Ran core tests
* Added tests to test_scanners.py

Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Reviewed-on: http://gerrit.cloudera.org:8080/14247
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M docs/topics/impala_s3.xml
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M tests/query_test/test_scanners.py
7 files changed, 92 insertions(+), 4 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 7
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 


[Impala-ASF-CR] IMPALA-8942: Set file format specific split sizes on non-block stores

2019-09-23 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14247 )

Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
..


Patch Set 6:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4990/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Mon, 23 Sep 2019 19:03:20 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8942: Set file format specific split sizes on non-block stores

2019-09-23 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14247 )

Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
..


Patch Set 6: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Mon, 23 Sep 2019 19:03:19 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8942: Set file format specific split sizes on non-block stores

2019-09-23 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14247 )

Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
..


Patch Set 5: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Mon, 23 Sep 2019 19:01:41 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8942: Set file format specific split sizes on non-block stores

2019-09-23 Thread Alex Rodoni (Code Review)
Alex Rodoni has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14247 )

Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
..


Patch Set 5: Code-Review+1

The doc looks fine.


--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Mon, 23 Sep 2019 19:00:22 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8942: Set file format specific split sizes on non-block stores

2019-09-23 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14247 )

Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14247/4/docs/topics/impala_s3.xml
File docs/topics/impala_s3.xml:

http://gerrit.cloudera.org:8080/#/c/14247/4/docs/topics/impala_s3.xml@736
PS4, Line 736: As
> A line break?
Done



--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Mon, 23 Sep 2019 14:44:35 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8942: Set file format specific split sizes on non-block stores

2019-09-23 Thread Sahil Takiar (Code Review)
Hello Alex Rodoni, David Rorke, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14247

to look at the new patch set (#5).

Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
..

IMPALA-8942: Set file format specific split sizes on non-block stores

On non-block based stores (e.g. S3, ADLS, etc.), the planner creates
split sizes based on the value of FileSystem.getDefaultBlockSize(Path).
This does not work well for Parquet, because the scanners will only
process a split if the data range defined by the split overlaps with
the midpoint of the Parquet row group. This is done to ensure that
scanners treat Parquet row groups as the unit of processing. The default
block size for non-block based stores is typically much lower than the
Parquet row group size. This causes a lot of dummy Parquet splits to be
created and processed, most of which end up doing nothing. The major
issue this causes is skew, and each scanner ends up processing a skewed
amount of data (see IMPALA-3453 for details on the skew issue).

This patch adds a new query option PARQUET_OBJECT_STORE_SPLIT_SIZE
(defaults to 256 MB) that controls the size of Parquet splits on
non-block stores.

Impala docs actually recommend setting fs.s3a.block.size to 128 MB
(row group size used by Hive / Spark) or 256 MB (row group size used by
Impala). Setting the block size to the row group size results in ideal
split assignment, but experiments show that using a 256 MB block size
for 128 MB row groups is better than using a 128 MB block size for 256
MB row groups, so the default value of PARQUET_OBJECT_STORE_SPLIT_SIZE is
256 MB. Updated the docs accordingly.

Testing:
* Ran core tests
* Added tests to test_scanners.py

Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M docs/topics/impala_s3.xml
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M tests/query_test/test_scanners.py
7 files changed, 92 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/14247/5
--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 


[Impala-ASF-CR] IMPALA-8942: Set file format specific split sizes on non-block stores

2019-09-20 Thread Alex Rodoni (Code Review)
Alex Rodoni has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14247 )

Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14247/4/docs/topics/impala_s3.xml
File docs/topics/impala_s3.xml:

http://gerrit.cloudera.org:8080/#/c/14247/4/docs/topics/impala_s3.xml@736
PS4, Line 736: As
A line break?



--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Fri, 20 Sep 2019 20:37:22 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8942: Set file format specific split sizes on non-block stores

2019-09-20 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14247 )

Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
..


Patch Set 4: Code-Review+2

This looks good to me. Check if other reviewers have anything, but I'm going 
ahead with +2.


--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Fri, 20 Sep 2019 20:12:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8942: Set file format specific split sizes on non-block stores

2019-09-20 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14247 )

Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
..


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14247/3/be/src/service/query-options.cc
File be/src/service/query-options.cc:

http://gerrit.cloudera.org:8080/#/c/14247/3/be/src/service/query-options.cc@881
PS3, Line 881: int64_t parquet_object_store_split_size;
 : RETURN_IF_ERROR(ParseMemValue(
 : value, "parquet object store split size", 
_object_store_split_size));
 : query_options->__set_parquet_object_store_split_size(
 : parquet_object_store_split_size);
 : break;
> We won't respect a value below 1MB (due to MIN_SYNTHETIC_BLOCK_SIZE), but d
Done


http://gerrit.cloudera.org:8080/#/c/14247/3/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/14247/3/tests/query_test/test_scanners.py@995
PS3, Line 995: num_row_groups_counters = re.findall(r'NumRowGroups: (\d+)', 
result.runtime_profile)
 : assert len(num_row_groups_counters) > 1
> I think this test would fail if the code wasn't working. I think there are
Done



--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Fri, 20 Sep 2019 18:02:30 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8942: Set file format specific split sizes on non-block stores

2019-09-20 Thread Sahil Takiar (Code Review)
Hello Alex Rodoni, David Rorke, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14247

to look at the new patch set (#4).

Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
..

IMPALA-8942: Set file format specific split sizes on non-block stores

On non-block based stores (e.g. S3, ADLS, etc.), the planner creates
split sizes based on the value of FileSystem.getDefaultBlockSize(Path).
This does not work well for Parquet, because the scanners will only
process a split if the data range defined by the split overlaps with
the midpoint of the Parquet row group. This is done to ensure that
scanners treat Parquet row groups as the unit of processing. The default
block size for non-block based stores is typically much lower than the
Parquet row group size. This causes a lot of dummy Parquet splits to be
created and processed, most of which end up doing nothing. The major
issue this causes is skew, and each scanner ends up processing a skewed
amount of data (see IMPALA-3453 for details on the skew issue).

This patch adds a new query option PARQUET_OBJECT_STORE_SPLIT_SIZE
(defaults to 256 MB) that controls the size of Parquet splits on
non-block stores.

Impala docs actually recommend setting fs.s3a.block.size to 128 MB
(row group size used by Hive / Spark) or 256 MB (row group size used by
Impala). Setting the block size to the row group size results in ideal
split assignment, but experiments show that using a 256 MB block size
for 128 MB row groups is better than using a 128 MB block size for 256
MB row groups, so the default value of PARQUET_OBJECT_STORE_SPLIT_SIZE is
256 MB. Updated the docs accordingly.

Testing:
* Ran core tests
* Added tests to test_scanners.py

Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M docs/topics/impala_s3.xml
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M tests/query_test/test_scanners.py
7 files changed, 93 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/14247/4
--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 


[Impala-ASF-CR] IMPALA-8942: Set file format specific split sizes on non-block stores

2019-09-19 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14247 )

Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
..


Patch Set 3: Code-Review+1

(2 comments)

I think this is good. I had a couple small comments.

http://gerrit.cloudera.org:8080/#/c/14247/3/be/src/service/query-options.cc
File be/src/service/query-options.cc:

http://gerrit.cloudera.org:8080/#/c/14247/3/be/src/service/query-options.cc@881
PS3, Line 881: int64_t parquet_object_store_split_size;
 : RETURN_IF_ERROR(ParseMemValue(
 : value, "parquet object store split size", 
_object_store_split_size));
 : query_options->__set_parquet_object_store_split_size(
 : parquet_object_store_split_size);
 : break;
We won't respect a value below 1MB (due to MIN_SYNTHETIC_BLOCK_SIZE), but do we 
want to force this option to be above a certain value? I don't think people 
will use this query option, and it is marked advanced, so this may not matter 
much.


http://gerrit.cloudera.org:8080/#/c/14247/3/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/14247/3/tests/query_test/test_scanners.py@995
PS3, Line 995: num_row_groups_counters = re.findall(r'NumRowGroups: (\d+)', 
result.runtime_profile)
 : assert len(num_row_groups_counters) > 1
I think this test would fail if the code wasn't working. I think there are ways 
to be more direct about what you are measuring.

I haven't tested this on s3, but I think comparing the "NumRowGroups:" to the 
"ScanRangesComplete:" would tell you if there are any scan ranges that don't 
contain a row group midpoint. With your change, since we don't have Parquet 
files larger than 256MB, the two should be the same. The other thing that might 
show the difference is "NumScannersWithNoReads". With your change, that should 
always be zero, and I think it would be non-zero without your change.



--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Thu, 19 Sep 2019 23:00:53 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8942: Set file format specific split sizes on non-block stores

2019-09-18 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14247 )

Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
..


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14247/1/be/src/service/query-options.h
File be/src/service/query-options.h:

http://gerrit.cloudera.org:8080/#/c/14247/1/be/src/service/query-options.h@192
PS1, Line 192:   QUERY_OPT_FN(parquet_object_store_split_size, 
PARQUET_OBJECT_STORE_SPLIT_
> Since this is primarily used by object stores (I think the only exception i
Done


http://gerrit.cloudera.org:8080/#/c/14247/2/be/src/service/query-options.cc
File be/src/service/query-options.cc:

http://gerrit.cloudera.org:8080/#/c/14247/2/be/src/service/query-options.cc@884
PS2, Line 884: 
query_options->__set_parquet_object_store_split_size(parquet_object_store_split_size);
> line too long (94 > 90)
Done



--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Wed, 18 Sep 2019 22:48:43 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8942: Set file format specific split sizes on non-block stores

2019-09-18 Thread Sahil Takiar (Code Review)
Hello Alex Rodoni, David Rorke, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14247

to look at the new patch set (#3).

Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
..

IMPALA-8942: Set file format specific split sizes on non-block stores

On non-block based stores (e.g. S3, ADLS, etc.), the planner creates
split sizes based on the value of FileSystem.getDefaultBlockSize(Path).
This does not work well for Parquet, because the scanners will only
process a split if the data range defined by the split overlaps with
the midpoint of the Parquet row group. This is done to ensure that
scanners treat Parquet row groups as the unit of processing. The default
block size for non-block based stores is typically much lower than the
Parquet row group size. This causes a lot of dummy Parquet splits to be
created and processed, most of which end up doing nothing. The major
issue this causes is skew, and each scanner ends up processing a skewed
amount of data (see IMPALA-3453 for details on the skew issue).

This patch adds a new query option PARQUET_OBJECT_STORE_SPLIT_SIZE
(defaults to 256 MB) that controls the size of Parquet splits on
non-block stores.

Impala docs actually recommend setting fs.s3a.block.size to 128 MB
(row group size used by Hive / Spark) or 256 MB (row group size used by
Impala). Setting the block size to the row group size results in ideal
split assignment, but experiments show that using a 256 MB block size
for 128 MB row groups is better than using a 128 MB block size for 256
MB row groups, so the default value of PARQUET_OBJECT_STORE_SPLIT_SIZE is
256 MB. Updated the docs accordingly.

Testing:
* Ran core tests
* Added tests to test_scanners.py

Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M docs/topics/impala_s3.xml
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M tests/query_test/test_scanners.py
7 files changed, 69 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/14247/3
--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 


[Impala-ASF-CR] IMPALA-8942: Set file format specific split sizes on non-block stores

2019-09-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14247 )

Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14247/2/be/src/service/query-options.cc
File be/src/service/query-options.cc:

http://gerrit.cloudera.org:8080/#/c/14247/2/be/src/service/query-options.cc@884
PS2, Line 884: 
query_options->__set_parquet_object_store_split_size(parquet_object_store_split_size);
line too long (94 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Comment-Date: Wed, 18 Sep 2019 22:45:51 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8942: Set file format specific split sizes on non-block stores

2019-09-18 Thread Sahil Takiar (Code Review)
Hello Alex Rodoni, David Rorke, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14247

to look at the new patch set (#2).

Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
..

IMPALA-8942: Set file format specific split sizes on non-block stores

On non-block based stores (e.g. S3, ADLS, etc.), the planner creates
split sizes based on the value of FileSystem.getDefaultBlockSize(Path).
This does not work well for Parquet, because the scanners will only
process a split if the data range defined by the split overlaps with
the midpoint of the Parquet row group. This is done to ensure that
scanners treat Parquet row groups as the unit of processing. The default
block size for non-block based stores is typically much lower than the
Parquet row group size. This causes a lot of dummy Parquet splits to be
created and processed, most of which end up doing nothing. The major
issue this causes is skew, and each scanner ends up processing a skewed
amount of data (see IMPALA-3453 for details on the skew issue).

This patch adds a new query option PARQUET_OBJECT_STORE_SPLIT_SIZE
(defaults to 256 MB) that controls the size of Parquet splits on
non-block stores.

Impala docs actually recommend setting fs.s3a.block.size to 128 MB
(row group size used by Hive / Spark) or 256 MB (row group size used by
Impala). Setting the block size to the row group size results in ideal
split assignment, but experiments show that using a 256 MB block size
for 128 MB row groups is better than using a 128 MB block size for 256
MB row groups, so the default value of PARQUET_OBJECT_STORE_SPLIT_SIZE is
256 MB. Updated the docs accordingly.

Testing:
* Ran core tests
* Added tests to test_scanners.py

Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M docs/topics/impala_s3.xml
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M tests/query_test/test_scanners.py
7 files changed, 68 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/14247/2
--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 


[Impala-ASF-CR] IMPALA-8942: Set file format specific split sizes on non-block stores

2019-09-18 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14247 )

Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
..


Patch Set 1:

(1 comment)

First pass, the code looks good. A quick thought on the name of the query 
option. I need to think through the test in test_scanners.py.

http://gerrit.cloudera.org:8080/#/c/14247/1/be/src/service/query-options.h
File be/src/service/query-options.h:

http://gerrit.cloudera.org:8080/#/c/14247/1/be/src/service/query-options.h@192
PS1, Line 192:   QUERY_OPT_FN(parquet_non_block_split_size, 
PARQUET_NON_BLOCK_SPLIT_SIZE,\
Since this is primarily used by object stores (I think the only exception is 
the local filesystem), an alternative is to name this 
PARQUET_OBJECT_STORE_SPLIT_SIZE.



--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Alex Rodoni 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Comment-Date: Wed, 18 Sep 2019 21:02:46 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8942: Set file format specific split sizes on non-block stores

2019-09-17 Thread Sahil Takiar (Code Review)
Sahil Takiar has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/14247


Change subject: IMPALA-8942: Set file format specific split sizes on non-block 
stores
..

IMPALA-8942: Set file format specific split sizes on non-block stores

On non-block based stores (e.g. S3, ADLS, etc.), the planner creates
split sizes based on the value of FileSystem.getDefaultBlockSize(Path).
This does not work well for Parquet, because the scanners will only
process a split if the data range defined by the split overlaps with
the midpoint of the Parquet row group. This is done to ensure that
scanners treat Parquet row groups as the unit of processing. The default
block size for non-block based stores is typically much lower than the
Parquet row group size. This causes a lot of dummy Parquet splits to be
created and processed, most of which end up doing nothing. The major
issue this causes is skew, and each scanner ends up processing a skewed
amount of data (see IMPALA-3453 for details on the skew issue).

This patch adds a new query option PARQUET_NON_BLOCK_SPLIT_SIZE
(defaults to 256 MB) that controls the size of Parquet splits on
non-block stores.

Impala docs actually recommend setting fs.s3a.block.size to 128 MB
(row group size used by Hive / Spark) or 256 MB (row group size used by
Impala). Setting the block size to the row group size results in ideal
split assignment, but experiments show that using a 256 MB block size
for 128 MB row groups is better than using a 128 MB block size for 256
MB row groups, so the default value of PARQUET_NON_BLOCK_SPLIT_SIZE is
256 MB. Updated the docs accordingly.

Testing:
* Ran core tests
* Added tests to test_scanners.py

Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M docs/topics/impala_s3.xml
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M tests/query_test/test_scanners.py
7 files changed, 68 insertions(+), 4 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/14247/1
--
To view, visit http://gerrit.cloudera.org:8080/14247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I0995b2a3b732d39d6f58e9b3bb04111ac04601e6
Gerrit-Change-Number: 14247
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar