[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..

IMPALA-5842: Write page index in Parquet files

This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Reviewed-on: http://gerrit.cloudera.org:8080/9693
Reviewed-by: Zoltan Borok-Nagy 
Tested-by: Impala Public Jenkins 
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/util/CMakeLists.txt
A be/src/util/string-util-test.cc
A be/src/util/string-util.cc
A be/src/util/string-util.h
M common/thrift/parquet.thrift
M testdata/bin/load-dependent-tables.sql
M testdata/workloads/functional-query/queries/QueryTest/stats-extrapolation.test
M tests/query_test/test_chars.py
A tests/query_test/test_parquet_page_index.py
M tests/util/get_parquet_metadata.py
14 files changed, 1,002 insertions(+), 113 deletions(-)

Approvals:
  Zoltan Borok-Nagy: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 21
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 20: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 20
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 17 May 2018 20:22:00 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 20:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2493/


--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 20
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 17 May 2018 16:30:45 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-17 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 20: Code-Review+2

Fixed clang-tidy issue, adjusted stats-extrapolation test
Carrying +2


--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 20
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 17 May 2018 16:30:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-17 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer, 
Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9693

to look at the new patch set (#20).

Change subject: IMPALA-5842: Write page index in Parquet files
..

IMPALA-5842: Write page index in Parquet files

This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/util/CMakeLists.txt
A be/src/util/string-util-test.cc
A be/src/util/string-util.cc
A be/src/util/string-util.h
M common/thrift/parquet.thrift
M testdata/bin/load-dependent-tables.sql
M testdata/workloads/functional-query/queries/QueryTest/stats-extrapolation.test
M tests/query_test/test_chars.py
A tests/query_test/test_parquet_page_index.py
M tests/util/get_parquet_metadata.py
14 files changed, 1,002 insertions(+), 113 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/20
--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 20
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 19: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/2492/


--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 19
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 17 May 2018 11:37:26 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 19:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2492/


--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 19
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 17 May 2018 07:46:24 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-16 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 19: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 19
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 16 May 2018 17:01:50 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-16 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 19:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9693/18/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

http://gerrit.cloudera.org:8080/#/c/9693/18/tests/util/get_parquet_metadata.py@137
PS18, Line 137:  alr
> update comment to include that it needs to be an open file handle.
Updated the comment.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 19
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 16 May 2018 16:54:47 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-16 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9693

to look at the new patch set (#19).

Change subject: IMPALA-5842: Write page index in Parquet files
..

IMPALA-5842: Write page index in Parquet files

This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/util/CMakeLists.txt
A be/src/util/string-util-test.cc
A be/src/util/string-util.cc
A be/src/util/string-util.h
M common/thrift/parquet.thrift
M testdata/bin/load-dependent-tables.sql
M tests/query_test/test_chars.py
A tests/query_test/test_parquet_page_index.py
M tests/util/get_parquet_metadata.py
13 files changed, 995 insertions(+), 108 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/19
--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 19
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-16 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 18: Code-Review+2

(2 comments)

LGTM, had one remaining nit.

http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py@179
PS17, Line 179:
> The Python sorted() function also has a reversed flag with the same meaning
Thanks for the explanation. Lets keep it as is then.


http://gerrit.cloudera.org:8080/#/c/9693/18/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

http://gerrit.cloudera.org:8080/#/c/9693/18/tests/util/get_parquet_metadata.py@137
PS18, Line 137: file
update comment to include that it needs to be an open file handle.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 18
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 16 May 2018 16:39:16 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-15 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 18:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py@71
PS17, Line 71: if column_index_offset and column_index_length:
> It might make things faster / more elegant to pass an open file here, other
Done


http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py@126
PS17, Line 126: es = page_header.data
> Why not "not page_is_null"?
Guess I wrote too many "is None"...
Done.


http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py@179
PS17, Line 179:
> "reverse" seems to rely on everyone thinking of ascending as the default. M
The Python sorted() function also has a reversed flag with the same meaning, so 
maybe it's the Python convention. If you still think I should be more explicit, 
I'll change the name of it.

We could use BoundaryOrder, and define it to UNORDERED, but we couldn't really 
use it because BoundaryOrder says something about two lists, not just one.
I.e., testing the following in case of UNORDERED wouldn't be enough:

 is_sorted(min_values, UNORDERED) or
 is_sorted(max_values, UNORDERED)

E.g. if min_values is ascending and max_values is descending the above 
expression would evaluate to false, but the boundary order should be UNORDERED.


http://gerrit.cloudera.org:8080/#/c/9693/17/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

http://gerrit.cloudera.org:8080/#/c/9693/17/tests/util/get_parquet_metadata.py@163
PS17, Line 163:
> If you follow the suggestion in the other file to pass an open file to read
Done



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 18
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 15 May 2018 16:53:48 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-15 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9693

to look at the new patch set (#18).

Change subject: IMPALA-5842: Write page index in Parquet files
..

IMPALA-5842: Write page index in Parquet files

This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/util/CMakeLists.txt
A be/src/util/string-util-test.cc
A be/src/util/string-util.cc
A be/src/util/string-util.h
M common/thrift/parquet.thrift
M testdata/bin/load-dependent-tables.sql
M tests/query_test/test_chars.py
A tests/query_test/test_parquet_page_index.py
M tests/util/get_parquet_metadata.py
13 files changed, 993 insertions(+), 108 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/18
--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 18
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-14 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 17:

(5 comments)

Only a few minor comments, otherwise looks good to me.

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@68
PS15, Line 68:   EvalTruncation(normal_plus_max, "abcdeg", 10, UP);
> I had a similar test, when I used the '127' literal.
Thx. I don't feel strongly about using decimal vs hex literals. What I was 
aiming for was the last byte being uchar_max, and the one before being 
schar_max. LGTM now.


http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py@71
PS17, Line 71: column_index = read_serialized_object(ColumnIndex, 
parquet_file,
It might make things faster / more elegant to pass an open file here, otherwise 
you're opening the same file 3 times for each column. Feel free to ignore 
though as this is not performance critical code.


http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py@126
PS17, Line 126: page_is_null is False
Why not "not page_is_null"?


http://gerrit.cloudera.org:8080/#/c/9693/17/tests/query_test/test_parquet_page_index.py@179
PS17, Line 179: reverse
"reverse" seems to rely on everyone thinking of ascending as the default. Maybe 
rename parameter to "ascending" or "descending"? Or even pass BoundaryOrder, 
and implement UNORDERED as

  return not is_sorted(, ASCENDING) and not is_sorted(l, DESCENDING)

?


http://gerrit.cloudera.org:8080/#/c/9693/17/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

http://gerrit.cloudera.org:8080/#/c/9693/17/tests/util/get_parquet_metadata.py@163
PS17, Line 163: metadata = FileMetaData()
If you follow the suggestion in the other file to pass an open file to 
read_serialized_object, you could use it here, too.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 17
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 14 May 2018 19:44:37 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-14 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 17:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/exec/hdfs-parquet-table-writer.cc@769
PS16, Line 769: if 
(UNLIKELY(!table_sink_mem_tracker_->TryConsume(new_memory_allocation))) {
> Need to increment this after the TryConsume() succeeds, otherwise we'll rel
Ouch, right. Done.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@68
PS15, Line 68:   EvalTruncation(normal_plus_max, "abcdeg", 10, UP);
> Thanks for adding more tests. I meant the exact 3-character string consisti
I had a similar test, when I used the '127' literal.
I updated it to use your 3-character string.


http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/util/string-util.cc@46
PS16, Line 46:   // We convert it to unsigned because signed overflow results 
in undefined behavior.
> I think this could use a comment and/or an explicit cast to make the intent
Added comment and cast.


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@36
PS16, Line 36:
> update
Done


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@55
PS16, Line 55:
> nit: PEP8 says the closing """ should be on a new line. You can use flake8
Done


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@90
PS16, Line 90:   column_info = ColumnInfo(schema, stats, offset_index, 
column_index, page_headers)
> I think you should be able to just pass all the parameters to the ctor "Col
Done


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@96
PS16, Line 96: e_headers)
> "ColumnInfo objects" or "column infos"
"column infos", done.


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@97
PS16, Line 97:
> the first row group?
Done


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@110
PS16, Line 110: sert previous_loc.offset < current
> You can zip a list with itself to create (prev, cur) pairs:
Nice! Thanks, done.


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@124
PS16, Line 124: sert page
> nit: page_is_null might be more clear
Done


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@182
PS16, Line 182:   else:
> Would it make sense to validate the ordering instead of re-computing it? I.
Yeah, I think it's better to use a different algorithm with the same semantics. 
Rewrote this function.


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@214
PS16, Line 214:
> unused?
Removed it.


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@216
PS16, Line 216:  and that th
> Stale comment?
Yes, removed it.


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@249
PS16, Line 249: else:
> move closer to the comment explaining the intent
Done


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@321
PS16, Line 321:  tmp
> nit: end
Done


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@348
PS16, Line 348:
  : column =
> I had missed this looking at the truncation code. Why do we keep the zeroes
Yeah, I don't think we need those. Changed the logic.


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py@31
PS16, Line 31: ""
 :   with open(file
> That does not actually seem to be a requirement of the function. Should we
Removed the last two sentences from the comment and renamed the method to 
'read_range'.


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py@39
PS16, Line 39:   contain a serialized thrift object."""
> Mention "serialized_object" in the comment, or rename the parameter to buff
Renamed the parameter to 'serialized_object_buffer'.


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py@171
PS16, Line 171: (serialized_thrift_obj
> Can read_serialized_object also create the protocol and return i

[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-14 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9693

to look at the new patch set (#17).

Change subject: IMPALA-5842: Write page index in Parquet files
..

IMPALA-5842: Write page index in Parquet files

This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/util/CMakeLists.txt
A be/src/util/string-util-test.cc
A be/src/util/string-util.cc
A be/src/util/string-util.h
M common/thrift/parquet.thrift
M testdata/bin/load-dependent-tables.sql
M tests/query_test/test_chars.py
A tests/query_test/test_parquet_page_index.py
M tests/util/get_parquet_metadata.py
13 files changed, 988 insertions(+), 98 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/17
--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 17
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 16: Code-Review+1

(1 comment)

LGTM aside from one bug

http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/exec/hdfs-parquet-table-writer.cc@769
PS16, Line 769: page_index_memory_consumption_ += new_memory_allocation;
Need to increment this after the TryConsume() succeeds, otherwise we'll release 
too much.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 16
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 11 May 2018 00:17:47 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-09 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 16:

(20 comments)

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@1225
PS15, Line 1225:
> Switched to ++i.
Ah, I had overlooked row_group->columns[i]


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@33
PS15, Line 33: void EvalTruncation(const string& original, const string& 
expected_result,
> In the test it was easier to organize the code like this.
Let's leave it as it is, I also don't feel strongly about it.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@68
PS15, Line 68:   EvalTruncation(normal_plus_max, "abcdeg" + string(4, 0), 10, 
UP);
> Added new tess. Already had tests with 0xFF at L57-L66
Thanks for adding more tests. I meant the exact 3-character string consisting 
of {0, (char)0x7F, (char)0xFF} to make sure that truncating up would hit the 
signed char overflow. Apologies for not being more clear.


http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/util/string-util.cc@46
PS16, Line 46:   unsigned char uch = (*result)[i];
I think this could use a comment and/or an explicit cast to make the intent 
clear.


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@36
PS16, Line 36: PAGE_INDEX_TRUNCATE_LENGTH
update


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@55
PS16, Line 55: """
nit: PEP8 says the closing """ should be on a new line. You can use flake8 to 
check the whole file. It was recently added to our codebase: 
https://github.com/apache/impala/commit/2fb73f94b425fde488166a19f78050ddbc3d7b50


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@90
PS16, Line 90:   column_info = ColumnInfo._make((schema, stats, 
offset_index, column_index,
I think you should be able to just pass all the parameters to the ctor 
"ColumnInfo(schema, stats,...). Currently you wrap it in a tuple to get an 
iterable to pass to _make(), but that should not be necessary.


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@96
PS16, Line 96: nametuples
"ColumnInfo objects" or "column infos"


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@97
PS16, Line 97: a row group
the first row group?


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@110
PS16, Line 110: current_loc in page_locations[1:]:
You can zip a list with itself to create (prev, cur) pairs:

 In [6]: l = range(5)

 In [7]: zip(l[:-1], l[1:])
 Out[7]: [(0, 1), (1, 2), (2, 3), (3, 4)]


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@124
PS16, Line 124: null_page
nit: page_is_null might be more clear


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@182
PS16, Line 182: previous_max = None
Would it make sense to validate the ordering instead of re-computing it? I.e., 
extract all the min_values and max_values into lists and then just all 
assert(sorted(min_values))?


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@214
PS16, Line 214: sorting_column
unused?


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@216
PS16, Line 216: skip_col_idx
Stale comment?


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@249
PS16, Line 249: vector.get_value('exec_option')['num_nodes'] = 1
move closer to the comment explaining the intent


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@321
PS16, Line 321: ends
nit: end


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@348
PS16, Line 348: followed by
  : # zeroes.
I had missed this looking at the truncation code. Why do we keep the zeroes at 
the end?


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py@31
PS16, Line 31: The range must start with a serialized
 :   thrift object.
That does not actually seem to be a requirement of th

[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-09 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 15:

(29 comments)

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h@108
PS15, Line 108: PAGE_INDEX_TRUNCATE_LENGTH
> PAGE_INDEX_MAX_STRING_LENGTH?
Changed the name


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h@130
PS15, Line 130: Write
> nit: "Writes"
Done, also fixed the previous and the next comment as well.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@48
PS15, Line 48: using namespace parquet;
> Can you remove this? I find omitting the parquet:: prefix confusing, and we
I agree, done.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@109
PS15, Line 109: HdfsSinkTable
> HdfsTableSink?
Done


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@118
PS15, Line 118: 
table_sink_mem_tracker_->Release(page_index_memory_consumption_);
> Can we do this in Close() instead? We have generally been trying to do reso
Done


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@312
PS15, Line 312:   OffsetIndex offset_index_;
> I find it useful to prefix Parquet types with parquet:: and we do that in p
I'm also in favor of prefixing. Done.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@318
PS15, Line 318:   MemTracker *table_sink_mem_tracker_;
> nit: MemTracker*
Done


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@651
PS15, Line 651:   location.compressed_page_size = -1;
> Probably doesn't have much of an effect, but should we set the size to 0 fo
Yeah, it makes more sense.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@668
PS15, Line 668: location.compressed_page_size = 
page.header.compressed_page_size + len;
> Should we add a comment for this line to explain that one includes the leng
Added a comment.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@752
PS15, Line 752:   string min_val, max_val;
> I think we usually do one declaration per line, but I don't feel strongly a
Done


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@754
PS15, Line 754: Status s_min = TruncateMinValue(page_stats.min_value, 
PAGE_INDEX_TRUNCATE_LENGTH,
> Could this truncate non-string values if PAGE_INDEX_TRUNCATE_LENGTH was sma
Added a comment.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@761
PS15, Line 761: column_index_.null_pages.push_back(true);
> Maybe DCHECK that both are false, to catch errors that result in only one b
Added DCHECK.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@766
PS15, Line 766: Consume
> We prefer TryConsume() and returning a MEM_LIMIT_EXCEEDED error synchronous
Switched to TryConsume() and check its result.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@1220
PS15, Line 1220:   // Currently we only support Parquet files with one row 
group.
> Can you add a sentence why this particular code relies on this limited supp
Updated the comment.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@1225
PS15, Line 1225: ++)
> nit: ++i
Switched to ++i.
I need 'i' to index 'columns_' and 'row_group->columns'.
The same stands for the next for-loop.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@88
PS15, Line 88:   if (a != a && b != b) return 0;
> Can we use std::isnan()?
Yes, done.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@113
PS15, Line 113: Used for merging page statistics into column chunk statistics.
> I think it's better to omit usage from the comment here. Instead say that i
Changed the added comment to your sentence.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@164
PS15, Line 164: Let's a
> nit: "We assume..." so it reads less like a suggestions.
Done


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@165
PS15, Line 165: If not all values are equal
> What happens for NaN, i.e. when all values are NaN?
In ColumnStats::Merge(), MinMaxTrait::Compare() will compare the NaNs, 
and the cu

[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-09 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9693

to look at the new patch set (#16).

Change subject: IMPALA-5842: Write page index in Parquet files
..

IMPALA-5842: Write page index in Parquet files

This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/util/CMakeLists.txt
A be/src/util/string-util-test.cc
A be/src/util/string-util.cc
A be/src/util/string-util.h
M common/thrift/parquet.thrift
M testdata/bin/load-dependent-tables.sql
M tests/query_test/test_chars.py
A tests/query_test/test_parquet_page_index.py
M tests/util/get_parquet_metadata.py
13 files changed, 999 insertions(+), 98 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/16
--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 16
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-08 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 15:

(25 comments)

Still need to do a pass over the tests.

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h@108
PS15, Line 108: PAGE_INDEX_TRUNCATE_LENGTH
PAGE_INDEX_MAX_STRING_LENGTH?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h@130
PS15, Line 130: Write
nit: "Writes"


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@48
PS15, Line 48: using namespace parquet;
Can you remove this? I find omitting the parquet:: prefix confusing, and we 
often explicitly specify it below.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@312
PS15, Line 312:   OffsetIndex offset_index_;
I find it useful to prefix Parquet types with parquet:: and we do that in parts 
of this file already. Can you add it in places where it's missing and remove 
the "using" above? Let me know if you feel we shouldn't prefix them.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@651
PS15, Line 651:   location.compressed_page_size = -1;
Probably doesn't have much of an effect, but should we set the size to 0 for an 
empty page?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@668
PS15, Line 668: location.compressed_page_size = 
page.header.compressed_page_size + len;
Should we add a comment for this line to explain that one includes the length 
and the other doesn't, to make the intent explicit?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@752
PS15, Line 752:   string min_val, max_val;
I think we usually do one declaration per line, but I don't feel strongly about 
it.

It also might be safer to enclose this block in a nested scope to prevent 
future editors from re-using min_val and max_val after they have been moved 
below.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@754
PS15, Line 754: Status s_min = TruncateMinValue(page_stats.min_value, 
PAGE_INDEX_TRUNCATE_LENGTH,
Could this truncate non-string values if PAGE_INDEX_TRUNCATE_LENGTH was small 
enough? If so, we should add a check that it isn't, or at least add a comment 
to the definition of PAGE_INDEX_TRUNCATE_LENGTH.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@761
PS15, Line 761: column_index_.null_pages.push_back(true);
Maybe DCHECK that both are false, to catch errors that result in only one being 
set?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@1220
PS15, Line 1220:   // Currently we only support Parquet files with one row 
group.
Can you add a sentence why this particular code relies on this limited support?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@1225
PS15, Line 1225: ++)
nit: ++i

Or use a range based loop like you do in L1254?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@88
PS15, Line 88:   if (a != a && b != b) return 0;
Can we use std::isnan()?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@113
PS15, Line 113: Used for merging page statistics into column chunk statistics.
I think it's better to omit usage from the comment here. Instead say that it 
maintains internal state that tracks whether the values are ordered.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@164
PS15, Line 164: Let's a
nit: "We assume..." so it reads less like a suggestions.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@165
PS15, Line 165: If not all values are equal
What happens for NaN, i.e. when all values are NaN?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.inline.h@54
PS15, Line 54: if (MinMaxTrait::Compare(prev_page_max_value_, 
cs->max_value_) > 0 ||
I cannot convince myself that prev_page_max_value_ has to be initialized here. 
Is it?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.inline.h@66
PS15, Line 66: Update(cs->min_value_, cs->max_value_);
If you do the order checks after this line, 

[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-08 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 15:

(5 comments)

Thanks for your patience. I'm read to +2 after a couple of tweaks to the 
MemTracker stuff.

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@109
PS15, Line 109: HdfsSinkTable
HdfsTableSink?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@118
PS15, Line 118: 
table_sink_mem_tracker_->Release(page_index_memory_consumption_);
Can we do this in Close() instead? We have generally been trying to do resource 
releasing in Close() so that the sequencing of resource release is more 
explicit (goes against RAII but there are pros and cons of that).


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@318
PS15, Line 318:   MemTracker *table_sink_mem_tracker_;
nit: MemTracker*


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@766
PS15, Line 766: Consume
We prefer TryConsume() and returning a MEM_LIMIT_EXCEEDED error synchronously 
so that we don't depend on the memory limit being enforced asynchronously.

HdfsTableSink is a bad offender in terms of using Consume(), which we need to 
fix at some point.


http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql
File testdata/bin/load-dependent-tables.sql:

http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql@77
PS13, Line 77: (cs CHAR(5), cl CHAR(140), vc VARCHAR(32))
> My bad, I thought if I run a complete data load after formatting, it will e
I ended up running these statements manually via copy-and-paste into beeline. I 
think this should run during pre-commit if we do a full data load.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 15
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 08 May 2018 22:14:41 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-07 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 13:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1195
PS2, Line 1195:   // Populate RowGroup::sorting_columns with all columns 
specified by the Frontend.
> I didn't follow up on this. It looks like the code only ever writes one row
Unfortunately it wouldn't work, because based on the description of the Parquet 
page indexes 
(https://github.com/apache/parquet-format/blob/master/PageIndex.md) we should 
write all the page indexes of all the row groups together, just before the 
footer. In other words, we have to flush all the row groups, then we can write 
all the page indexes.

Currently Impala writes only one row group per Parquet file, so I just add a 
DCHECK here. If Impala's behavior changes in the future in this regard, we'll 
need to update the write path of the page indexes.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@107
PS13, Line 107:   page_index_mem_tracker_(-1, "Parquet Page Index",
> We can have many HdfsTableWriters per HdfsTableSink if we're doing non-clus
I track directly against HdfsTableSink.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1212
PS13, Line 1212:   if (file_metadata_.row_groups.size() != 1) return 
Status::OK();
> This
Answered it at PS2.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1214
PS13, Line 1214:   parquet::RowGroup* row_group = 
&(file_metadata_.row_groups[0]);
> Can't we use current_row_group_ here?
We can't, detailed answer is at PS2.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1216
PS13, Line 1216:   for (int i = 0; i < columns_.size(); i++) {
> I injected an error here by forgetting to write out one of the pages and th
Thanks for checking.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1241
PS13, Line 1241: row_group->columns[i].__set_offset_index_offset(file_pos_);
> I added some errors to the offsets and lengths - looks like the tests caugh
Thanks


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@31
PS13, Line 31:   ascending_boundary_order_ = true;
> I tried injecting a bug by initializing these to "false". The tests didn't
Currently we don't invoke Reset() on the ColumnStatsBase object where it would 
matter.

In HdfsParquetTableWriter::BaseColumnWriter there are two ColumnStatsBase 
members: page_stats_base_ and row_group_stats_base_.

For each new page, we invoke Reset() on page_stats_base_, but that doesn't 
corrupt the ordering, because we only consider the state of 
row_group_stats_base_ for that, and we never call Reset() on it.

Even in BaseColumnWriter::Reset(), we throw away the old row_group_stats_base_ 
and create a brand new one.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@56
PS13, Line 56: false
> I tried injecting a bug by flipping the value of this and the tests caught
Thanks!


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@60
PS13, Line 60: if (MinMaxTrait::Compare(prev_page_max_value_, 
cs->max_value_) < 0 ||
> I tried injecting a bug by removing the first condition and the tests caugh
Thanks!


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@68
PS13, Line 68: max_value_
> I tried switching this to cs->min_value. The tests caught it.
Thanks!


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@69
PS13, Line 69: prev_page_min_buffer_.Clear();
> I tried removing this line. The tests didn't catch it, even under ASAN.
Turned out that the _validate_ordering function was too loose in 
test_parquet_page_index.py. We could remove both prev_page_min_buffer_ and 
prev_page_max_buffer_ without noticing it.

Now _validate_ordering implements the same logic from parquet-column-stats to 
determine the expected ordering.

Interestingly the tests only catch the case when we remove 
prev_page_max_buffer_.Clear(). The lack of prev_page_min_buffer.Clear() 
remained unnoticed. I tried to run the tests on a couple of existing tables, 
but without any luck.

The reason is a bit subtle, basically prev_page_min_value_.ptr is "lucky" 
enough to point at memory addr

[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-07 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9693

to look at the new patch set (#15).

Change subject: IMPALA-5842: Write page index in Parquet files
..

IMPALA-5842: Write page index in Parquet files

This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/util/CMakeLists.txt
A be/src/util/string-util-test.cc
A be/src/util/string-util.cc
A be/src/util/string-util.h
M common/thrift/parquet.thrift
M testdata/bin/load-dependent-tables.sql
M tests/query_test/test_chars.py
A tests/query_test/test_parquet_page_index.py
M tests/util/get_parquet_metadata.py
13 files changed, 923 insertions(+), 48 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/15
--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 15
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-05-07 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9693

to look at the new patch set (#14).

Change subject: IMPALA-5842: Write page index in Parquet files
..

IMPALA-5842: Write page index in Parquet files

This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/util/CMakeLists.txt
A be/src/util/string-util-test.cc
A be/src/util/string-util.cc
A be/src/util/string-util.h
M common/thrift/parquet.thrift
M testdata/bin/load-dependent-tables.sql
M tests/query_test/test_chars.py
A tests/query_test/test_parquet_page_index.py
M tests/util/get_parquet_metadata.py
13 files changed, 923 insertions(+), 48 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/14
--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 14
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-04-25 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 13:

(15 comments)

I spent a bit of time trying to break the code by injecting bugs and it looks 
like the test coverage is pretty good - there were a couple of things it didn't 
catch though that might be worth investigating a little.

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1195
PS2, Line 1195:   // Populate RowGroup::sorting_columns with all columns 
specified by the Frontend.
> AFAIK we always write only one row group. I just copy-pasted this part from
I didn't follow up on this. It looks like the code only ever writes one row 
group but it's somewhat generalized to handle multiple row groups. I think 
people might have used this in the past to generate custom files. If we ever 
wrote out multiple row groups, we'd have to test it properly.

In the meantime I think we could remove this if and just use current_row_group_ 
below instead of hardcoding [0] and it would all work ok.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@107
PS13, Line 107:   page_index_mem_tracker_(-1, "Parquet Page Index",
We can have many HdfsTableWriters per HdfsTableSink if we're doing non-cluster 
inserts.

As a rule of thumb, I think we should have O(1) MemTrackers per ExecNode or 
DataSink - O(n) or greater could cause problems. Otherwise the MemTracker dumps 
could get huge and cause problems if we want to serialize them into a string.

We could have one "Parquet Page Indices" MemTracker per HdfsTableSink or we 
could just track directly against the HdfsTableSink MemTracker.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1212
PS13, Line 1212:   if (file_metadata_.row_groups.size() != 1) return 
Status::OK();
This


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1214
PS13, Line 1214:   parquet::RowGroup* row_group = 
&(file_metadata_.row_groups[0]);
Can't we use current_row_group_ here?


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1216
PS13, Line 1216:   for (int i = 0; i < columns_.size(); i++) {
I injected an error here by forgetting to write out one of the pages and the 
tests caught it.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/hdfs-parquet-table-writer.cc@1241
PS13, Line 1241: row_group->columns[i].__set_offset_index_offset(file_pos_);
I added some errors to the offsets and lengths - looks like the tests caught it.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@31
PS13, Line 31:   ascending_boundary_order_ = true;
I tried injecting a bug by initializing these to "false". The tests didn't 
catch it. Would be good to add the coverage but this seems less bad of a 
coverage gap than the opposite (claiming an order that is not valid).


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@56
PS13, Line 56: false
I tried injecting a bug by flipping the value of this and the tests caught it.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@60
PS13, Line 60: if (MinMaxTrait::Compare(prev_page_max_value_, 
cs->max_value_) < 0 ||
I tried injecting a bug by removing the first condition and the tests caught it.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@68
PS13, Line 68: max_value_
I tried switching this to cs->min_value. The tests caught it.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/exec/parquet-column-stats.inline.h@69
PS13, Line 69: prev_page_min_buffer_.Clear();
I tried removing this line. The tests didn't catch it, even under ASAN.


http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/9693/13/be/src/util/string-util.cc@29
PS13, Line 29: int
nit: use static_cast<>


http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql
File testdata/bin/load-dependent-tables.sql:

http://gerrit.cloudera.org:8080/#/c/9693/13/testdata/bin/load-dependent-tables.sql@77
PS13, Line 77: WITH DEFERRED REBUILD IN TABLE hive_index_tbl
missing trailing semicolon here and on some of the "DROP TABLE" statements 
below -I tried to run this file locally and it failed.


http://gerrit.cloudera.org:8080/#/c/9693/13/te

[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-04-25 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 13:

(5 comments)

Thanks, Csaba!
Yes, we have tests for the edge cases in string-util-test.cc. And also added a 
new python test function 'test_max_string_values'

http://gerrit.cloudera.org:8080/#/c/9693/12/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/12/be/src/exec/hdfs-parquet-table-writer.cc@87
PS12, Line 87: // Base class for column writers. This contains most of the 
logic except for
> If different value is used for page and column stats, then the constants sh
Moved it to HdfsParquetTableWriter and renamed it to PAGE_INDEX_TRUNCATE_LENGTH.


http://gerrit.cloudera.org:8080/#/c/9693/12/be/src/exec/hdfs-parquet-table-writer.cc@749
PS12, Line 749: Status s_max = TruncateMaxValue(page_stats.max_value, 
PAGE_INDEX_TRUNCATE_LENGTH,
  : &max_val);
  : if (!s_min.ok() || !s_max.ok()) valid_page_index_ = false;
  : column_index_.
> I can't find TruncateMinValue and TruncateMaxValue in code.
Sorry, I've uploaded it now.


http://gerrit.cloudera.org:8080/#/c/9693/12/be/src/exec/hdfs-parquet-table-writer.cc@1221
PS12, Line 1221: // We always set null_counts.
> nit: long line
Done


http://gerrit.cloudera.org:8080/#/c/9693/12/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/12/tests/query_test/test_parquet_page_index.py@171
PS12, Line 171: if (isinstance(page_max_value, basestring) and
> The != should be < - what happens if column_max_value (e.g. "bb") is shorte
Oops, you are absolutely right.
I use the truncate length now.


http://gerrit.cloudera.org:8080/#/c/9693/12/tests/query_test/test_parquet_page_index.py@172
PS12, Line 172: len(page_max_value) == PAGE_INDEX_TRUNCATE_LENGTH):
> This may not be true if the last two characters contain the max char value.
Fixed that as well.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 13
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 25 Apr 2018 10:06:14 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-04-25 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9693

to look at the new patch set (#13).

Change subject: IMPALA-5842: Write page index in Parquet files
..

IMPALA-5842: Write page index in Parquet files

This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/util/CMakeLists.txt
A be/src/util/string-util-test.cc
A be/src/util/string-util.cc
A be/src/util/string-util.h
M common/thrift/parquet.thrift
M testdata/bin/load-dependent-tables.sql
M tests/query_test/test_chars.py
A tests/query_test/test_parquet_page_index.py
M tests/util/get_parquet_metadata.py
13 files changed, 864 insertions(+), 47 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/13
--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 13
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-04-24 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 12:

(5 comments)

Do we have tests for the edge cases? For example a table with columns of 
63/64/65 length strings.

http://gerrit.cloudera.org:8080/#/c/9693/12/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/12/be/src/exec/hdfs-parquet-table-writer.cc@87
PS12, Line 87: const int MAX_STAT_VALUE_LENGTH = 64;
If different value is used for page and column stats, then the constants should 
be placed close to each other and their name should reflect their role.


http://gerrit.cloudera.org:8080/#/c/9693/12/be/src/exec/hdfs-parquet-table-writer.cc@749
PS12, Line 749: Status s_min = TruncateMinValue(page_stats.min_value, 
MAX_STAT_VALUE_LENGTH,
  : &min_val);
  : Status s_max = TruncateMaxValue(page_stats.max_value, 
MAX_STAT_VALUE_LENGTH,
  : &max_val);
I can't find TruncateMinValue and TruncateMaxValue in code.


http://gerrit.cloudera.org:8080/#/c/9693/12/be/src/exec/hdfs-parquet-table-writer.cc@1221
PS12, Line 1221: 
column.column_index_.__set_boundary_order(column.row_group_stats_base_->GetBoundaryOrder());
nit: long line


http://gerrit.cloudera.org:8080/#/c/9693/12/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/12/tests/query_test/test_parquet_page_index.py@171
PS12, Line 171: len(page_max_value) != len(column_max_value)):
The != should be < - what happens if column_max_value (e.g. "bb") is shorter 
then page_max_value (e.g "")?

I would also consider using the exact max length used for page stats, to make 
the tests as strict as possible.


http://gerrit.cloudera.org:8080/#/c/9693/12/tests/query_test/test_parquet_page_index.py@172
PS12, Line 172:   assert page_max_value[:-1] <= 
column_max_value[:len(page_max_value) - 1]
This may not be true if the last two characters contain the max char value.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 12
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 24 Apr 2018 10:42:46 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-04-23 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 10:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@301
PS10, Line 301:   std::vector min_values_;
> Options 1 or 2 seem fine to me.
Chose option 1, but also added a memtracker.


http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@735
PS10, Line 735: min_values_.push_back(std::string(""));
> I don't know if we need the call to std::string() here, I think it should w
Right. Now the code looks a bit different, and I call push_back after the 'if'.


http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@1227
PS10, Line 1227:   for (auto& column : columns_) {
> nit: can fit loop on one line.
Done


http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/parquet-column-stats.h@159
PS10, Line 159:   // If true, min/max values are ascending.
> Maybe briefly mention why they both start off true? And both can be true at
Added some explanation here and to GetBoundaryOrder().


http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@37
PS10, Line 37: class TestHdfsParquetTableIndexWriter(ImpalaTestSuite):
> We've got a lot of good coverage in this test.
I agree. It would be best to have a fully-fledged parquet reader implementation 
in Python that we could easily use.
I only found https://pypi.org/project/parquet/ but it seems uncomplete.

Or, maybe we can hard-code the expected values for a concrete parquet file.


http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@177
PS10, Line 177: previouse_value
> typo in variable name
Done


http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@205
PS10, Line 205: falied
> nit: failed
Done


http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@205
PS10, Line 205: column_info_schema
> this variable isn't defined - did you mean column_info?
Done


http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@244
PS10, Line 244: chars_formats
> chars_formats is weird in that it's created by a different test - TestChars
I moved table creation to 'testdata/bin/load-dependent-tables.sql'.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 23 Apr 2018 15:50:20 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-04-23 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9693

to look at the new patch set (#12).

Change subject: IMPALA-5842: Write page index in Parquet files
..

IMPALA-5842: Write page index in Parquet files

This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/util/CMakeLists.txt
M common/thrift/parquet.thrift
M testdata/bin/load-dependent-tables.sql
M tests/query_test/test_chars.py
A tests/query_test/test_parquet_page_index.py
M tests/util/get_parquet_metadata.py
10 files changed, 641 insertions(+), 47 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/12
--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 12
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-04-19 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@301
PS10, Line 301:   std::vector min_values_;
> Could we not just track the size of page_stats.min_value in a memtracker in
Options 1 or 2 seem fine to me.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 19 Apr 2018 16:13:21 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-04-16 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@301
PS10, Line 301:   std::vector min_values_;
> I had a discussion with the Parquet guys who implement the page indexes for
Could we not just track the size of page_stats.min_value in a memtracker in 
FinalizeCurrentPage when adding elements? Then we can free it after flushing 
out the row group.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 16 Apr 2018 16:34:20 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-04-16 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@301
PS10, Line 301:   std::vector min_values_;
> I'm still concerned about the amount of untracked memory from min_values_ a
I had a discussion with the Parquet guys who implement the page indexes for 
parquet-mr and they plan to truncate the min/max values at 64 bytes. It seems 
reasonable because it is unlikely that users want to filter against very long 
string data.

Maybe we could also use 64 bytes by default and let the users override it with 
an advanced query option.
(specifying different sizes for different columns would make more sense, but I 
have no idea how to do that)

If we go with this default, do you still think it will end up using too much 
untracked memory?

As you wrote, if we start using StringValue/StringBuffer, we will need to 
convert min/max values to vectors of strings when we write out the column 
index, which is a lot of unnecessary copying and a short temporary peak of 
untracked memory.
Or, to avoid the untracked memory peak, we can implement our own ColumnIndex 
class that uses StringBuffers and has a 'write(TProtocol *prot)' method. I 
think we can't avoid copying everything, because the TProtocol class only 
accepts std::strings for writing binary data.

To summarize, I can only see 3 options, all with some drawbacks:

* use std::strings

pro: code is straight-forward
pro: no copying
con: lot of untracked memory

* use StringBuffers

pro: avoids long-living untracked memory
con: lots of copying
con: short-living untracked memory

* implement our own ColumnIndex class

pro: avoids untracked memory
con: lots of copying
con: need to maintain that class

Am I missing something? Which option do you prefer?



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 16 Apr 2018 13:48:20 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-04-11 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 10:

(9 comments)

Overall this is looking good. I had some specific concerns about some of the 
nitty-gritty details.

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@301
PS10, Line 301:   std::vector min_values_;
I'm still concerned about the amount of untracked memory from min_values_ and 
max_values_, even if we truncate the string values to 1KB or similar - it seems 
like could end up with multiple MB of untracked memory. We could probably live 
with it since it's smaller than the actual data, but it's a step in the wrong 
direction.

Maybe we could store min_values_ and max_values_ as StringValues backed by 
memory per_file_mem_pool_ and then only convert to strings when writing out 
each column to the page index?


http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@735
PS10, Line 735: min_values_.push_back(std::string(""));
I don't know if we need the call to std::string() here, I think it should work 
if we just emplace_back() to instantiate an empty string.


http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@1227
PS10, Line 1227:   for (auto& column : columns_) {
nit: can fit loop on one line.


http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/parquet-column-stats.h@159
PS10, Line 159:   // If true, min/max values are ascending.
Maybe briefly mention why they both start off true? And both can be true at the 
same time? It's slightly subtle.


http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@37
PS10, Line 37: class TestHdfsParquetTableIndexWriter(ImpalaTestSuite):
We've got a lot of good coverage in this test.

I'm wondering if we're missing some basic tests that confirm that the values in 
the page match the min/max values in the page index. It seems like these 
validations might not catch some kinds of bugs. E.g. min/max values in the 
index are somehow out-of-sync with the pages. Most bugs that I can imagine 
would get caught by one validation or another but it would be nice to have a 
sanity test where we confirm that the values in each page match the values in 
the page index.


http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@177
PS10, Line 177: previouse_value
typo in variable name


http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@205
PS10, Line 205: falied
nit: failed


http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@205
PS10, Line 205: column_info_schema
this variable isn't defined - did you mean column_info?


http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@244
PS10, Line 244: chars_formats
chars_formats is weird in that it's created by a different test - 
TestCharsFormats. I.e. it's not present unless that test ran before this one. 
Maybe we should change it so that the table is loaded during normal data 
loading?



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 12 Apr 2018 00:23:08 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-29 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector null_pages_;
> About the formula for prefix size: what would happen, if there is no common
The formula was only a conversation-starter, but the situation is actually 
worse. On yesterday's Parquet sync meeting it turned out that it is a bad idea 
to have different limits for each page, here's why:

Let's assume we have the following pages:

 Page 0:
 12345612
 12345613
 12345614
 min_value: 12345612
 max_value: 12345614
 Page 1:
 12345615
 23456789
 min_value: 123
 max_value: 235

Note that the values are in ascending order, but Page 0's min value is greater 
than Page 1's min_value.


I'll investigate the second part of your comment.


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163
PS6, Line 163:
 :
> I was thinking about catching the exception in _validate_parquet_page_index
I added the column name. I might add the filename as well in a later commit.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 30 Mar 2018 00:05:25 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-29 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9693

to look at the new patch set (#9).

Change subject: IMPALA-5842: Write page index in Parquet files
..

IMPALA-5842: Write page index in Parquet files

This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M common/thrift/parquet.thrift
A tests/query_test/test_parquet_page_index.py
M tests/util/get_parquet_metadata.py
7 files changed, 570 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/9
--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-29 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 8:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector null_pages_;
> About leaving max_value unset: we cannot really do that, because 'max_value
About the formula for prefix size: what would happen, if there is no common 
prefix? If num_values is small (like 64 in the plain encoded 1k length case), 
the result would be 0.

About the untracked memory usage: the the min/max values could be collected by 
page_stats_base_, which knows the type of the column, so it wouldn't need to 
store fixed length data as string. When a data page is over, it could write its 
min/max value to a buffer. The min/max values would still had to be converted 
to string for thrift serialization, but columns could do this one-by-one, so 
their peak heap usage wouldn't be added together.


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163
PS6, Line 163:   if not null_page:
 : page_max_value = decod
> I added the column name to this specific error message.
I was thinking about catching the exception in _validate_parquet_page_index, 
adding the column name to its message or printing it, and then re-throwing the 
AssertionError.

The same could be done for page index by using an iterator that is held in a 
member / argument, so its state is still intact in the caller function when the 
exception is caught, and using it in every loop to iterate over pages.

The iterator stuff does probably not worth the effort, but knowing the name of 
the offending column could be really helpful if the test catches an error.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 8
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 29 Mar 2018 19:44:57 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-28 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163
PS6, Line 163: for null_page in null_pages:
 :   assert null_page
> It would be nice to always add the column name and the page number (if appl
I added the column name to this specific error message.

Adding that information to all the asserts would need a lot of code 
modifications and I'm not sure that it would worth the effort.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 28 Mar 2018 18:06:35 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-28 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9693

to look at the new patch set (#8).

Change subject: IMPALA-5842: Write page index in Parquet files
..

IMPALA-5842: Write page index in Parquet files

This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M common/thrift/parquet.thrift
A tests/query_test/test_parquet_page_index.py
M tests/util/get_parquet_metadata.py
7 files changed, 566 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/8
--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 8
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-28 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@131
PS6, Line 131:
> You meant validate_page_locations(), right?
No, I meant the whole body of the for loop, but I am satisfied with the new 
structure, the code looks easier to read to me. :)


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163
PS6, Line 163:   if not null_page:
 : page_max_value = decod
> OK, used enumerate, and added a message to the assert.
It would be nice to always add the column name and the page number (if 
applicable) to the error message, but I do not know how do to do that without a 
lot of extra code.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 28 Mar 2018 15:53:42 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-27 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 6:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector null_pages_;
> If all strings in a page have the same 1KB prefix, then performance and spa
About leaving max_value unset: we cannot really do that, because 'max_values' 
is a required list of binary data. We can only set it to empty string, but an 
empty string can also be a valid value. And an empty string is less than all 
strings, so not really a good choice.


About determining the prefix size: why don't we calculate it dynamically based 
on the actual data and with some heuristic? E.g for max_value.:

 min_element(
   length(max_value),
   length(common_prefix(min_value, max_value)) * 1.5 + num_values / 100,
   4 kB
 )


http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/hdfs-parquet-table-writer.cc@727
PS6, Line 727:   // max_values_ and mark the values as valid. In case min and 
max values are not
> It is a bit misleading to by talking about marking the min/max values valid
Yeah, I agree that the comment is misleading, so I updated it.

Currently the write path logic is reversed however. We mark the page as null if 
it doesn't have min and max values, and we mark it as not-null if it has min 
and max values as well.


http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/parquet-column-stats.inline.h@54
PS6, Line 54: if (MinMaxTrait::Compare(prev_page_max_value_, 
cs->max_value_) == 1 ||
> I know it is not Java but it might be readable to use a java-like contract
I changed the ifs to use < and > against 0 to allow some flexibility for the 
Compare() implementations.

However, for now I leave the definition of Compare as it is, because I can't 
use a - b for string values, therefore I'd need to write one more template 
specialization for strings.


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@131
PS6, Line 131:   for column_info in columns:
> The next block is a good candidate to move to a separate function.
You meant validate_page_locations(), right?


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@134
PS6, Line 134: previous_loc = column_info.offset_index.page_locations[0]
 : for current_loc in 
column_info.offset_index.page_locations[1:]:
 :   assert previous_loc.offset < current_loc.offset
 :   assert previous_loc.first_row_index < 
current_loc.first_row_index
 :   previous_loc = current_loc
> I would move this to a separate function like "validate_page_locations".
Done


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@139
PS6, Line 139: null_pages = column_info.column_index.null_pages
> The next empty line could be moved before this line to group it together wi
I refactored a lot, this comment doesn't really apply anymore.


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@157
PS6, Line 157: if column_info.stats:
> This is more subjective than the comment in line 131, but I would move the
I created validate functions for the members of ColumnIndex and OffsetIndex.


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163
PS6, Line 163: for null_page in null_pages:
 :   assert null_page
> Python has any() and all(). However, it might actually be better to use enu
OK, used enumerate, and added a message to the assert.


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@166
PS6, Line 166: return
> This "return" should be "continue", but it would be even better to move to
Right, now it's in a different function, called _validate_min_max_values


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@169
PS6, Line 169:   for null_page, value in zip(null_pages, min_values):
 : if not null_page:
 :   current_value = 
decode_stats_value(column_info.schema, value)
 :   assert current_value >= min_value
  

[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-27 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Anonymous Coward #248, Tim Armstrong, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9693

to look at the new patch set (#7).

Change subject: IMPALA-5842: Write page index in Parquet files
..

IMPALA-5842: Write page index in Parquet files

This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M common/thrift/parquet.thrift
A tests/query_test/test_parquet_page_index.py
M tests/util/get_parquet_metadata.py
7 files changed, 566 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/7
--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-27 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 6:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector null_pages_;
> About long strings: I think that it would be safer to start with a smaller
If all strings in a page have the same 1KB prefix, then performance and space 
consumption is still likely to be dominated by those. If we want to cap at a 
small value, we will face this for much less extreme workloads and not writing 
column stats becomes more of an issue. If you want to dive into this, please do 
comprehensive experiments to determine a good size.


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163
PS6, Line 163: for null_page in null_pages:
 :   assert null_page
> This could be "assert False not in null_pages". Creating a variable to stor
Python has any() and all(). However, it might actually be better to use 
enumerate() and print the index if a page is not null to make debugging easier.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 27 Mar 2018 17:49:05 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-27 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 6:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@131
PS6, Line 131:   for column_info in columns:
The next block is a good candidate to move to a separate function.


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@134
PS6, Line 134: previous_loc = column_info.offset_index.page_locations[0]
 : for current_loc in 
column_info.offset_index.page_locations[1:]:
 :   assert previous_loc.offset < current_loc.offset
 :   assert previous_loc.first_row_index < 
current_loc.first_row_index
 :   previous_loc = current_loc
I would move this to a separate function like "validate_page_locations".


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@139
PS6, Line 139: null_pages = column_info.column_index.null_pages
The next empty line could be moved before this line to group it together with 
the related asserts.


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@157
PS6, Line 157: if column_info.stats:
This is more subjective than the comment in line 131, but I would move the next 
block to a separate function like "validate_index_stats_consistency".


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163
PS6, Line 163: for null_page in null_pages:
 :   assert null_page
This could be "assert False not in null_pages". Creating a variable to store 
whether all pages are null could be useful to check that if all pages are null, 
the min and max stats should not be set.


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@166
PS6, Line 166: return
This "return" should be "continue", but it would be even better to move to 
implement the suggestion in line 131 and keep this as a "return".


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@169
PS6, Line 169:   for null_page, value in zip(null_pages, min_values):
 : if not null_page:
 :   current_value = 
decode_stats_value(column_info.schema, value)
 :   assert current_value >= min_value
 :
 :   max_value = decode_stats_value(column_info.schema, 
max_value_str)
 :   for null_page, value in zip(null_pages, max_values):
 : if not null_page:
 :   current_value = 
decode_stats_value(column_info.schema, value)
 :   assert current_value <= max_value
These two for loops could be merged. I would also prefer more exact variable 
names, like min/max_value_in_page instead of current_value, min/max_value in 
column_chunk instead of min/max_value.

Another test ideas:

- the sum of page null counts should be equal to the total null count

- in most cases the minimum/maximum among the page min/max values should be 
equal to the column chunk's min/max value. As all the tests are checking tables 
created by Impala, we can be stricter/more specific than the Praquet spec. 
Special columns where this is not true could be listed, or their expected 
values could be set in a map to check the logic behind these values.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 27 Mar 2018 16:15:51 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-27 Thread Anonymous Coward (Code Review)
Anonymous Coward #248 has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 6:

(2 comments)

I have 2 minor findings only. Otherwise, LGTM. (I'm not an Impala nor a C/C++ 
expert, though.)

http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/hdfs-parquet-table-writer.cc@727
PS6, Line 727:   // max_values_ and mark the values as valid. In case min and 
max values are not
It is a bit misleading to by talking about marking the min/max values 
valid/invalid. It is about the null pages which case the min/max values do not 
make sense so we store empty strings instead.


http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/parquet-column-stats.inline.h@54
PS6, Line 54: if (MinMaxTrait::Compare(prev_page_max_value_, 
cs->max_value_) == 1 ||
I know it is not Java but it might be readable to use a java-like contract for 
compare. It says compare(a, b) < 0 if a < b etc.
It also might have performance benefit in some cases if you simply return a - b 
instead of if-else.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 27 Mar 2018 12:32:29 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-26 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 5:

(1 comment)

Thanks, I fixed a few other comments as well.

http://gerrit.cloudera.org:8080/#/c/9693/5/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

http://gerrit.cloudera.org:8080/#/c/9693/5/be/src/exec/parquet-column-stats.h@162
PS5, Line 162: i
> nit: Capital I
Done



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 27 Mar 2018 00:41:02 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-26 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Tim Armstrong, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9693

to look at the new patch set (#6).

Change subject: IMPALA-5842: Write page index in Parquet files
..

IMPALA-5842: Write page index in Parquet files

This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M common/thrift/parquet.thrift
A tests/query_test/test_parquet_page_index.py
M tests/util/get_parquet_metadata.py
7 files changed, 547 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/6
--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-26 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9693/5/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

http://gerrit.cloudera.org:8080/#/c/9693/5/be/src/exec/parquet-column-stats.h@162
PS5, Line 162:
nit: Capital I



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 26 Mar 2018 19:17:16 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-26 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 5:

(16 comments)

Thanks Csaba!
(and I also updated my vimrc... :) )

http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.h@159
PS4, Line 159:   // If true, min/max values are ascending.
> nit: missing .
Done


http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.h@162
PS4, Line 162:   // if true min/max values are descending.
> nit: missing .
Done


http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.inline.h@207
PS4, Line 207:   if (prev_page_min_buffer_.IsEmpty()) {
 : RETURN_IF_ERROR(CopyToBuffer(&prev_page_min_buffer_, 
&prev_page_min_value_));
> nit: long lines
Done


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@71
PS4, Line 71: column_index_length)
> nit: not enough indentation
Done


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@82
PS4, Line 82: offset_index_length)
> nit: not enough indentation
Done


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@85
PS4, Line 85:   page_header = get_page_header(parquet_file, 
page_loc.offset,
:   page_loc.compressed_page_size)
:   page_headers.append(page_header)
> nit: too much indentation
Done


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@113
PS4, Line 113: for null_page, value in zip(null_pages, values):
> Are these prints intentional?
No, jus left them here.


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@118
PS4, Line 118: elif ordering == BoundaryOrder.DESCENDING and 
previouse_value is not None:
> Is this print intentional?
Done


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@120
PS4, Line 120: previous_value = current_value
> nit: too much indentation
Done


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@122
PS4, Line 122:
> nit: too much indentation
Done


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@164
PS4, Line 164:   assert null_page
 : # Everything is None, no further checks needed.
 : return
 :
 :   min_value = decode_stats_value(column_info.schema, 
min_value_str)
 :   for null_p
> nit: too much indentation
Done


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@200
PS4, Line 200:   ).format(qualified_table_name, sorting_column, 
source_table)
> nit: not enough indentation
Done


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@203
PS4, Line 203: self._validate_parquet_index(hdfs_path, sorting_column, 
tmpdir)
> nit: not enough indentation
Done


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@244
PS4, Line 244:
> nit: missing .
Done


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@249
PS4, Line 249: self._ctas_table_and_verify_index(vector, unique_database,
> nit: missing .
Done


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

http://gerrit.cloudera.org:8080/#/c/9693/4/tests/util/get_parquet_metadata.py@169
PS4, Line 169: def get_column_index(filename, file_pos, length):
 :   """Reads a ColumnIndex object from a file at the given 
position."""
 :   serialized_column_index = read_serialized_object(filename, 
file_pos, length)
 :   protocol = create_protocol(serialized_column_index)
 :   column_index = ColumnIndex()
 :   column_index.read(protocol)
 :   return column_index
 :
 :
 : def get_page_header(filename, file_pos, length):
 :   """Reads a PageHeader object from a file at the given 
position."""
 :   serialized_page_header = read_serialized_object(filename, 
file_pos, length)
 :   protocol = create_protocol(serialized_page_header)
 :   page_header = PageHeader()
 :   page_header.read(protocol)

[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-26 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Tim Armstrong, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9693

to look at the new patch set (#5).

Change subject: IMPALA-5842: Write page index in Parquet files
..

IMPALA-5842: Write page index in Parquet files

This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M common/thrift/parquet.thrift
A tests/query_test/test_parquet_page_index.py
M tests/util/get_parquet_metadata.py
7 files changed, 547 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/5
--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-26 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 4:

(17 comments)

I have added a lot of nits, feel free to skip them for now if you think that it 
is too early to deal with such things at the moment.

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector null_pages_;
> OK, I'll implement this part after that.
About long strings: I think that it would be safer to start with a smaller max 
size than 1KB, for example 64, to avoid making inserts slower / the written 
files larger. Truncating long string can only make performance worse if a query 
contains =/ comparison with a string longer than the truncated length, which 
seems a non-typical use case to me.


http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.h@159
PS4, Line 159:   // If true, min/max values are ascending
nit: missing .


http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.h@162
PS4, Line 162:   // if true min/max values are descending
nit: missing .


http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.inline.h@207
PS4, Line 207:   if (prev_page_min_buffer_.IsEmpty()) 
RETURN_IF_ERROR(CopyToBuffer(&prev_page_min_buffer_, &prev_page_min_value_));
 :   if (prev_page_max_buffer_.IsEmpty()) 
RETURN_IF_ERROR(CopyToBuffer(&prev_page_max_buffer_, &prev_page_max_value_));
nit: long lines


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@71
PS4, Line 71:   column_index_length)
nit: not enough indentation


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@82
PS4, Line 82: offset_index_length)
nit: not enough indentation


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@85
PS4, Line 85: page_header = get_page_header(parquet_file, 
page_loc.offset,
: page_loc.compressed_page_size)
: page_headers.append(page_header)
nit: too much indentation


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@113
PS4, Line 113: print schema
Are these prints intentional?


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@118
PS4, Line 118: print current_value
Is this print intentional?


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@120
PS4, Line 120: assert current_value >= previous_value
nit: too much indentation


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@122
PS4, Line 122: assert current_value <= previous_value
nit: too much indentation


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@164
PS4, Line 164:   # If either is None, then both need to be None.
 :   assert min_value_str is None and max_value_str is 
None
 :   for null_page in null_pages:
 :   assert null_page
 :   # Everything is None, no further checks needed.
 :   return
nit: too much indentation


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@200
PS4, Line 200: qualified_table_name, source_table)
nit: not enough indentation


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@203
PS4, Line 203: ).format(qualified_table_name, sorting_column, 
source_table)
nit: not enough indentation


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@244
PS4, Line 244: """Test table with wide row"""
nit: missing .


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@249
PS4, Line 249: """Test tables with wide rows and many columns"""
nit: missing .


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

http://gerrit.cloudera.org:8080/#/c/9693/4/tests/util/get_parquet_metadata.py@169
PS4, Line 169: def get_column_index(filename, file_pos, length):
 : """Reads a 

[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-23 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9693

to look at the new patch set (#4).

Change subject: IMPALA-5842: Write page index in Parquet files
..

IMPALA-5842: Write page index in Parquet files

This commit builds on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M common/thrift/parquet.thrift
A tests/query_test/test_parquet_page_index.py
M tests/util/get_parquet_metadata.py
7 files changed, 546 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/4
--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-23 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 2:

(20 comments)

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector null_pages_;
> Just saw Tim's comment. I think we discussed this a while ago in the Parque
OK, I'll implement this part after that.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@739
PS2, Line 739: null_pages_
> Sry for the confusion. The spec says that this must be set if and only if a
Oh, sure.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1195
PS2, Line 1195:   // If the file contains more than one row_group, don't write 
the index.
> I missed this earlier - this seems weird. Do we ever write more than one ro
AFAIK we always write only one row group. I just copy-pasted this part from 
Pooja's work. Should I substitute it to a DCHECK maybe?


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224
PS2, Line 1224: <
> That sounds good to me.
Great, I implemented it that way.


http://gerrit.cloudera.org:8080/#/c/9693/3/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/3/be/src/exec/hdfs-parquet-table-writer.cc@1211
PS3, Line 1211: // Let's see if there's an ordering on min/max values.
> After moving column.min_values_ above it is invalid here. It also leaves th
We are using columns_ in the next section to write the offset index, so can't 
reset them here, but at the end of the function.

I only call BaseColumnWriter::Reset() on them, because unique_ptr::reset 
generates segfaults later.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@55
PS3, Line 55:
> We should assert that the schema is flat, e.g. by comparing lengths of row_
Added an assert.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@63
PS3, Line 63:
> You should not need "is not None" here and elsewhere. The parentheses also
I removed it from couple of places.
Kept them in asserts, because I think it is more readable that way.
Also kept for "if max_value is not None" and such, because max_value can be set 
to 0.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@84
PS3, Line 84:
> This looks complicated enough to warrant a namedtuple, maybe even a class.
I chose namedtuple


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@87
PS3, Line 87:
> ...row_groups... (plural)
Done


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@90
PS3, Line 90:
> this comment seems wrong
Removed the last sentence.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@103
PS3, Line 103:
> nit: All comments should follow PEP8 class comment format.
Done


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@118
PS3, Line 118:
> long line
Done


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@122
PS3, Line 122:
> It seems that this method is only called once, and without skip_col_idxs. R
Done


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@147
PS3, Line 147:
> Impala currently does not write these, add an assertion
Done


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@164
PS3, Line 164:
> What happens if it is None?
Added some extra asserts about it. If either stats.min_value or stats.max_value 
is None, the other must be None as well. And all pages need to be null.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@165
PS3, Line 165:
> Don't overwrite the variable
Done


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@205
PS3, Line 205:
> copy-paste error?
yup, done.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@230
PS3, Line 230:
> We also should have a test with a file that has 4K of 0xFF in it, just to m
Do you know a quick way of generate a parquet file like that?

I added the other tests.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/util/get_parquet_index.py
File tests/util/get_parquet_index.py:

PS3:
> Maybe we should merge this file with get_parquet_metadata.py (and clean it
Moved the contents to get_parquet_metadata.


http://gerrit.cl

[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-20 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector null_pages_;
> I think that null_pages should also have covered non-null pages with invali
Just saw Tim's comment. I think we discussed this a while ago in the Parquet 
sync and people were opposed to amending the standard for this case. I might be 
wrong though, let's check in the next meeting.



-- 
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 20 Mar 2018 23:03:17 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-20 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 3:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector null_pages_;
> Tim:
I think that null_pages should also have covered non-null pages with invalid 
stats, but the current spec doesn't seem to allow that. In light of that I see 
two options:

1) Store more than 4KB of the pathological value until a non-0xFF character. I 
think we don't allow unlimited Parquet footer sizes, but I cannot find the code 
that enforces this right now.

2) Not store a ColumnIndex for that column at all.

I am in favor of 2). This case seems highly hypothetical to me and we really 
just need to deal with it in a standard-conforming way. I don't think we need 
to care about the actual performance in this particular scenario.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@739
PS2, Line 739: null_pages_
> I'm not sure if I've found the relevant comment.
Sry for the confusion. The spec says that this must be set if and only if all 
values are null. Can you add a DCHECK to make sure that the null_count is equal 
to the number of values in the page?


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224
PS2, Line 1224: n
> I just realized that the min/max values are only strings at this point, at
That sounds good to me.


http://gerrit.cloudera.org:8080/#/c/9693/3/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/3/be/src/exec/hdfs-parquet-table-writer.cc@1211
PS3, Line 1211: for (int j = 0; j < column.min_values_.size() - 1; ++j) {
After moving column.min_values_ above it is invalid here. It also leaves the 
whole BaseColumnWriter in an inconsistent state and we risk to introduce errors 
in the future.

To improve this we could wrap the moves in a sub-scope that starts with 
dereferencing columns_[i] and ends with the move. We should then reset the 
column in columns_ and leave the array with NULL elements. That will require to 
make it very obvious that this happens through the name and comment of this 
function.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@55
PS3, Line 55: # We only support flat schemas, the additional element is the 
root element.
We should assert that the schema is flat, e.g. by comparing lengths of 
row_group.columns and schemas.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@63
PS3, Line 63:  is not None
You should not need "is not None" here and elsewhere. The parentheses also 
shouldn't be necessary.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@84
PS3, Line 84:   row_group_index.append((schema, stats, offset_index, 
column_index, page_headers))
This looks complicated enough to warrant a namedtuple, maybe even a class.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@87
PS3, Line 87: get_row_group
...row_groups... (plural)


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@90
PS3, Line 90: containing row groups and filename
this comment seems wrong


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@103
PS3, Line 103:
nit: All comments should follow PEP8 class comment format.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@118
PS3, Line 118:   def _validate_parquet_index(self, hdfs_path, sorting_column, 
tmpdir, skip_col_idxs = None):
long line


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@122
PS3, Line 122: skip_col_idxs = skip_col_idxs or []
It seems that this method is only called once, and without skip_col_idxs. 
Remove the parameter?


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@147
PS3, Line 147: num_values = 
page_header.data_page_header_v2.num_values
Impala currently does not write these, add an assertion


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@164
PS3, Line 164: if max_value is not None:
What happens if it is None?


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@165
PS3, Line 165:   max_value = decode_stats_value(schema, max_value)
Don't overwrite the variable


http://gerr

[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-20 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector null_pages_;
> Tim:
Yeah, maybe we should combine the code into a shared utility function, since I 
think the core logic is the same.

IMO the sensible thing is to leave the max unset in that case.  Hopefully we 
can amend the spec to allow that.

I think technically now our choices are to drop the entire column index or 
write out a huge value. Dropping the index wouldn't be a total disaster, since 
this should be a rare edge case, but it is kind of crappy.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1195
PS2, Line 1195:   // If the file contains more than one row_group, don't write 
the index.
I missed this earlier - this seems weird. Do we ever write more than one row 
group? Do we test it?


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224
PS2, Line 1224: n
> I just realized that the min/max values are only strings at this point, at
I think I like the idea of determining asc/desc incrementally instead of doing 
an extra pass. Doing it in Merge() could make sense, but I think we would have 
to clearly document what that means, since Merge() would no longer be a 
commutative operation.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 20 Mar 2018 22:21:36 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-20 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 2:

(12 comments)

Thanks for the comments! I fixed what I could, but also have some further 
questions.

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector null_pages_;
> The spec for page indexes actually allows to store min/max values that do n
Tim:

StringMinMaxFilter::MaterializeValues(): thanks for the tip, I'll use something 
similar.

Lars/Tim:
So, the cap should be 4kb?


About the case when the string starts with 4kb of 0xFF:
It seems to me that the behavior in this case should be decided by the Parquet 
community.

Anyway, what if we don't set the max_value in this case? Looking at 
HdfsParquetScanner::EvaluateStatsConjuncts() that should do the job. However, 
now the ordering of pages gets complicated. We need "no max_value and not null 
pages" at the end for ascending order. Null pages also don't have max_value, 
but they can be anywhere in the list of pages.

However, based on the comment in 
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L797
 , we cannot do that with the current specification.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@739
PS2, Line 739: null_pages_
> See my comment above, this should only be true for null pages.
I'm not sure if I've found the relevant comment.

There is a comment that says the spec allows to set min and max values that are 
not present in the column. But it is only for compaction, right? If there are 
no values in the column, does it make sense to set min and max values?

Yeah, the spec seems to allow that, but if Impala's implementation doesn't 
populate min and max values only for null pages, maybe we can use this 
information, can't we?

Or, I can of course smarten up the parquet ColumnStats class.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1199
PS2, Line 1199:   uint8_t* buffer;
> declare in the most nested scope possible.
Done


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1207
PS2, Line 1207: column_index.__set_null_pages(column.null_pages_);
> Can we move() these to avoid a copy?
Yes, done.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224
PS2, Line 1224: <
> Ideally the comparators would be implemented only once and we would call th
I just realized that the min/max values are only strings at this point, at it 
is kind of cumbersome to retrieve the type information here.

At L746 we call Merge() on row_group_stats_base_ to merge the page statistics 
into the column chunk statistics. Maybe Merge() could keep track of the 
ordering of the merged pages and then we could just call "GetBoundaryOrder()" 
on row_group_stats_base_. What do you think?


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1254
PS2, Line 1254:
> nit: extra blank line
Done


http://gerrit.cloudera.org:8080/#/c/9693/2/common/thrift/parquet.thrift
File common/thrift/parquet.thrift:

PS2: 
> Is this file now identical with the upstream parquet.thrift?
No, I only added the changes of PARQUET-922.
diff says there's quite a lot differences between the two files.


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@683
PS2, Line 683: class TestHdfsParquetTableIndexWriter(ImpalaTestSuite):
> I think this should live in its own file. It also needs a class comment.
Done


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@734
PS2, Line 734: tmp_dir = make_tmp_dir()
> This should use the tmpdir fixture from pytest (see other tests in this fil
Done


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@746
PS2, Line 746:   def _validate_ordering(ordering, schema, null_pages, values):
> This should have a function comment.
Added a comment.


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@785
PS2, Line 785:   # Not accurate. Does not make sure null_count is equal 
to number of rows.
> I think this should eventually be accurate (but I could be missing somethin
Now I'm using num_values from PageHeader.


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@835
PS2, Line 835:   def test_write_index_alltypes(self, vector, unique_database):
> Let's make sure that we cover all data types (the name of alltypes is unfor
Added tests for 

[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-20 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9693

to look at the new patch set (#3).

Change subject: IMPALA-5842: Write page index in Parquet files
..

IMPALA-5842: Write page index in Parquet files

This commit builds heavily on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M common/thrift/parquet.thrift
A tests/query_test/test_parquet_page_index.py
A tests/util/get_parquet_index.py
5 files changed, 488 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/3
--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-16 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 2:

(12 comments)

I, too, left some high level comments.

http://gerrit.cloudera.org:8080/#/c/9693/1/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

http://gerrit.cloudera.org:8080/#/c/9693/1/be/src/exec/hdfs-parquet-table-writer.h@126
PS1, Line 126:   /// Write the column index and offset index of each page in 
the file.
Do we otherwise support writing more than one rowgroup?


http://gerrit.cloudera.org:8080/#/c/9693/1/be/src/exec/hdfs-parquet-table-writer.h@127
PS1, Line 127:
... would only be written... ?


http://gerrit.cloudera.org:8080/#/c/9693/1/be/src/exec/hdfs-parquet-table-writer.h@127
PS1, Line 127:
typo


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector null_pages_;
> I'm concerned that this may introduce a non-trivial amount of untracked mem
The spec for page indexes actually allows to store min/max values that do not 
occur in the page: 
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L802

We should compute shorter prefixes when possible and cut them off at 4kb.

We also need to decide what to do if a string starts with 4kb of 0xFF.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@739
PS2, Line 739: null_pages_
See my comment above, this should only be true for null pages.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1199
PS2, Line 1199:   uint8_t* buffer;
declare in the most nested scope possible.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224
PS2, Line 1224: <
> I think we should avoid using builtin operators to compare values since the
Ideally the comparators would be implemented only once and we would call the 
same methods here and in the parquet stats.


http://gerrit.cloudera.org:8080/#/c/9693/2/common/thrift/parquet.thrift
File common/thrift/parquet.thrift:

PS2:
Is this file now identical with the upstream parquet.thrift?


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@683
PS2, Line 683: class TestHdfsParquetTableIndexWriter(ImpalaTestSuite):
I think this should live in its own file. It also needs a class comment.


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@734
PS2, Line 734: tmp_dir = make_tmp_dir()
This should use the tmpdir fixture from pytest (see other tests in this file)


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@746
PS2, Line 746:   def _validate_ordering(ordering, schema, null_pages, values):
This should have a function comment.


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@785
PS2, Line 785:   # Not accurate. Does not make sure null_count is equal 
to number of rows.
I think this should eventually be accurate (but I could be missing something).



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 16 Mar 2018 22:31:21 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-16 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
..


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector null_pages_;
I'm concerned that this may introduce a non-trivial amount of untracked memory, 
particularly for non-clustered dynamically-partitioned inserts where we have 
multiple table writers going at the same time. I don't think we cap the size of 
min/max values, so this is an additional concern.

StringMinMaxFilter::MaterializeValues() has some relevant logic to cap the size 
of strings to 1KB, which we should probably use.

I did a rough calculation for a 512MB parquet file. That could have 8192 64KB 
pages. Even with BIGINT values, that would be 128kb of stats, which is an 
amount we should be tracking. With 1kb strings that would be 8MB of untracked 
min/max values.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1207
PS2, Line 1207: column_index.__set_null_pages(column.null_pages_);
Can we move() these to avoid a copy?


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224
PS2, Line 1224: <
I think we should avoid using builtin operators to compare values since the 
Parquet spec's order won't necessarily always match the C++ types order.

Having a helper function would make it clear that the code means the Parquet 
spec's order.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1254
PS2, Line 1254:
nit: extra blank line


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@835
PS2, Line 835:   def test_write_index_alltypes(self, vector, unique_database):
Let's make sure that we cover all data types (the name of alltypes is 
unfortunate, since it's missing DECIMAL, CHAR and VARCHAR).



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 16 Mar 2018 22:03:02 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-16 Thread Zoltan Borok-Nagy (Code Review)
Hello Lars Volker, Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9693

to look at the new patch set (#2).

Change subject: IMPALA-5842: Write page index in Parquet files
..

IMPALA-5842: Write page index in Parquet files

This commit builds heavily on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M common/thrift/parquet.thrift
M tests/query_test/test_insert_parquet.py
A tests/util/get_parquet_index.py
5 files changed, 433 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/2
--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

2018-03-16 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/9693


Change subject: IMPALA-5842: Write page index in Parquet files
..

IMPALA-5842: Write page index in Parquet files

This commit builds heavily on the previous work of
Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/

The commit implements the write path of PARQUET-922:
"Add column indexes to parquet.thrift". As specified in the
parquet-format, Impala writes the page indexes just before
the footer. This allows much more efficient page filtering
than using the same information from the 'statistics' field
of DataPageHeader.

I updated Pooja's python tests as well.

Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M common/thrift/parquet.thrift
M tests/query_test/test_insert_parquet.py
A tests/util/get_parquet_index.py
5 files changed, 433 insertions(+), 2 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/9693/1
--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy