Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15403 )

Change subject: IMPALA-6505: Min-Max predicate push down in ORC scanner
......................................................................


Patch Set 4:

(21 comments)

Added tests and revoled the comments. Let me know if I missed any.

http://gerrit.cloudera.org:8080/#/c/15403/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15403/1//COMMIT_MSG@19
PS1, Line 19:  - Run scanner tests on ORC files.
> Some ideas for more tests:
Done. Before ORC-961, we can only check "RowsRead" in the profile.


http://gerrit.cloudera.org:8080/#/c/15403/2/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15403/2/be/src/exec/hdfs-orc-scanner.h@180
PS2, Line 180: o whe
> nit. would search_argument_pool sounds better?
We have such names in other places meaning MemPool for all permanent 
allocations. I feel like we can keep this name..


http://gerrit.cloudera.org:8080/#/c/15403/2/be/src/exec/hdfs-orc-scanner.h@316
PS2, Line 316: r* slot);
> nit: can it be const?
We need to use ScalarExprEvaluator::GetValue() which is unfortunately not 
declared as const..


http://gerrit.cloudera.org:8080/#/c/15403/2/be/src/exec/hdfs-orc-scanner.h@317
PS2, Line 317:
> nit: we usually put output parameters at the end of the parameter list
Done


http://gerrit.cloudera.org:8080/#/c/15403/1/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15403/1/be/src/exec/hdfs-orc-scanner.cc@888
PS1, Line 888:   int tuple_idx = 0;
             :   while (!scan_node_-
> There shouldn't be too much problem if the table is written and read using
Let's skip pushing down predicates on CHAR/VARCHAR since there are also other 
issues like IMPALA-1652 and HIVE-24040. I filed IMPALA-10882 to continue this 
work.


http://gerrit.cloudera.org:8080/#/c/15403/1/be/src/exec/hdfs-orc-scanner.cc@959
PS1, Line 959:       *predicate_type = orc::PredicateDataType::TIMESTAMP;
             :       if (!val) return orc::Literal(predicate_type);
             :       const TimestampValue* tv = reinterpret_cast<const 
TimestampValue*>(val);
             :       int64_t nanos;
> Same as above, wait for ORC-612?
Done


http://gerrit.cloudera.org:8080/#/c/15403/2/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15403/2/be/src/exec/hdfs-orc-scanner.cc@631
PS2, Line 631:     int num_row_to_commit = TransferScratchTuples(row_batch);
> Can this be removed?
Done


http://gerrit.cloudera.org:8080/#/c/15403/2/be/src/exec/hdfs-orc-scanner.cc@833
PS2, Line 833: row_id < capacity && !o
> nit. Please add a comment: T is the intended ORC primitive type and U is Im
Done


http://gerrit.cloudera.org:8080/#/c/15403/2/be/src/exec/hdfs-orc-scanner.cc@834
PS2, Line 834: nullptr);
> nit. maybe renamed as GetORCPrimitiveLiteral.
Done


http://gerrit.cloudera.org:8080/#/c/15403/2/be/src/exec/hdfs-orc-scanner.cc@836
PS2, Line 836: TURN_
> UNLIKELY()?
I think this is not in the hot path. But it's harmless to add it. Done.


http://gerrit.cloudera.org:8080/#/c/15403/2/be/src/exec/hdfs-orc-scanner.cc@874
PS2, Line 874: t i
> UNLIKELY()?
Done


http://gerrit.cloudera.org:8080/#/c/15403/2/be/src/exec/hdfs-orc-scanner.cc@888
PS2, Line 888:   int tuple_idx = 0;
> Then should we add a DCHECK?
Done


http://gerrit.cloudera.org:8080/#/c/15403/2/be/src/exec/hdfs-orc-scanner.cc@895
PS2, Line 895: t nu
> UNLIKELY()?
Done


http://gerrit.cloudera.org:8080/#/c/15403/2/be/src/exec/hdfs-orc-scanner.cc@901
PS2, Line 901: m_row
> UNLIKELY()?
Done


http://gerrit.cloudera.org:8080/#/c/15403/2/be/src/exec/hdfs-orc-scanner.cc@905
PS2, Line 905: e (num_
> need to check NULL
Nice catch!


http://gerrit.cloudera.org:8080/#/c/15403/2/be/src/exec/hdfs-orc-scanner.cc@917
PS2, Line 917:
             :
             : template<typename T, typena
> These three local variables can be relocated to each of the relevant cases.
Done


http://gerrit.cloudera.org:8080/#/c/15403/2/be/src/exec/hdfs-orc-scanner.cc@941
PS2, Line 941: *predicate_type = orc::PredicateDataType::LONG;
> I wonder if we should return Status for this method instead of ORC literal
Nice catch! For failure cases we all return NULL literals. I added a check in 
the caller.


http://gerrit.cloudera.org:8080/#/c/15403/2/be/src/exec/hdfs-orc-scanner.cc@980
PS2, Line 980:    // The date should be valid at this point.
             :       ignore_result(dv->ToDaysSinceEpoch(&value));
             :       return orc::Literal(*predicate_type, value);
             :     }
             :     case TYPE_STRING:
             :     case TYPE_VARCHAR:{
             :       *predicate_type = orc::PredicateDataType::STRING;
             :       if (!val) return orc::Literal(predicate_type);
             :       const StringValue* sv = 
reinterpret_cast<StringValue*>(val);
             :       return orc::Lite
> I think for a min/max filter F, we need to formulate the following two push
Will address this in IMPALA-10878.


http://gerrit.cloudera.org:8080/#/c/15403/2/be/src/service/query-options.h
File be/src/service/query-options.h:

http://gerrit.cloudera.org:8080/#/c/15403/2/be/src/service/query-options.h@50
PS2, Line 50: ORC_READ_STATISTICS
> ORC_PUSHDOWN_PREDICATES probably sounds better for this patch. Maybe there
There is an existing query option PARQUET_READ_STATISTICS. Maybe it's ok to 
keep this name?


http://gerrit.cloudera.org:8080/#/c/15403/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/15403/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@543
PS1, Line 543:     statsPred.analyzeNoThrow(analyzer);
> ORC lib supports pushing down EQ predicates. I think we don't need to trans
Will do this in IMPALA-10873


http://gerrit.cloudera.org:8080/#/c/15403/2/tests/query_test/test_nested_types.py
File tests/query_test/test_nested_types.py:

http://gerrit.cloudera.org:8080/#/c/15403/2/tests/query_test/test_nested_types.py@165
PS2, Line 165:   @SkipIfABFS.hive
             :   @SkipIfADLS.hive
> Should we change the reason here? Like "this test is specific to Parquet, f
Done



--
To view, visit http://gerrit.cloudera.org:8080/15403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I136622413db21e0941d238ab6aeea901a6464845
Gerrit-Change-Number: 15403
Gerrit-PatchSet: 4
Gerrit-Owner: Norbert Luksa <norbert.lu...@cloudera.com>
Gerrit-Reviewer: Anonymous Coward (520)
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <norbert.lu...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Wed, 25 Aug 2021 11:50:53 +0000
Gerrit-HasComments: Yes

Reply via email to