[
https://issues.apache.org/jira/browse/IMPALA-14116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18065455#comment-18065455
]
Fang-Yu Rao edited comment on IMPALA-14116 at 3/15/26 9:47 PM:
---------------------------------------------------------------
Another call site of {{GetSearchArgumentLiteral()}} is
{{PrepareBinaryPredicate()}} in
[hdfs-orc-scanner.cc|https://github.com/apache/impala/blob/master/be/src/exec/orc/hdfs-orc-scanner.cc].
{code:c++}
bool HdfsOrcScanner::PrepareBinaryPredicate(const string& fn_name, uint64_t orc_column_id,
    const ColumnType& type, ScalarExprEvaluator* eval,
    orc::SearchArgumentBuilder* sarg) {
  orc::PredicateDataType predicate_type;
  orc::Literal literal = GetSearchArgumentLiteral(eval, /*child_idx*/1, type,
      &predicate_type);
  ...
}
{code}
This call site could hit the same problem of instantiating an {{orc::Literal}}
with a pointer. For instance, consider the following SQL statement.
{code:sql}
select string_col from functional_orc_def.alltypestiny where string_col > null;
{code}
If that null literal were pushed down to the scanner, we would again fail the
validation in {{validate()}}, which is called from the {{PredicateLeaf}}
constructor in
[PredicateLeaf.cc|https://github.com/apache/orc/blob/v1.7.9/c%2B%2B/src/sargs/PredicateLeaf.cc#L55],
as shown in [^resolved_PrepareBinaryPredicate.txt].
{code:java}
22 impalad!orc::PredicateLeaf::validate() const [PredicateLeaf.cc : 136 + 0x16]
    rsp = 0x000079bd72717670 rip = 0x00000000011434fd
23 impalad!orc::PredicateLeaf::PredicateLeaf(orc::PredicateLeaf::Operator, orc::PredicateDataType, unsigned long, orc::Literal) [PredicateLeaf.cc : 55 + 0x8]
    rsp = 0x000079bd727176e0 rip = 0x0000000003ed65d3
24 impalad!orc::SearchArgumentBuilder& orc::SearchArgumentBuilderImpl::compareOperator<unsigned long>(orc::PredicateLeaf::Operator, unsigned long, orc::PredicateDataType, orc::Literal) [SearchArgument.cc : 124 + 0x21]
    rsp = 0x000079bd72717710 rip = 0x0000000003e9e8fd
28 impalad!orc::SearchArgumentBuilderImpl::lessThanEquals(unsigned long, orc::PredicateDataType, orc::Literal) [SearchArgument.cc : 155 + 0x16]
    rsp = 0x000079bd72717850 rip = 0x0000000003e9882e
29 impalad!impala::HdfsOrcScanner::PrepareBinaryPredicate(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, impala::ColumnType const&, impala::ScalarExprEvaluator*, orc::SearchArgumentBuilder*) [hdfs-orc-scanner.cc : 1266 + 0x13]
    rsp = 0x000079bd727178c0 rip = 0x0000000001fcf93f
31 impalad!impala::HdfsOrcScanner::PrepareSearchArguments() [hdfs-orc-scanner.cc : 1405 + 0x22]
    rsp = 0x000079bd727179c0 rip = 0x0000000001fd49c9
{code}
Currently that null literal is not pushed down to the scanner:
{{HdfsScanNode#tryComputeBinaryStatsPredicate()}}
[returns|https://github.com/apache/impala/blob/ef2d50e/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L720]
early when the constant is a null literal, so the predicate never reaches
{{buildBinaryStatsPredicate()}} and is never added to {{statsConjuncts_}}.
{code:java}
private void tryComputeBinaryStatsPredicate(Analyzer analyzer,
    BinaryPredicate binaryPred) {
  // We only support slot refs on the left hand side of the predicate, a rewriting
  // rule makes sure that all compatible exprs are rewritten into this form. Only
  // implicit casts are supported.
  SlotRef slotRef = binaryPred.getChild(0).unwrapSlotRef(true);
  if (slotRef == null) return;
  SlotDescriptor slotDesc = slotRef.getDesc();
  // This node is a table scan, so this must be a scanning slot.
  Preconditions.checkState(slotDesc.isScanSlot());
  // Skip the slot ref if it refers to an array's "pos" field.
  if (slotDesc.isArrayPosRef()) return;
  Expr constExpr = binaryPred.getChild(1);
  // Only constant exprs can be evaluated against parquet::Statistics. This includes
  // LiteralExpr, but can also be an expr like "1 + 2".
  if (!constExpr.isConstant()) return;
  // We return directly so the null literal is not pushed down.
  if (Expr.IS_NULL_VALUE.apply(constExpr)) return;
  ...
}
{code}
So if we comment out the {{if (Expr.IS_NULL_VALUE.apply(constExpr)) return;}}
check to force the null literal to be pushed down to the scan node, Impala
daemons can crash with the stack trace shown above.
Therefore, we should probably find every place in
[hdfs-orc-scanner.cc|https://github.com/apache/impala/blob/master/be/src/exec/orc/hdfs-orc-scanner.cc]
where an {{orc::Literal}} is instantiated with a pointer, and pass an
{{orc::PredicateDataType}} value instead of a pointer to
{{orc::PredicateDataType}}.
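To illustrate why passing a pointer is hazardous, here is a minimal standalone sketch. It assumes (this is my reading, not something confirmed in the attached traces) that the pointer is implicitly converted to {{bool}}, so a constructor taking {{bool}} is selected instead of the one that builds a typed NULL literal. The {{DataType}} and {{Literal}} types and both helper functions below are hypothetical stand-ins, not the real ORC or Impala declarations.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; NOT the real ORC declarations.
enum class DataType { LONG, STRING, BOOLEAN };

class Literal {
 public:
  // Builds a typed NULL literal.
  explicit Literal(DataType type) : type_(type), is_null_(true) {}
  // Builds a non-null BOOLEAN literal.
  explicit Literal(bool value)
      : type_(DataType::BOOLEAN), is_null_(false), bool_value_(value) {}

  DataType type() const { return type_; }
  bool is_null() const { return is_null_; }
  bool bool_value() const { return bool_value_; }

 private:
  DataType type_;
  bool is_null_;
  bool bool_value_ = false;
};

// Pitfall: 'type' is a pointer. A pointer cannot convert to DataType, but it
// does convert to bool, so the bool constructor is chosen and the result is
// a non-null BOOLEAN literal regardless of what '*type' says.
Literal MakeNullLiteralFromPointer(DataType* type) { return Literal(type); }

// Sketch of the proposed direction: take the type by value, so the
// typed-NULL constructor is the one selected.
Literal MakeNullLiteral(DataType type) { return Literal(type); }
```

With these stand-ins, a literal built from a pointer to a STRING type ends up typed as BOOLEAN, which is the kind of type mismatch a later validation step would reject; passing the value keeps the STRING type and the NULL flag.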
*+Edit:+*
I found two more regressions, for the date and decimal types. The following
queries reproduce the issue.
{code:sql}
select * from functional_orc_def.date_tbl where date_col in (null);
select * from functional_orc_def.decimal_tbl where d1 in (null);
{code}
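As background for these reproducers, the sketch below models standard SQL three-valued logic for {{IN}} in a WHERE clause. It only illustrates that a NULL entry in an IN-list can never make a row pass the filter. The names {{Value}}, {{EvalIn}} and {{RowPasses}} are hypothetical and do not exist in Impala, and whether the scanner should skip such literals rather than error out is left open here.

```cpp
#include <optional>
#include <string>
#include <vector>

// std::nullopt models SQL NULL. All names here are hypothetical.
using Value = std::optional<std::string>;

// Evaluates `x IN (in_list...)` for a non-empty list under three-valued
// logic: returns true, false, or std::nullopt for UNKNOWN.
std::optional<bool> EvalIn(const Value& x, const std::vector<Value>& in_list) {
  if (!x.has_value()) return std::nullopt;  // NULL IN (...) is UNKNOWN.
  bool saw_null = false;
  for (const Value& item : in_list) {
    if (!item.has_value()) {
      saw_null = true;  // x = NULL is UNKNOWN; keep looking for a match.
      continue;
    }
    if (*item == *x) return true;  // One TRUE comparison decides the result.
  }
  // No match: UNKNOWN if any comparison was against NULL, otherwise FALSE.
  if (saw_null) return std::nullopt;
  return false;
}

// A WHERE clause keeps a row only when the predicate is TRUE.
bool RowPasses(const std::optional<bool>& result) {
  return result.value_or(false);
}
```

Under this model, {{col IN (NULL)}} keeps no rows, and {{col IN ('', NULL)}} keeps exactly the rows that {{col IN ('')}} would keep.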
> Consider erroring out earlier if NULL is on the IN-list of a table scan
> against an ORC table
> --------------------------------------------------------------------------------------------
>
> Key: IMPALA-14116
> URL: https://issues.apache.org/jira/browse/IMPALA-14116
> Project: IMPALA
> Issue Type: Improvement
> Reporter: Fang-Yu Rao
> Assignee: Fang-Yu Rao
> Priority: Major
> Attachments: resolved_PrepareBinaryPredicate.txt,
> resolved_crashed_thread.txt
>
>
> We found that if NULL is included in the IN-list of a table scan against an
> ORC table, Impala daemons could crash. This can be reproduced as follows.
> # Create the database and an ORC table under the database in impala-shell.
> {code}
> create database test_db_04;
> CREATE EXTERNAL TABLE test_db_04.test_tbl_01 (customer_id STRING)
> PARTITIONED BY (ingest_date STRING)
> WITH SERDEPROPERTIES ('serialization.format'='1')
> STORED AS ORC;
> {code}
> # Insert a row into the ORC table just created via beeline.
> {code}
> INSERT INTO test_db_04.test_tbl_01 partition (ingest_date='2025-05-29')
> values ('CUST001');
> {code}
> # Execute the following query via impala-shell.
> {code}
> SELECT ingest_date, customer_id
> FROM test_db_04.test_tbl_01 WHERE ingest_date > DATE '2024-09-30' AND
> customer_id IN ('', NULL)
> GROUP BY 1, 2;
> {code}
> An Impala daemon would crash during the execution of the ORC table scan. The
> stack trace of the crashed thread in the resolved minidump is also provided
> in [^resolved_crashed_thread.txt].
> We should consider erroring out earlier if NULL is on the IN-list of a table
> scan against an ORC table to prevent any Impala daemon from crashing, maybe
> during the query analysis.