Konstantin Bereznyakov created HIVE-29618:
---------------------------------------------
Summary: Vectorized LIKE Operator should match multi-line input
(HIVE-22008 followup)
Key: HIVE-29618
URL: https://issues.apache.org/jira/browse/HIVE-29618
Project: Hive
Issue Type: Bug
Reporter: Konstantin Bereznyakov
h2. Description
The vectorized \{{LIKE}} operator's \{{ComplexChecker}} compiles its regex
via \{{Pattern.compile(pattern)}} without the \{{Pattern.DOTALL}} flag.
Patterns containing an unescaped \{{_}} wildcard route
through this path. By Java \{{Pattern}} default, \{{.}} does not match
\{{\n}}, so the anchored regex \{{^.*?<literal>.<literal>.*?$}} that the
\{{COMPLEX}} branch produces cannot consume newlines, and
multi-line inputs containing the literal substring are silently rejected. The
bug is silent.
The non-vectorized counterpart \{{UDFLike.evaluate}}
(\{{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}}) compiles the
same regex via {{Pattern.compile(likePatternToRegExp(strLikePattern),
Pattern.DOTALL)}} and returns the correct answer. The asymmetry was
introduced by HIVE-22008 ("LIKE Operator should match multi-line input",
2019-08-01), which added \{{Pattern.DOTALL}} to
\{{UDFLike}} only and did not update the parallel implementation in
\{{AbstractFilterStringColLikeStringScalar.ComplexChecker}}, created earlier by
HIVE-4642 (2013-10-03). The regression test added by
HIVE-22008 (\{{ql/src/test/queries/clientpositive/udf_like.q}}) does not
enable vectorized execution, so the vectorized \{{ComplexChecker}} path was
never exercised by it.
Reproduced on Apache Hive master, commit \{{1516fb91e8}} (\{{pom.xml}}
declares version \{{4.3.0-SNAPSHOT}}).
h2. When does the bug fire?
All of the following co-occurring conditions are required:
* Vectorized execution is on (default).
* The \{{LIKE}} pattern contains an unescaped \{{_}} (single-char wildcard).
This is the only path that reaches \{{ComplexChecker}}; patterns without \{{_}}
route to non-regex \{{NoneChecker}} /
\{{BeginChecker}} / \{{EndChecker}} / \{{MiddleChecker}} /
\{{ChainedChecker}}, which are byte-substring searches and handle newlines
correctly.
* The input string contains \{{\n}} (or another line terminator that \{{.}}
does not match by default) outside the literal portion of the pattern;
equivalently, the \{{.*?}} segments of the produced
regex would need to consume a newline to satisfy \{{^...$}}.
Removing any one condition masks the bug.
h2. Mechanism
The classifier
\{{AbstractFilterStringColLikeStringScalar.UDFLikePattern.matcher}} returns
\{{COMPLEX}} as soon as it sees an unescaped \{{_}}. The \{{COMPLEX}} branch
builds its regex as:
\{code:java}
"^" + UDFLike.likePatternToRegExp(pattern) + "$"
\{code}
\{{UDFLike.likePatternToRegExp}} translates \{{_}} to \{{.}}, \{{%}} to
\{{.*?}}, and other characters to \{{Pattern.quote}}-wrapped literals. For
pattern \{{%information_schema%}} the produced regex
(literals elided) is \{{^.*?information.schema.*?$}}. \{{ComplexChecker}}
then compiles it at
\{code:java}
compiledPattern = Pattern.compile(pattern);
\{code}
The non-vectorized counterpart at
\{{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}} is:
\{code:java}
p = Pattern.compile(likePatternToRegExp(strLikePattern), Pattern.DOTALL);
\{code}
Without \{{Pattern.DOTALL}}, \{{.}} (and therefore \{{.*?}}) does not match
\{{\n}}. The \{{^...$}} anchors require the regex to consume the entire input,
but \{{.*?}} cannot cross newlines, so any
multi-line input containing the literal substring is silently rejected.
h2. Workaround
Escape the underscore so the pattern no longer reaches \{{ComplexChecker}}
(it routes through one of the byte-substring checkers instead):
\{code:sql}
SELECT q LIKE '%information\_schema%' FROM t;
\{code}
Or disable vectorized execution for the query:
\{code:sql}
SET hive.vectorized.execution.enabled=false;
\{code}
h2. Reproduction
\{code:sql}
SET hive.fetch.task.conversion=none;
/* Materialization to ORC is required so the LIKE filter scans a vectorized
input. UNION ALL of literals plans as a UDTF and disables vectorization
on the map vertex; the materialized intermediate restores it. ORC is
required because the default TextFile format treats \n as a row
terminator and truncates multi-line strings. */
SET hive.default.fileformat=ORC;
SET hive.optimize.cte.materialize.threshold=1;
SET hive.optimize.cte.materialize.full.aggregate.only=false;
WITH t AS (
SELECT 1 AS rid, 'simple match information_schema only one line'
AS q
UNION ALL SELECT 2 AS rid, concat('line one', chr(10), 'line two contains
information_schema here', chr(10)) AS q
UNION ALL SELECT 3 AS rid, concat('prefix', chr(10), 'more', chr(10),
'stuff', chr(10), 'information_schema appears late') AS q
UNION ALL SELECT 4 AS rid, concat('information_schema is on the first
line', chr(10), 'but there are more lines', chr(10), 'after it') AS q
)
SELECT rid,
CASE WHEN q LIKE '%information_schema%' THEN 'match' ELSE 'NO_MATCH'
END AS result
FROM t
ORDER BY rid;
\{code}
*Expected:* all four rows return \{{match}}. Each row's \{{q}} contains the
literal substring \{{information_schema}}, and \{{%information_schema%}}
matches that substring under SQL semantics (the
literal \{{_}} satisfies the \{{_}} single-char wildcard).
*Actual:*
\{noformat}
1 match
2 NO_MATCH
3 NO_MATCH
4 NO_MATCH
\{noformat}
Only row 1 (single-line) is matched. Rows 2, 3, and 4 each contain \{{\n}}
outside the \{{information_schema}} run and are dropped. Running the same query
with {{SET
hive.vectorized.execution.enabled=false}} returns all four rows.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)