[
https://issues.apache.org/jira/browse/HIVE-26653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18045379#comment-18045379
]
Soumyakanti Das commented on HIVE-26653:
----------------------------------------
After extensive analysis, I am closing this issue as {*}"Won't Fix"{*}. The bug
is no longer reproducible on the current master branch without manually
reverting an unrelated commit.
*Root Cause Analysis:*
For the query mentioned in the description, in Vectorizer.isBigTableOnlyResults
method we see a constant expression ('20220731') being pushed to the SmallTable
side. It incorrectly interprets this as a dependency on SmallTable data (value
deserialization) and returns {{{}false{}}}. This forces the selection of the
general-purpose {{VectorMapJoinInnerStringOperator}} (HashMap) instead of the
optimized {{VectorMapJoinInnerBigOnlyStringOperator}} (HashMultiSet).
Since the SmallTable subquery was key-only ({{{}SELECT p_dt ... GROUP BY
p_dt{}}}), the hash map correctly stores an *empty value payload*
({{{}length=0{}}}) for the key. The operator passes this empty payload to the
deserializer which interprets the empty bytes as {{{}NULL{}}}.
*Evaluation of Fixes:*
There are three places where a fix could land, all of which were deemed
unsuitable:
* *Operator-Level Fix{^}[1]{^}:* Patching
{{VectorMapJoinGenerateResultOperator}} to detect the empty payload and
manually copy the BigTable key. This treats the symptom, not the root cause,
and introduces extra responsibility to the deserialization which is not
warranted.
* *Vectorizer Fix:* Modifying {{isBigTableOnlyResults}} to allow {{BIG_ONLY}}
selection when the SmallTable schema has no value columns, and remapping the
projection. But this has high complexity and high risk of regression because it
touches a lot of different variables that maintain state of the code.
* *MapJoinDesc Fix{^}[2]{^}:* Moving constants to the BigTable side so that
the small table retain list is empty. This has to be done before populating the
retainList, as shown below. This works but being in the MapJoinDesc, it has a
much larger scope and thus could impact other kinds of joins too (e.g.,
changing Outer Join semantics).
*Conclusion:*
Since the issue is no longer reproducible and the risks outweighs the benefits,
I think we should abandon this. Do note that disabling vectorization could be a
workaround if we encounter this or something similar in production. But the
chances of anyone hitting this in production seems negligible.
[1]: [https://github.com/apache/hive/pull/6165]
[2] This is what I have tried in MapJoinDesc
(moveConstantExprsToBigTableSide()), leaving it here for future reference:
{noformat}
private void moveConstantExprsToBigTableSide() {
Map<Byte, List<ExprNodeDesc>> exprs = getExprs();
Map<Byte, List<ExprNodeDesc>> optimizedExprs = new HashMap<>();
for (Byte b: exprs.keySet()) {
List<ExprNodeDesc> listOfExprs = exprs.get(b);
if (b == posBigTable || listOfExprs.isEmpty()) {
optimizedExprs.put(b, listOfExprs);
}
List<ExprNodeDesc> listOfConstantExprs = new ArrayList<>();
List<ExprNodeDesc> listOfNonConstantExprs = new ArrayList<>();
for (ExprNodeDesc expr: listOfExprs) {
if (expr instanceof ExprNodeConstantDesc) {
listOfConstantExprs.add(expr);
} else {
listOfNonConstantExprs.add(expr);
}
}
if (!listOfConstantExprs.isEmpty()) {
if (optimizedExprs.containsKey((byte) posBigTable)) {
optimizedExprs.get((byte)posBigTable).addAll(listOfConstantExprs);
} else {
optimizedExprs.put((byte)posBigTable, listOfConstantExprs);
}
} else {
optimizedExprs.put(b, listOfNonConstantExprs);
}
}
setExprs(optimizedExprs);
}
private void initRetainExprList() {
moveConstantExprsToBigTableSide();
retainList = new HashMap<Byte, List<Integer>>();
Set<Entry<Byte, List<ExprNodeDesc>>> set = super.getExprs().entrySet();
Iterator<Entry<Byte, List<ExprNodeDesc>>> setIter = set.iterator();
while (setIter.hasNext()) {
Entry<Byte, List<ExprNodeDesc>> current = setIter.next();
List<Integer> list = new ArrayList<Integer>();
for (int i = 0; i < current.getValue().size(); i++) {
list.add(i);
}
retainList.put(current.getKey(), list);
}
}{noformat}
> Wrong results when (map) joining multiple tables on partition column
> --------------------------------------------------------------------
>
> Key: HIVE-26653
> URL: https://issues.apache.org/jira/browse/HIVE-26653
> Project: Hive
> Issue Type: Bug
> Components: HiveServer2
> Affects Versions: 4.2.0
> Reporter: Stamatis Zampetakis
> Assignee: Stamatis Zampetakis
> Priority: Major
> Labels: pull-request-available
> Attachments: hive_26653.q, hive_26653_explain.txt,
> hive_26653_explain_cbo.txt, table_a.csv, table_b.csv
>
>
> The result of the query must have exactly one row matching the date specified
> in the WHERE clause but the query returns nothing.
> {code:sql}
> CREATE TABLE table_a (`aid` string ) PARTITIONED BY (`p_dt` string)
> row format delimited fields terminated by ',' stored as textfile;
> LOAD DATA LOCAL INPATH '../../data/files/_tbla.csv' into TABLE table_a;
> CREATE TABLE table_b (`bid` string) PARTITIONED BY (`p_dt` string)
> row format delimited fields terminated by ',' stored as textfile;
> LOAD DATA LOCAL INPATH '../../data/files/_tblb.csv' into TABLE table_b;
> set hive.auto.convert.join=true;
> set hive.optimize.semijoin.conversion=false;
> SELECT a.p_dt
> FROM ((SELECT p_dt
> FROM table_b
> GROUP BY p_dt) a
> JOIN
> (SELECT p_dt
> FROM table_a
> GROUP BY p_dt) b ON a.p_dt = b.p_dt
> JOIN
> (SELECT p_dt
> FROM table_a
> GROUP BY p_dt) c ON a.p_dt = c.p_dt)
> WHERE a.p_dt = translate(cast(to_date(date_sub('2022-08-01', 1)) AS string),
> '-', '');
> {code}
> +Expected result+
> 20220731
> +Actual result+
> Empty
> To reproduce the problem the tables need to have some data. Values in aid and
> bid columns are not important. For p_dt column use one of the following
> values 20220731, 20220630.
> I will attach some sample data with which the problem can be reproduced. The
> tables look like below.
> ||aid|pdt||
> |611|20220731|
> |239|20220630|
> |...|...|
> The problem can be reproduced via qtest in current master
> (commit
> [6b05d64ce8c7161415d97a7896ea50025322e30a|https://github.com/apache/hive/commit/6b05d64ce8c7161415d97a7896ea50025322e30a])
> by running the TestMiniLlapLocalCliDriver.
> There is specific query plan (will attach shortly) for which the problem
> shows up so if the plan changes slightly the problem may not appear anymore;
> this is why we need to set explicitly hive.optimize.semijoin.conversion and
> hive.auto.convert.join to trigger the problem.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)