[ https://issues.apache.org/jira/browse/HIVE-27775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794502#comment-17794502 ]

Zhihua Deng edited comment on HIVE-27775 at 12/8/23 3:34 AM:
-------------------------------------------------------------

On the JDO path, we use the partition name to fetch the matching partitions:
{code:java}
SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MPartition' AS 
DN_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.PART_NAME AS 
NUCORDER0,A0.WRITE_ID,A0.PART_ID FROM PARTITIONS A0 LEFT OUTER JOIN TBLS B0 ON 
A0.TBL_ID = B0.TBL_ID LEFT OUTER JOIN DBS C0 ON B0.DB_ID = C0.DB_ID WHERE 
B0.TBL_NAME = <'payments'> AND C0."NAME" = <'default'> AND C0.CTLG_NAME = 
<'hive'> AND A0.PART_NAME = <'txn_datetime=2023-03-26 03%3A30%3A00'> ORDER BY 
NUCORDER0 {code}
In the above example, the filter is A0.PART_NAME = <'txn_datetime=2023-03-26 
03%3A30%3A00'>. The comparison on A0.PART_NAME is timezone agnostic, which could 
lead to a wrong result if <'txn_datetime=2023-03-26 03%3A30%3A00'> is a 
timezone-based partition name.
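For context, the %3A in the partition name above comes from Hive percent-encoding special characters such as ':' when it builds a partition name from the partition column value. A rough Python approximation of that encoding (the safe-character set below is a simplification, not Hive's exact escape list):

```python
from urllib.parse import quote

# Hive escapes characters such as ':' when building the partition name
# from the column value; percent-encoding approximates this behavior.
# The safe set here (space, '-', '=') is a simplification for this example.
value = "2023-03-26 03:30:00"
part_name = "txn_datetime=" + quote(value, safe=" -=")

print(part_name)  # txn_datetime=2023-03-26 03%3A30%3A00
```

The JDO query then compares this escaped string byte for byte, so it has no way to account for any timezone conversion applied to the original value.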

Compare this with the direct SQL mode:
{code:java}
select "PARTITIONS"."PART_ID" from "PARTITIONS"  inner join "TBLS" on 
"PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID"     and "TBLS"."TBL_NAME" = 
<'payments'>   inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID"      and 
"DBS"."NAME" = <'default'> inner join "PARTITION_KEY_VALS" "FILTER0" on 
"FILTER0"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER0"."INTEGER_IDX" = 0 
where "DBS"."CTLG_NAME" = <'hive'>  and ((cast((case when 
"FILTER0"."PART_KEY_VAL" <> <'__HIVE_DEFAULT_PARTITION__'> and 
"TBLS"."TBL_NAME" = <'payments'> and "DBS"."NAME" = <'default'> and 
"DBS"."CTLG_NAME" = <'hive'> and "FILTER0"."PART_ID" = "PARTITIONS"."PART_ID" 
and "FILTER0"."INTEGER_IDX" = 0 then cast("FILTER0"."PART_KEY_VAL" as 
TIMESTAMP) else null end) as TIMESTAMP) = <'2023-03-26 03:30:00'>)) {code}
Here the filter is cast((case when ... then cast("FILTER0"."PART_KEY_VAL" as 
TIMESTAMP) else null end) as TIMESTAMP) = <'2023-03-26 03:30:00'>. So if we 
push down a timezone-based <'2023-03-26 03:30:00'> and the underlying database 
can evaluate the cast of "FILTER0"."PART_KEY_VAL" to TIMESTAMP properly, we 
get the expected result.
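To see where a timezone-based '2023-03-26 03:30:00' can come from: 02:30 on 2023-03-26 does not exist in Europe/Paris (the clock jumps from 02:00 CET straight to 03:00 CEST), so a round trip through that zone shifts the value by an hour. A minimal Python sketch of that shift:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib in Python 3.9+

paris = ZoneInfo("Europe/Paris")

# 02:30 local time does not exist on 2023-03-26 in Europe/Paris;
# with fold=0 the gap time resolves using the pre-transition offset (+01:00).
local = datetime(2023, 3, 26, 2, 30, tzinfo=paris)
utc = local.astimezone(timezone.utc)   # resolves to 01:30 UTC
roundtrip = utc.astimezone(paris)      # comes back as 03:30 CEST

print(roundtrip.strftime("%Y-%m-%d %H:%M:%S"))  # 2023-03-26 03:30:00
```

This mirrors how a timezone-based conversion turns the inserted 02:30 value into the 03:30 seen in the pushed-down literal, while the Hive TIMESTAMP type itself is meant to be timezone agnostic.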

However, when I switch the backing database to Postgres, the direct SQL path 
throws an exception:
{noformat}
Caused by: org.postgresql.util.PSQLException: ERROR: operator does not exist: 
timestamp without time zone = character varying
  Hint: No operator matches the given name and argument types. You might need 
to add explicit type casts.
  Position: 662{noformat}
MySQL fails as well:
{noformat}
Caused by: java.sql.SQLSyntaxErrorException: You have an error in your SQL 
syntax; check the manual that corresponds to your MySQL server version for the 
right syntax to use near 'TIMESTAMP) else null end) as TIMESTAMP) = '2023-03-26 
03:30:00'))' at line 1 {noformat}
So I think we should fix the errors on the direct SQL path and make 
timestamp/date comparisons timezone agnostic on the JDO path.



> DirectSQL and JDO results are different when fetching partitions by timestamp 
> in DST shift
> ------------------------------------------------------------------------------------------
>
>                 Key: HIVE-27775
>                 URL: https://issues.apache.org/jira/browse/HIVE-27775
>             Project: Hive
>          Issue Type: Bug
>          Components: Standalone Metastore
>    Affects Versions: 4.0.0-beta-1
>            Reporter: Stamatis Zampetakis
>            Assignee: Zhihua Deng
>            Priority: Critical
>
> DirectSQL and JDO results are different when fetching partitions by timestamp 
> in DST shift.
> {code:sql}
> --! qt:timezone:Europe/Paris
> CREATE EXTERNAL TABLE payments (card string) PARTITIONED BY(txn_datetime 
> TIMESTAMP) STORED AS ORC;
> INSERT into payments VALUES('3333-4444-2222-9999', '2023-03-26 02:30:00');
> SELECT * FROM payments WHERE txn_datetime = '2023-03-26 02:30:00';
> {code}
> The '2023-03-26 02:30:00' is a timestamp that in Europe/Paris timezone falls 
> exactly in the middle of the DST shift. In this particular timezone this date 
> time never really exists since we are jumping directly from 02:00:00 to 
> 03:00:00. However, the TIMESTAMP data type in Hive is timezone agnostic 
> (https://cwiki.apache.org/confluence/display/Hive/Different+TIMESTAMP+types) 
> so it is a perfectly valid timestamp that can be inserted in a table and we 
> must be able to recover it back.
> For the SELECT query above, partition pruning kicks in and calls the 
> ObjectStore#getPartitionsByExpr method in order to fetch the respective 
> partitions matching the timestamp from HMS.
> The tests however reveal that DirectSQL and JDO paths are not returning the 
> same results leading to an exception when VerifyingObjectStore is used. 
> According to the error below DirectSQL is able to recover one partition from 
> HMS (expected) while JDO/ORM returns empty (not expected).
> {noformat}
> 2023-10-06T03:51:19,406 ERROR [80252df4-3fdc-4971-badf-ad67ce8567c7 main] 
> metastore.VerifyingObjectStore: Lists are not the same size: SQL 1, ORM 0
> 2023-10-06T03:51:19,409 ERROR [80252df4-3fdc-4971-badf-ad67ce8567c7 main] 
> metastore.RetryingHMSHandler: MetaException(message:Lists are not the same 
> size: SQL 1, ORM 0)
>       at 
> org.apache.hadoop.hive.metastore.VerifyingObjectStore.verifyLists(VerifyingObjectStore.java:148)
>       at 
> org.apache.hadoop.hive.metastore.VerifyingObjectStore.getPartitionsByExpr(VerifyingObjectStore.java:88)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:97)
>       at com.sun.proxy.$Proxy57.getPartitionsByExpr(Unknown Source)
>       at 
> org.apache.hadoop.hive.metastore.HMSHandler.get_partitions_spec_by_expr(HMSHandler.java:7330)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:98)
>       at 
> org.apache.hadoop.hive.metastore.AbstractHMSHandlerProxy.invoke(AbstractHMSHandlerProxy.java:82)
>       at com.sun.proxy.$Proxy59.get_partitions_spec_by_expr(Unknown Source)
>       at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartitionsSpecByExprInternal(HiveMetaStoreClient.java:2472)
>       at 
> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreClientWithLocalCache.getPartitionsSpecByExprInternal(HiveMetaStoreClientWithLocalCache.java:396)
>       at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.getPartitionsSpecByExprInternal(SessionHiveMetaStoreClient.java:2279)
>       at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsSpecByExpr(HiveMetaStoreClient.java:2484)
>       at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.listPartitionsSpecByExpr(SessionHiveMetaStoreClient.java:1346)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:213)
>       at com.sun.proxy.$Proxy60.listPartitionsSpecByExpr(Unknown Source)
>       at 
> org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByExpr(Hive.java:4507)
>       at 
> org.apache.hadoop.hive.ql.optimizer.ppr.PartitionPruner.getPartitionsFromServer(PartitionPruner.java:457)
>       at 
> org.apache.hadoop.hive.ql.optimizer.ppr.PartitionPruner.prune(PartitionPruner.java:230)
>       at 
> org.apache.hadoop.hive.ql.optimizer.calcite.RelOptHiveTable.computePartitionList(RelOptHiveTable.java:480)
>       at 
> org.apache.hadoop.hive.ql.optimizer.calcite.rules.HivePartitionPruneRule.perform(HivePartitionPruneRule.java:63)
>       at 
> org.apache.hadoop.hive.ql.optimizer.calcite.rules.HivePartitionPruneRule.onMatch(HivePartitionPruneRule.java:46)
>       at 
> org.apache.calcite.plan.AbstractRelOptPlanner.fireRule(AbstractRelOptPlanner.java:333)
>       at org.apache.calcite.plan.hep.HepPlanner.applyRule(HepPlanner.java:542)
>       at 
> org.apache.calcite.plan.hep.HepPlanner.applyRules(HepPlanner.java:407)
>       at 
> org.apache.calcite.plan.hep.HepPlanner.executeInstruction(HepPlanner.java:243)
>       at 
> org.apache.calcite.plan.hep.HepInstruction$RuleInstance.execute(HepInstruction.java:127)
>       at 
> org.apache.calcite.plan.hep.HepPlanner.executeProgram(HepPlanner.java:202)
>       at 
> org.apache.calcite.plan.hep.HepPlanner.findBestExp(HepPlanner.java:189)
>       at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.executeProgram(CalcitePlanner.java:2471)
>       at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.executeProgram(CalcitePlanner.java:2430)
>       at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.applyPreJoinOrderingTransforms(CalcitePlanner.java:1968)
>       at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1712)
>       at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1593)
>       at 
> org.apache.calcite.tools.Frameworks.lambda$withPlanner$0(Frameworks.java:131)
>       at 
> org.apache.calcite.prepare.CalcitePrepareImpl.perform(CalcitePrepareImpl.java:914)
>       at org.apache.calcite.tools.Frameworks.withPrepare(Frameworks.java:180)
>       at org.apache.calcite.tools.Frameworks.withPlanner(Frameworks.java:126)
>       at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.logicalPlan(CalcitePlanner.java:1345)
>       at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:572)
>       at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:13048)
>       at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:467)
>       at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327)
>       at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:224)
>       at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:107)
>       at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:519)
>       at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:471)
>       at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:436)
>       at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:430)
>       at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:121)
>       at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:227)
>       at 
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:257)
>       at org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:201)
>       at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:127)
>       at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:425)
>       at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:356)
>       at 
> org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:733)
>       at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:703)
>       at 
> org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:115)
>       at 
> org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:157)
>       at 
> org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver(TestMiniLlapLocalCliDriver.java:62)
> {noformat}


