[jira] [Created] (HIVE-26528) TIMESTAMP stored via spark-shell DataFrame to Avro returns incorrect value when read using HiveCLI

2022-09-08 Thread xsys (Jira)
xsys created HIVE-26528:
---

 Summary: TIMESTAMP stored via spark-shell DataFrame to Avro 
returns incorrect value when read using HiveCLI
 Key: HIVE-26528
 URL: https://issues.apache.org/jira/browse/HIVE-26528
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Affects Versions: 3.1.2
Reporter: xsys


h2. Describe the bug

We are trying to store the TIMESTAMP {{"2022"}} in a table created via a Spark 
DataFrame. The table is created with the Avro file format. We encounter no 
errors while creating the table and inserting the aforementioned timestamp 
value. However, performing a SELECT query on the table through the HiveCLI returns 
an incorrect value: "+53971-10-02 19:00:"

The root cause of this issue is that Spark's 
[AvroSerializer|https://github.com/apache/spark/blob/v3.2.1/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala#L171-L180]
 serializes timestamps using Avro's 
[TIMESTAMP_MICROS|https://github.com/apache/avro/blob/ee4725c64807549ec74e20e83d35cfc1fe8e90a8/lang/java/avro/src/main/java/org/apache/avro/LogicalTypes.java#L190]
 while Hive's 
[AvroDeserializer|https://github.com/apache/hive/blob/rel/release-3.1.2/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java#L320-L347]
 assumes timestamps are Avro's [TIMESTAMP_MILLIS|https://github.com/apache/avro/blob/ee4725c64807549ec74e20e83d35cfc1fe8e90a8/lang/java/avro/src/main/java/org/apache/avro/LogicalTypes.java#L189] during 
deserialization.
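To illustrate the magnitude of the mismatch, here is a minimal standalone Java sketch (not Hive or Spark code) that misreads an epoch-microseconds value as epoch-milliseconds, the same way the deserializer effectively does:

{code:java}
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class MicrosAsMillis {
    public static void main(String[] args) {
        // 2022-01-01 00:00:00 UTC encoded as epoch MICROseconds,
        // which is how Spark's AvroSerializer writes the value
        long micros = 1_640_995_200_000_000L;

        // Misinterpreting the same long as epoch MILLIseconds,
        // as Hive 3.1.2's AvroDeserializer does, lands the value
        // roughly 52,000 years in the future
        ZonedDateTime misread = Instant.ofEpochMilli(micros).atZone(ZoneOffset.UTC);
        System.out.println(misread); // a date around year +53971
    }
}
{code}

This matches the "+53971-10-02" prefix seen in the HiveCLI output above (the exact day and time also depend on the session time zone).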
h2. Steps to reproduce

On Spark 3.2.1 (commit {{4f25b3f712}}), start {{spark-shell}} with the Avro package:

 
{code:java}
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
 

Execute the following:

 
{code:java}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val ts = Seq("2022").toDF("time").select(to_timestamp(col("time")).as("to_timestamp")).first().getAs[java.sql.Timestamp](0)
val rdd = sc.parallelize(Seq(Row(ts)))
val schema = new StructType().add(StructField("c1", TimestampType, true))
val df = spark.createDataFrame(rdd, schema)
df.show(false)
df.write.mode("overwrite").format("avro").saveAsTable("ws"){code}
 

 

On [Hive 
3.1.2|https://archive.apache.org/dist/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz],
 execute the following:
{noformat}
hive> select * from ws;
OK
+53971-10-02 19:00:{noformat}
 
h2. Expected behavior

We expect the output of the {{SELECT}} query to be {{"2022-01-01 00:00:00"}}. 
We tried other formats like Parquet, and the outcome is consistent 
with this expectation. Moreover, the timestamp is interpreted correctly when 
the table is written via a DataFrame and read via spark-shell/spark-sql:
h3. Can be read correctly from spark-shell:

 
{code:java}
scala> spark.sql("select * from ws;").show(false)
+-------------------+
|c1                 |
+-------------------+
|2022-01-01 00:00:00|
+-------------------+{code}
 
h3. Can be read correctly from spark-sql:

 
{noformat}
spark-sql> select * from ws;
2022-01-01 00:00:00
Time taken: 0.063 seconds, Fetched 1 row(s){noformat}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-26527) When a Hive partition field contains special characters, small-file merging produces abnormal partition field values

2022-09-08 Thread sunyan (Jira)
sunyan created HIVE-26527:
-

 Summary: When a Hive partition field contains special characters, small-file merging produces abnormal partition field values
 Key: HIVE-26527
 URL: https://issues.apache.org/jira/browse/HIVE-26527
 Project: Hive
  Issue Type: Bug
Reporter: sunyan








[jira] [Created] (HIVE-26526) MSCK sync is not removing partitions with special characters

2022-09-08 Thread Naresh P R (Jira)
Naresh P R created HIVE-26526:
-

 Summary: MSCK sync is not removing partitions with special 
characters
 Key: HIVE-26526
 URL: https://issues.apache.org/jira/browse/HIVE-26526
 Project: Hive
  Issue Type: New Feature
Reporter: Naresh P R


The PARTITIONS table stores the URL-encoded string, while PARTITION_KEY_VALS 
stores the original string.
{code:java}
hive=> select * from "PARTITION_KEY_VALS" where "PART_ID" IN (46753, 46754, 46755, 46756);
 PART_ID |    PART_KEY_VAL     | INTEGER_IDX
---------+---------------------+-------------
   46753 | 2022-02-*           |           0
   46754 | 2011-03-01          |           0
   46755 | 2022-01-*           |           0
   46756 | 2010-01-01          |           0


hive=> select * from "PARTITIONS" where "TBL_ID" = 23567 ;
 PART_ID | CREATE_TIME | LAST_ACCESS_TIME |       PART_NAME       | SD_ID | TBL_ID | WRITE_ID
---------+-------------+------------------+-----------------------+-------+--------+----------
   46753 |           0 |                0 | part_date=2022-02-%2A | 70195 |  23567 |        0
   46754 |           0 |                0 | part_date=2011-03-01  | 70196 |  23567 |        0
   46755 |           0 |                0 | part_date=2022-01-%2A | 70197 |  23567 |        0
   46756 |           0 |                0 | part_date=2010-01-01  | 70198 |  23567 |        0
(4 rows){code}
 

1) DirectSQL has a join condition on PARTITION_KEY_VALS.PART_KEY_VAL = 
"2022-02-%2A" here:
https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java#L883

2) JDO has a filter condition on PARTITIONS.PART_NAME = 
"part_date=2022-02-%252A" (i.e., URL-encoded twice):
once from HS2:
https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreChecker.java#L353
and a second time from HMS:
[https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/parser/ExpressionTree.java#L365]

The above conditions return 0 partitions, so those partitions are not removed 
from the HMS metadata.
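The double encoding can be reproduced with a small, hypothetical escaping helper. This is not the actual Hive code (Hive's {{FileUtils.escapePathName}} handles a larger character set); it only percent-escapes the two characters relevant to this example:

{code:java}
public class DoubleEncode {
    // Hypothetical percent-escaper for '*' and '%' only;
    // Hive's real escaping covers more special characters
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '*' || c == '%') {
                sb.append('%').append(String.format("%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String partVal = "2022-02-*";
        String once = escape(partVal);   // "2022-02-%2A" -- what PARTITIONS stores
        String twice = escape(once);     // "2022-02-%252A" -- what the JDO filter ends up with
        System.out.println(once + " vs " + twice);
    }
}
{code}

Since neither the once-encoded nor the twice-encoded string matches the raw value in PARTITION_KEY_VALS, the lookup finds nothing to drop.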

 

Attaching a repro .q file.





[jira] [Created] (HIVE-26525) Update llap-server python scripts to be compatible with python 3

2022-09-08 Thread Simhadri Govindappa (Jira)
Simhadri Govindappa created HIVE-26525:
--

 Summary: Update llap-server python scripts to be compatible with 
python 3
 Key: HIVE-26525
 URL: https://issues.apache.org/jira/browse/HIVE-26525
 Project: Hive
  Issue Type: Task
Reporter: Simhadri Govindappa
Assignee: Simhadri Govindappa


{{llap-server/src/main/resources/package.py}} and 
{{llap-server/src/main/resources/argparse.py}} are not compatible with Python 3.





[jira] [Created] (HIVE-26524) Use Calcite to remove sections of a query plan known to never produce rows

2022-09-08 Thread Krisztian Kasa (Jira)
Krisztian Kasa created HIVE-26524:
-

 Summary: Use Calcite to remove sections of a query plan known to 
never produce rows
 Key: HIVE-26524
 URL: https://issues.apache.org/jira/browse/HIVE-26524
 Project: Hive
  Issue Type: Improvement
  Components: CBO
Reporter: Krisztian Kasa
Assignee: Krisztian Kasa


Calcite has a set of rules to remove sections of a query plan that are known to 
never produce any rows. In some cases the whole plan can be removed. Such plans are 
represented with a single {{Values}} operator with no tuples, e.g.:
{code}
select y + 1 from (select a1 y, b1 z from t1 where b1 > 10) q WHERE 1=0
{code}
{code}
HiveValues(tuples=[[]])
{code}

In other cases, when the plan has outer join or set operators, some branches can be 
replaced with empty values; as a consequence, the join/set operator itself can be removed:
{code}
select a2, b2 from t2 where 1=0
union
select a1, b1 from t1
{code}

{code}
HiveAggregate(group=[{0, 1}])
  HiveTableScan(table=[[default, t1]], table:alias=[t1])
{code}





[jira] [Created] (HIVE-26523) Hive job stuck for a long time

2022-09-08 Thread Mayank Kunwar (Jira)
Mayank Kunwar created HIVE-26523:


 Summary: Hive job stuck for a long time
 Key: HIVE-26523
 URL: https://issues.apache.org/jira/browse/HIVE-26523
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Affects Versions: 4.0.0-alpha-1
Reporter: Mayank Kunwar
Assignee: Mayank Kunwar


The default value of "hive.server2.tez.initialize.default.sessions" is true, 
due to which the query was stuck waiting to choose a session from the default 
queue pool, as the default queue pool size is set to 1.

{noformat}
2022-07-10 16:34:23,831 INFO org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager: [HiveServer2-Background-Pool: Thread-184167]: Choosing a session from the defaultQueuePool
2022-07-10 18:15:48,295 INFO org.apache.hadoop.hive.ql.exec.tez.TezTask: [HiveServer2-Background-Pool: Thread-184167]: Subscribed to counters: [] for queryId: hive_20220710163423_c3f3deed-7a41-4865-9ce6-756fc7e6fbb8
2022-07-10 18:15:48,295 INFO org.apache.hadoop.hive.ql.exec.tez.TezTask: [HiveServer2-Background-Pool: Thread-184167]: Session is already open{noformat}

 

A possible workaround is to increase the value of 
"hive.server2.tez.sessions.per.default.queue".
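As a sketch, the workaround could be applied in {{hive-site.xml}} like this (the value 4 is an arbitrary example; size it to the expected number of concurrent queries):
{code:xml}
<property>
  <name>hive.server2.tez.sessions.per.default.queue</name>
  <value>4</value>
</property>
{code}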



--
This message was sent by Atlassian Jira
(v8.20.10#820010)