zuyanton opened a new issue #2509: URL: https://github.com/apache/hudi/issues/2509
**Describe the problem you faced**

It looks like a column of type `org.apache.spark.sql.types.TimestampType` gets converted to `bigint` when the dataframe is saved to a Hudi table.

**To Reproduce**

Create a dataframe with a `TimestampType` column:
```
var seq = Seq((1, "2020-01-01 11:22:30", 2, 2))
var df = seq.toDF("pk", "time_string", "partition", "sort_key")
df = df.withColumn("timestamp", col("time_string").cast(TimestampType))
```
Preview the dataframe:
```
df.show
```
```
+---+-------------------+---------+--------+-------------------+
| pk|        time_string|partition|sort_key|          timestamp|
+---+-------------------+---------+--------+-------------------+
|  1|2020-01-01 11:22:30|        2|       2|2020-01-01 11:22:30|
+---+-------------------+---------+--------+-------------------+
```
Save the dataframe to a Hudi table:
```
df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save("s3://location")
```
View the Hudi table:
```
spark.sql("select * from testTable2").show
```
Result: the `timestamp` column comes back as a `bigint`:
```
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------------------+--------+----------------+---------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| pk|        time_string|sort_key|       timestamp|partition|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------------------+--------+----------------+---------+
|     20210201004527|  20210201004527_0_1|              pk:1|                     2|2972ef96-279b-438...|  1|2020-01-01 11:22:30|       2|1577877750000000|        2|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------------------+--------+----------------+---------+
```
View the schema:
```
spark.sql("describe testTable2").show
```
Result:
```
+--------------------+---------+-------+
|            col_name|data_type|comment|
+--------------------+---------+-------+
| _hoodie_commit_time|   string|   null|
|_hoodie_commit_seqno|   string|   null|
|  _hoodie_record_key|   string|   null|
|_hoodie_partition...|   string|   null|
|   _hoodie_file_name|   string|   null|
|                  pk|      int|   null|
|         time_string|   string|   null|
|            sort_key|      int|   null|
|           timestamp|   bigint|   null|
|           partition|      int|   null|
|# Partition Infor...|         |       |
|          # col_name|data_type|comment|
|           partition|      int|   null|
+--------------------+---------+-------+
```

**Environment Description**

* Hudi version : 0.7.0
* Spark version :
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
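For reference, the `bigint` value shown above is the timestamp encoded as microseconds since the Unix epoch (1577877750000000 µs = 2020-01-01 11:22:30 UTC), which matches Parquet's `timestamp-micros` representation. A minimal sketch for decoding it back on the query side, assuming the synced table is queryable as `testTable2`:
```
// Sketch: decode the bigint column by treating it as epoch microseconds.
import org.apache.spark.sql.functions.col

spark.table("testTable2")
  .withColumn("timestamp_decoded", (col("timestamp") / 1000000L).cast("timestamp"))
  .select("pk", "timestamp", "timestamp_decoded")
  .show(false)
```
Casting a numeric column to `timestamp` in Spark interprets the value as seconds since the epoch, hence the division by 1,000,000.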
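Hive sync also exposes a `hoodie.datasource.hive_sync.support_timestamp` option that registers `timestamp-micros` fields as Hive `timestamp` instead of `bigint`; whether it is available and effective in this exact Hudi version is an assumption, not something verified here. A sketch of the write with that option set, reusing the `hudiOptions` map and `df` from the full snippet in the Additional context below:
```
// Sketch (assumption: the hive-sync support_timestamp option exists in the
// Hudi version in use). hudiOptions and df are defined in the snippet below.
val hudiOptionsWithTs = hudiOptions +
  ("hoodie.datasource.hive_sync.support_timestamp" -> "true")

df.write.format("org.apache.hudi")
  .options(hudiOptionsWithTs)
  .mode(SaveMode.Append)
  .save("s3://location")
```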
**Additional context**

Full code snippet:
```
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.keygen.ComplexKeyGenerator
import org.apache.hudi.common.model.DefaultHoodieRecordPayload
import org.apache.hadoop.hive.conf.HiveConf

val hiveConf = new HiveConf()
val hiveMetastoreURI = hiveConf.get("hive.metastore.uris").replaceAll("thrift://", "")
val hiveServer2URI = hiveMetastoreURI.substring(0, hiveMetastoreURI.lastIndexOf(":"))

var hudiOptions = Map[String, String](
  HoodieWriteConfig.TABLE_NAME -> "testTable2",
  "hoodie.consistency.check.enabled" -> "true",
  DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "pk",
  DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY -> classOf[ComplexKeyGenerator].getName,
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "partition",
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "sort_key",
  DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
  DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> "testTable2",
  DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "partition",
  DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName,
  DataSourceWriteOptions.HIVE_URL_OPT_KEY -> s"jdbc:hive2://$hiveServer2URI:10000",
  "hoodie.payload.ordering.field" -> "sort_key",
  DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY -> classOf[DefaultHoodieRecordPayload].getName
)

//spark.sql("drop table if exists testTable1")
var seq = Seq((1, "2020-01-01 11:22:30", 2, 2))
var df = seq.toDF("pk", "time_string", "partition", "sort_key")
df = df.withColumn("timestamp", col("time_string").cast(TimestampType))
df.show
df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save("s3://location")
spark.sql("select * from testTable2").show
```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org