zhangdove opened a new issue #1831:
URL: https://github.com/apache/iceberg/issues/1831
1. Environment
```
spark: 3.0.0
hive: 2.3.7
iceberg: 0.10.0
```
2. SparkSession configuration
```scala
val spark = SparkSession
  .builder()
  .master("local[2]")
  .appName("IcebergAPI")
  .config("spark.sql.catalog.hive_prod", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.hive_prod.type", "hive")
  .config("spark.sql.catalog.hive_prod.uri", "thrift://localhost:9083")
  .enableHiveSupport()
  .getOrCreate()
```
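As a quick sanity check of the configuration above, the catalog can be queried from Spark SQL once the session is up (a minimal sketch; `hive_prod` is the catalog name configured above, and the `db` namespace is created in the next step):

```scala
// List the namespaces visible through the configured Iceberg catalog
spark.sql("SHOW NAMESPACES IN hive_prod").show(false)
```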
3. Create database `db` by hive client
```
➜ bin ./beeline
beeline> !connect jdbc:hive2://localhost:10000 hive hive
Connecting to jdbc:hive2://localhost:10000
Connected to: Apache Hive (version 2.3.7)
Driver: Hive JDBC (version 2.3.7)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> create database db;
No rows affected (0.105 seconds)
```
4. Create iceberg table by HiveCatalog using Spark (Link:
https://iceberg.apache.org/hive/#using-hive-catalog)
```scala
import java.util.{ArrayList, List}

import org.apache.iceberg.{PartitionSpec, Schema}
import org.apache.iceberg.catalog.{Namespace, TableIdentifier}
import org.apache.iceberg.hive.HiveCatalog
import org.apache.iceberg.types.Types
import org.apache.spark.sql.SparkSession

def createByHiveCatalog(spark: SparkSession): Unit = {
  val hadoopConfiguration = spark.sparkContext.hadoopConfiguration
  // iceberg.engine.hive.enabled=true
  hadoopConfiguration.set(org.apache.iceberg.hadoop.ConfigProperties.ENGINE_HIVE_ENABLED, "true")
  val hiveCatalog = new HiveCatalog(hadoopConfiguration)
  val nameSpace = Namespace.of("db")
  val tableIdentifier: TableIdentifier = TableIdentifier.of(nameSpace, "tb")
  val columns: List[Types.NestedField] = new ArrayList[Types.NestedField]
  columns.add(Types.NestedField.of(1, true, "id", Types.IntegerType.get, "id doc"))
  columns.add(Types.NestedField.of(2, true, "ts", Types.TimestampType.withZone(), "ts doc"))
  val schema: Schema = new Schema(columns)
  val partition = PartitionSpec.builderFor(schema).year("ts").build()
  hiveCatalog.createTable(tableIdentifier, schema, partition)
}
```
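For comparison, the same table could presumably be created through the configured catalog with Spark SQL DDL instead of the Java API (a sketch; the `years(ts)` partition transform mirrors the `PartitionSpec` builder call above):

```sql
CREATE TABLE hive_prod.db.tb (
  id int COMMENT 'id doc',
  ts timestamp COMMENT 'ts doc')
USING iceberg
PARTITIONED BY (years(ts));
```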
5. Query iceberg table by hive client
```hive
0: jdbc:hive2://localhost:10000> add jar
/Users/dovezhang/software/idea/github/iceberg/hive-runtime/build/libs/iceberg-hive-runtime-0.10.0.jar;
No rows affected (0.043 seconds)
0: jdbc:hive2://localhost:10000> set iceberg.mr.catalog=hive;
No rows affected (0.003 seconds)
0: jdbc:hive2://localhost:10000> select * from db.tb;
+--------+--------+
| tb.id | tb.ts |
+--------+--------+
+--------+--------+
No rows selected (1.166 seconds)
```
6. Write data by HiveCatalog using Spark
```scala
import java.sql.Timestamp

import org.apache.spark.sql.functions

case class dbtb(id: Int, time: Timestamp)

def writeDataToIcebergHive(spark: SparkSession): Unit = {
  val seq = Seq(
    dbtb(1, Timestamp.valueOf("2020-07-06 13:40:00")),
    dbtb(2, Timestamp.valueOf("2020-07-06 14:30:00")),
    dbtb(3, Timestamp.valueOf("2020-07-06 15:20:00")))
  val df = spark.createDataFrame(seq).toDF("id", "ts")
  df.writeTo("hive_prod.db.tb").overwrite(functions.lit(true))
}
```
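One way to confirm the overwrite actually committed is to inspect the table's snapshot history through Iceberg's metadata tables (a sketch; assumes the metadata tables are reachable through the same `hive_prod` catalog):

```scala
// Each successful commit, including the overwrite above, appears as one row here
spark.sql("SELECT snapshot_id, operation, summary FROM hive_prod.db.tb.snapshots").show(false)
```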
7. Query iceberg table by hive client again
```
0: jdbc:hive2://localhost:10000> select * from db.tb;
+--------+--------+
| tb.id | tb.ts |
+--------+--------+
+--------+--------+
No rows selected (0.152 seconds)
```
After writing the data, no rows are returned via the Hive client.
8. Query iceberg table by hive catalog using Spark
```scala
def readIcebergByHiveCatalog(spark: SparkSession): Unit = {
  spark.sql("select * from hive_prod.db.tb").show(false)
}
```
Result
```
+---+-------------------+
|id |ts |
+---+-------------------+
|1 |2020-07-06 13:40:00|
|2 |2020-07-06 14:30:00|
|3 |2020-07-06 15:20:00|
+---+-------------------+
```
9. Check the table's data directory for data files
```
➜ bin hdfs dfs -ls /usr/hive/warehouse/db.db/tb/data/ts_year=2020
20/11/26 15:16:51 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r-- 1 dovezhang supergroup 656 2020-11-26 15:11
/usr/hive/warehouse/db.db/tb/data/ts_year=2020/00000-0-2b98be41-8347-4a8c-a986-d28878ab7a67-00001.parquet
-rw-r--r-- 1 dovezhang supergroup 664 2020-11-26 15:11
/usr/hive/warehouse/db.db/tb/data/ts_year=2020/00001-1-b192e846-5a6a-4ee9-b31a-7e5fcf813b88-00001.parquet
```
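Since the data files clearly exist on HDFS, one further diagnostic is to load the table through the same HiveCatalog and print its current snapshot and location (a sketch; if the Hive metastore entry does not point at the latest table metadata, a Hive reader would see an empty table even though the files are there):

```scala
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.hive.HiveCatalog

val catalog = new HiveCatalog(spark.sparkContext.hadoopConfiguration)
val table = catalog.loadTable(TableIdentifier.of("db", "tb"))
// The current snapshot should reflect the overwrite that produced the two parquet files
println(table.currentSnapshot())
println(table.location())
```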
I am not sure why the Hive client cannot see the data after Spark has created
the table and written to it. Does anyone know why?