Benjamin BONNET created SPARK-16996:
---------------------------------------
             Summary: Hive ACID delta files not seen
                 Key: SPARK-16996
                 URL: https://issues.apache.org/jira/browse/SPARK-16996
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.5.2
         Environment: Hive 1.2.1, Spark 1.5.2
            Reporter: Benjamin BONNET
            Priority: Critical

spark-sql does not seem to see data stored as delta files in an ACID Hive table.
I encountered the same problem as described here:
http://stackoverflow.com/questions/35955666/spark-sql-is-not-returning-records-for-hive-transactional-tables-on-hdp

For example, create an ACID table with the Hive CLI and insert a row:
{code}
set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=1;

CREATE TABLE deltas(cle string,valeur string)
CLUSTERED BY (cle) INTO 1 BUCKETS
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES ('transactional'='true');

INSERT INTO deltas VALUES("a","a");
{code}

Then run a query with the spark-sql CLI:
{code}
SELECT * FROM deltas;
{code}

That query returns no rows, and there are no errors in the logs.

If you inspect the table files on HDFS, you find only a delta directory:
{code}
~>hdfs dfs -ls /apps/hive/warehouse/deltas
Found 1 items
drwxr-x---   - me hdfs          0 2016-08-10 14:03 /apps/hive/warehouse/deltas/delta_0020943_0020943
{code}

Then, if you run a major compaction on that table (in the Hive CLI):
{code}
ALTER TABLE deltas COMPACT 'MAJOR';
{code}

the delta is compacted into a base directory:
{code}
~>hdfs dfs -ls /apps/hive/warehouse/deltas
Found 1 items
drwxrwxrwx   - me hdfs          0 2016-08-10 15:25 /apps/hive/warehouse/deltas/base_0020943
{code}

Go back to spark-sql, and the same query now returns a result:
{code}
SELECT * FROM deltas;
a a
Time taken: 0.477 seconds, Fetched 1 row(s)
{code}

But the next time you insert into the Hive table:
{code}
INSERT INTO deltas VALUES("b","b");
{code}

spark-sql sees the change immediately:
{code}
SELECT * FROM deltas;
a a
b b
Time taken: 0.122 seconds, Fetched 2 row(s)
{code}

No further compaction has run, yet spark-sql now "sees" both the base AND the delta directory:
{code}
~> hdfs dfs -ls /apps/hive/warehouse/deltas
Found 2 items
drwxrwxrwx   - valdata hdfs          0 2016-08-10 15:25 /apps/hive/warehouse/deltas/base_0020943
drwxr-x---   - valdata hdfs          0 2016-08-10 15:31 /apps/hive/warehouse/deltas/delta_0020956_0020956
{code}
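
To check whether a table is in the state where the query came back empty (delta directories present, no base directory yet), the `hdfs dfs -ls` output can be classified with a small script. This is only a sketch: the function names `classify_acid_dirs` and `only_deltas` are hypothetical helpers, not part of Spark or Hive, and the `base_N` / `delta_N_N` name patterns are assumed from the listings in this report.

```python
import re

# Directory name patterns as they appear in the listings of this report.
# Assumption: names are "base_<txn>" and "delta_<min txn>_<max txn>".
BASE_RE = re.compile(r"^base_(\d+)$")
DELTA_RE = re.compile(r"^delta_(\d+)_(\d+)$")


def classify_acid_dirs(ls_output):
    """Split the text output of `hdfs dfs -ls <table dir>` into
    (base directory names, delta directory names)."""
    bases, deltas = [], []
    for line in ls_output.splitlines():
        fields = line.split()
        # A directory entry has 8 fields and a permission string starting
        # with "d"; this skips the "Found N items" header and plain files.
        if len(fields) < 8 or not fields[0].startswith("d"):
            continue
        name = fields[-1].rstrip("/").rsplit("/", 1)[-1]
        if BASE_RE.match(name):
            bases.append(name)
        elif DELTA_RE.match(name):
            deltas.append(name)
    return bases, deltas


def only_deltas(ls_output):
    """True when there are delta directories but no base directory,
    i.e. the state in which this report observed an empty result."""
    bases, deltas = classify_acid_dirs(ls_output)
    return bool(deltas) and not bases
```

Feeding it the first listing from this report (one `delta_0020943_0020943` directory) makes `only_deltas` return `True`; after the major compaction, or after the second insert on top of `base_0020943`, it returns `False`.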