[ https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238352#comment-15238352 ]
Stephane Maarek commented on SPARK-14583:
-----------------------------------------

Pretty much the same behavior if, instead of MSCK REPAIR, we run
ALTER TABLE spark_testing.test_csv ADD PARTITION (part_a="a", part_b="b");

This makes me believe it's the partitioning that makes Spark fail.

> Spark doesn't read hive table properly after MSCK REPAIR
> --------------------------------------------------------
>
>                 Key: SPARK-14583
>                 URL: https://issues.apache.org/jira/browse/SPARK-14583
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.5.1
>            Reporter: Stephane Maarek
>
> It seems that Spark forgets or fails to read the table metadata
> (TBLPROPERTIES) after an MSCK REPAIR is issued from within Hive.
> Here are the steps to reproduce:
> Create test_data.csv with the following content:
> a,2
> ,3
> Move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/
> Run the following Hive statements:
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv;
> CREATE EXTERNAL TABLE `spark_testing.test_csv`(
>   column_1 varchar(10),
>   column_2 int)
> PARTITIONED BY (
>   `part_a` string,
>   `part_b` string)
> ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing'
> TBLPROPERTIES('serialization.null.format'='');
> MSCK REPAIR TABLE spark_testing.test_csv;
> select * from spark_testing.test_csv;
> OK
> a       2       a       b
> NULL    3       a       b
> (you can see the NULL)
> Now onto Spark:
> +--------+--------+------+------+
> |column_1|column_2|part_a|part_b|
> +--------+--------+------+------+
> |       a|       2|     a|     b|
> |        |       3|     a|     b|
> +--------+--------+------+------+
> As you can see, Spark doesn't detect the NULL: column_1 in the second row
> comes back as an empty string.
> I don't know if it affects later versions of Spark, and I can't test that in
> my company's environment. The steps are easy to reproduce, though, so they
> can be tested in other environments. My Hive version is 1.2.1.
> Let me know if you have any questions. To me that's a big issue because the
> data isn't read correctly.
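For readers unfamiliar with 'serialization.null.format', the difference between the two outputs above can be illustrated with a small, self-contained Python sketch. It is only a model of the expected semantics, not Hive's or Spark's actual code: the function name parse_row and its parameters are made up for illustration, and the default of \N is Hive's usual null marker for delimited text tables.

```python
from typing import List, Optional


def parse_row(line: str, null_format: str = "\\N", sep: str = ",") -> List[Optional[str]]:
    """Split a delimited line and map fields equal to null_format to None.

    Hypothetical helper: models what 'serialization.null.format' is
    expected to do for a text-backed table.
    """
    return [None if field == null_format else field for field in line.split(sep)]


# The two lines of test_data.csv from the reproduction steps.
rows = ["a,2", ",3"]

# Table property honoured ('serialization.null.format'=''): empty field -> NULL.
with_property = [parse_row(r, null_format="") for r in rows]

# Table property ignored (default null marker): empty field stays an empty string.
without_property = [parse_row(r) for r in rows]

print(with_property)
print(without_property)
```

The first result corresponds to what Hive returns for column_1, and the second to what Spark shows in the report, which is consistent with the property not being applied when the partitioned table is read.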