[jira] [Commented] (SPARK-14583) Spark doesn't read hive table properly after MSCK REPAIR

Stephane Maarek (JIRA) Tue, 12 Apr 2016 18:02:37 -0700

    [ https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238352#comment-15238352 ]


Stephane Maarek commented on SPARK-14583:
-----------------------------------------

The behavior is pretty much the same if, instead of MSCK REPAIR, we run
ALTER TABLE spark_testing.test_csv ADD PARTITION (part_a="a", part_b="b");
This makes me believe it is the partitioning that makes Spark fail.

> Spark doesn't read hive table properly after MSCK REPAIR
> --------------------------------------------------------
>
>                 Key: SPARK-14583
>                 URL: https://issues.apache.org/jira/browse/SPARK-14583
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.5.1
>            Reporter: Stephane Maarek
>
> It seems that Spark forgets or fails to read the table metadata (TBLPROPERTIES)
> after an MSCK REPAIR is issued from within Hive.
> Here are the steps to reproduce:
> Create test_data.csv with the following content:
> a,2
> ,3
> Move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/
> Run the following Hive statements:
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv;
> CREATE EXTERNAL TABLE `spark_testing.test_csv`(
>   column_1 varchar(10),
>   column_2 int)
> PARTITIONED BY (
>   `part_a` string,
>   `part_b` string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing'
> TBLPROPERTIES('serialization.null.format'='');
> MSCK REPAIR TABLE spark_testing.test_csv;
> select * from spark_testing.test_csv;
> OK
> a       2       a       b
> NULL    3       a       b
> (You can see the NULL.)
> Now on to Spark:
> +--------+--------+------+------+
> |column_1|column_2|part_a|part_b|
> +--------+--------+------+------+
> |       a|       2|     a|     b|
> |        |       3|     a|     b|
> +--------+--------+------+------+
> As you can see, Spark does not detect the NULL.
> I don't know whether this affects later versions of Spark, and I can't test it in my
> company's environment. The steps are easy to reproduce, though, so this can be tested in
> other environments. My Hive version is 1.2.1.
> Let me know if you have any questions. To me this is a big issue because the data
> isn't read correctly.
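
For reference, the expected effect of 'serialization.null.format'='' on the two rows above can be sketched in plain Python. This is only an illustration of the semantics, not Hive or Spark code: read_rows and null_format are hypothetical names, and the second call assumes that Hive's default null marker (\N) is what applies when the table property is not picked up.

```python
import csv
import io

# Contents of test_data.csv from the reproduction steps.
TEST_DATA = "a,2\n,3\n"


def read_rows(text, null_format=""):
    """Parse comma-delimited text, mapping fields equal to null_format to None."""
    rows = []
    for column_1, column_2 in csv.reader(io.StringIO(text)):
        c1 = None if column_1 == null_format else column_1
        c2 = None if column_2 == null_format else int(column_2)
        rows.append((c1, c2))
    return rows


# Property honored: the empty field becomes None (NULL), as in the Hive output.
honored = read_rows(TEST_DATA, null_format="")

# Property ignored (assumed fallback to the default marker \N): the empty
# field stays an empty string, which resembles the Spark output.
ignored = read_rows(TEST_DATA, null_format="\\N")

print(honored)
print(ignored)
```

The difference between the two results is only in the first column of the second row, which is the same cell that differs between the Hive and Spark outputs in the report.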



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
