[ https://issues.apache.org/jira/browse/SPARK-9414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131054#comment-15131054 ]
Xiu (Joe) Guo commented on SPARK-9414:
--------------------------------------

With the current master [b938301|https://github.com/apache/spark/commit/b93830126cc59a26e2cfb5d7b3c17f9cfbf85988], I could not reproduce this issue by doing the following.

From the Hive 1.2.1 CLI:
{code}
create table test4DimBySpark (mydate int, hh int, x int, y int, height float,
u float, v float, w float, ph float, phb float, p float, pb float,
qvapor float, qgraup float, qnice float, qnrain float, tke_pbl float, el_pbl float)
partitioned by (zone int, z int, year int, month int);
{code}

In spark-shell, I used the first block of Scala code from the description to insert data. I see the correct partition directories in /user/hive/warehouse, and Hive can read the data back fine. Can you check with a newer version of the code? It is probably fixed.

> HiveContext:saveAsTable creates wrong partition for existing hive table (append mode)
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-9414
>                 URL: https://issues.apache.org/jira/browse/SPARK-9414
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>        Environment: Hadoop 2.6, Spark 1.4.0, Hive 0.14.0
>            Reporter: Chetan Dalal
>            Priority: Critical
>
> Raising this bug because I found this issue was already reported on the Apache mail archive and I am facing a similar issue.
> -----------original------------------------------
> I am using Spark 1.4 and HiveContext to append data into a partitioned Hive table. I found that the data inserted into the table is correct, but the partition (folder) created is totally wrong.
> {code}
> val schemaString = "zone z year month date hh x y height u v w ph phb p pb qvapor qgraup qnice qnrain tke_pbl el_pbl"
> val schema =
>   StructType(
>     schemaString.split(" ").map(fieldName =>
>       if (fieldName.equals("zone") || fieldName.equals("z") ||
>           fieldName.equals("year") || fieldName.equals("month") ||
>           fieldName.equals("date") || fieldName.equals("hh") ||
>           fieldName.equals("x") || fieldName.equals("y"))
>         StructField(fieldName, IntegerType, true)
>       else
>         StructField(fieldName, FloatType, true)
>     ))
> val pairVarRDD =
>   sc.parallelize(Seq((Row(2,42,2009,3,1,0,218,365,
>     9989.497.floatValue(),29.627113.floatValue(),19.071793.floatValue(),0.11982734.floatValue(),3174.6812.floatValue(),
>     97735.2.floatValue(),16.389032.floatValue(),-96.62891.floatValue(),25135.365.floatValue(),2.6476808E-5.floatValue(),
>     0.0.floatValue(),13195.351.floatValue(),0.0.floatValue(),0.1.floatValue(),0.0.floatValue()))
>   ))
> val partitionedTestDF2 = sqlContext.createDataFrame(pairVarRDD, schema)
> partitionedTestDF2.write.format("org.apache.spark.sql.hive.orc.DefaultSource")
>   .mode(org.apache.spark.sql.SaveMode.Append)
>   .partitionBy("zone","z","year","month")
>   .saveAsTable("test4DimBySpark")
> {code}
> ---------------------------------------------------------------------------------------------
> The table contains 23 columns (longer than the Tuple maximum length), so I use a Row object to store the raw data, not a Tuple.
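A pattern worth noting in the Row above (an observation from the reported values, not a confirmed root cause): the intended partition values (2, 42, 2009, 3) are the *first* four fields of the row, while the *last* four fields (13195.351, 0.0, 0.1, 0.0), truncated to integers, are exactly the bogus partition spec that appears in the logs below. A minimal plain-Scala sketch (no Spark required; the object and value names are illustrative) of that positional mismatch:

```scala
// Illustrative sketch: how reading partition values by POSITION (the last
// N fields, Hive's physical layout) instead of by NAME would reproduce
// the bogus partition spec reported in this issue.
object PartitionMismatchSketch {
  val partitionCols = Seq("zone", "z", "year", "month")

  // First four and last four values of the Row from the bug report;
  // the measurement columns in between are elided here.
  val firstFour: Seq[Int]  = Seq(2, 42, 2009, 3)
  val lastFour: Seq[Float] = Seq(13195.351f, 0.0f, 0.1f, 0.0f)

  // Correct: the schema lists the partition columns first, so their
  // values should be taken from the leading fields.
  val byName: Map[String, Int] = partitionCols.zip(firstFour).toMap

  // Buggy assumption: partition columns occupy the trailing fields;
  // truncating those floats to ints yields the wrong partition spec.
  val byPosition: Map[String, Int] =
    partitionCols.zip(lastFour.map(_.toInt)).toMap
}
```

Here `byName` is `Map(zone -> 2, z -> 42, year -> 2009, month -> 3)` while `byPosition` is `Map(zone -> 13195, z -> 0, year -> 0, month -> 0)`, matching the `partSpec` in the log output. If that mechanism were the cause, it would explain why the row data itself is stored correctly while only the partition directories are wrong.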
> Here is some output from Spark when it saved the data:
> {code}
> 15/06/16 10:39:22 INFO metadata.Hive: Renaming
>   src:  hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-10000/zone=13195/z=0/year=0/month=0/part-00001;
>   dest: hdfs://service-10-0.local:8020/apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-00001; Status:true
> 15/06/16 10:39:22 INFO metadata.Hive: New loading path =
>   hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-10000/zone=13195/z=0/year=0/month=0
>   with partSpec {zone=13195, z=0, year=0, month=0}
> {code}
> From the raw data (pairVarRDD), zone = 2, z = 42, year = 2009, month = 3, but Spark created the partition {zone=13195, z=0, year=0, month=0}. (x)
>
> When I queried from Hive:
> {code}
> hive> select * from test4dimBySpark;
> OK
> 2    42    2009    3    1.0    0.0    218.0    365.0    9989.497    29.627113    19.071793    0.11982734    -3174.6812    97735.2    16.389032    -96.62891    25135.365    2.6476808E-5    0.0    13195    0    0    0
> hive> select zone, z, year, month from test4dimBySpark;
> OK
> 13195    0    0    0
> hive> dfs -ls /apps/hive/warehouse/test4dimBySpark/*/*/*/*;
> Found 2 items
> -rw-r--r--   3 patcharee hdfs       1411 2015-06-16 10:39 /apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-00001
> {code}
> The data stored in the table is correct (zone = 2, z = 42, year = 2009, month = 3), but the partition created was wrong: "zone=13195/z=0/year=0/month=0". (x)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
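For anyone hitting this on an affected version (1.4.x), one workaround consistent with the positional mismatch above (a suggestion, not something confirmed in this JIRA) is to reorder the DataFrame columns so that the partitionBy columns come last, matching Hive's physical layout where partition columns follow the data columns. The reordering itself is plain collection logic (the helper name is illustrative):

```scala
// Sketch of a column-reordering helper (illustrative; not from the report):
// move the named partition columns to the end, keeping the remaining data
// columns in their original relative order.
object ColumnOrder {
  def partitionLast(columns: Seq[String], partitionCols: Seq[String]): Seq[String] = {
    val dataCols = columns.filterNot(partitionCols.contains)
    dataCols ++ partitionCols.filter(columns.contains)
  }
}
```

Against a DataFrame this would be applied as something like `df.select(ColumnOrder.partitionLast(df.columns, Seq("zone","z","year","month")).map(col): _*)` before the write — again, a sketch under the assumption that column order is what confuses the append path.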