[ https://issues.apache.org/jira/browse/SPARK-9414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131054#comment-15131054 ]
Xiu (Joe) Guo commented on SPARK-9414:
--------------------------------------

With the current master [b938301|https://github.com/apache/spark/commit/b93830126cc59a26e2cfb5d7b3c17f9cfbf85988], I could not reproduce this issue by doing the following.

From the Hive 1.2.1 CLI:
{code}
create table test4DimBySpark (mydate int, hh int, x int, y int, height float,
u float, v float, w float, ph float, phb float, p float, pb float,
qvapor float, qgraup float, qnice float, qnrain float, tke_pbl float, el_pbl float)
partitioned by (zone int, z int, year int, month int);
{code}

In spark-shell, I used the first block of Scala code from the description to insert data. I see the correct partition directories in /user/hive/warehouse, and Hive can read the data back fine. Can you check with a newer version of the code? It is probably fixed.

> HiveContext:saveAsTable creates wrong partition for existing hive table (append mode)
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-9414
>                 URL: https://issues.apache.org/jira/browse/SPARK-9414
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>        Environment: Hadoop 2.6, Spark 1.4.0, Hive 0.14.0
>            Reporter: Chetan Dalal
>            Priority: Critical
>
> Raising this bug because I found this issue was already reported on the Apache mail archive and I am facing a similar issue.
> -----------original------------------------------
> I am using Spark 1.4 and HiveContext to append data into a partitioned Hive table. I found that the data inserted into the table is correct, but the partition (folder) created is totally wrong.
> {code}
> val schemaString = "zone z year month date hh x y height u v w ph phb p pb qvapor qgraup qnice qnrain tke_pbl el_pbl"
> val schema =
>   StructType(
>     schemaString.split(" ").map(fieldName =>
>       if (fieldName.equals("zone") || fieldName.equals("z") ||
>           fieldName.equals("year") || fieldName.equals("month") ||
>           fieldName.equals("date") || fieldName.equals("hh") ||
>           fieldName.equals("x") || fieldName.equals("y"))
>         StructField(fieldName, IntegerType, true)
>       else
>         StructField(fieldName, FloatType, true)
>     ))
> val pairVarRDD =
>   sc.parallelize(Seq((Row(2,42,2009,3,1,0,218,365,
>     9989.497.floatValue(),29.627113.floatValue(),19.071793.floatValue(),0.11982734.floatValue(),3174.6812.floatValue(),
>     97735.2.floatValue(),16.389032.floatValue(),-96.62891.floatValue(),25135.365.floatValue(),2.6476808E-5.floatValue(),
>     0.0.floatValue(),13195.351.floatValue(),0.0.floatValue(),0.1.floatValue(),0.0.floatValue()))
>   ))
> val partitionedTestDF2 = sqlContext.createDataFrame(pairVarRDD, schema)
> partitionedTestDF2.write.format("org.apache.spark.sql.hive.orc.DefaultSource")
>   .mode(org.apache.spark.sql.SaveMode.Append)
>   .partitionBy("zone","z","year","month")
>   .saveAsTable("test4DimBySpark")
> {code}
> ---------------------------------------------------------------------------------------------
> The table contains 23 columns (longer than the Tuple maximum length), so I use a Row object to store the raw data, not a Tuple.
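A pattern worth noting in the Row above (an observation from the reported values, not a confirmed root cause): the intended partition values (2, 42, 2009, 3) are the *first* four fields of the row, while the *last* four fields (13195.351, 0.0, 0.1, 0.0), truncated to integers, are exactly the bogus partition spec that appears in the logs below. A minimal plain-Scala sketch (no Spark required; the object and value names are illustrative) of that positional mismatch:

```scala
// Illustrative sketch: how reading partition values by POSITION (the last
// N fields, Hive's physical layout) instead of by NAME would reproduce
// the bogus partition spec reported in this issue.
object PartitionMismatchSketch {
  val partitionCols = Seq("zone", "z", "year", "month")

  // First four and last four values of the Row from the bug report;
  // the measurement columns in between are elided here.
  val firstFour: Seq[Int]  = Seq(2, 42, 2009, 3)
  val lastFour: Seq[Float] = Seq(13195.351f, 0.0f, 0.1f, 0.0f)

  // Correct: the schema lists the partition columns first, so their
  // values should be taken from the leading fields.
  val byName: Map[String, Int] = partitionCols.zip(firstFour).toMap

  // Buggy assumption: partition columns occupy the trailing fields;
  // truncating those floats to ints yields the wrong partition spec.
  val byPosition: Map[String, Int] =
    partitionCols.zip(lastFour.map(_.toInt)).toMap
}
```

Here `byName` is `Map(zone -> 2, z -> 42, year -> 2009, month -> 3)` while `byPosition` is `Map(zone -> 13195, z -> 0, year -> 0, month -> 0)`, matching the `partSpec` in the log output. If that mechanism were the cause, it would explain why the row data itself is stored correctly while only the partition directories are wrong.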
> Here is some output from Spark when it saved the data:
> {code}
> 15/06/16 10:39:22 INFO metadata.Hive: Renaming
>   src:  hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-10000/zone=13195/z=0/year=0/month=0/part-00001;
>   dest: hdfs://service-10-0.local:8020/apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-00001; Status:true
> 15/06/16 10:39:22 INFO metadata.Hive: New loading path =
>   hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-10000/zone=13195/z=0/year=0/month=0
>   with partSpec {zone=13195, z=0, year=0, month=0}
> {code}
> From the raw data (pairVarRDD), zone = 2, z = 42, year = 2009, month = 3, but Spark created the partition {zone=13195, z=0, year=0, month=0}. (x)
>
> When I queried from Hive:
> {code}
> hive> select * from test4dimBySpark;
> OK
> 2    42    2009    3    1.0    0.0    218.0    365.0    9989.497    29.627113    19.071793    0.11982734    -3174.6812    97735.2    16.389032    -96.62891    25135.365    2.6476808E-5    0.0    13195    0    0    0
> hive> select zone, z, year, month from test4dimBySpark;
> OK
> 13195    0    0    0
> hive> dfs -ls /apps/hive/warehouse/test4dimBySpark/*/*/*/*;
> Found 2 items
> -rw-r--r--   3 patcharee hdfs       1411 2015-06-16 10:39 /apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-00001
> {code}
> The data stored in the table is correct (zone = 2, z = 42, year = 2009, month = 3), but the partition created was wrong: "zone=13195/z=0/year=0/month=0". (x)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
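For anyone hitting this on an affected version (1.4.x), one workaround consistent with the positional mismatch above (a suggestion, not something confirmed in this JIRA) is to reorder the DataFrame columns so that the partitionBy columns come last, matching Hive's physical layout where partition columns follow the data columns. The reordering itself is plain collection logic (the helper name is illustrative):

```scala
// Sketch of a column-reordering helper (illustrative; not from the report):
// move the named partition columns to the end, keeping the remaining data
// columns in their original relative order.
object ColumnOrder {
  def partitionLast(columns: Seq[String], partitionCols: Seq[String]): Seq[String] = {
    val dataCols = columns.filterNot(partitionCols.contains)
    dataCols ++ partitionCols.filter(columns.contains)
  }
}
```

Against a DataFrame this would be applied as something like `df.select(ColumnOrder.partitionLast(df.columns, Seq("zone","z","year","month")).map(col): _*)` before the write — again, a sketch under the assumption that column order is what confuses the append path.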