Re: HiveContext saveAsTable create wrong partition
If you are writing to an existing Hive table, our insert-into operator follows Hive's requirement: "the dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause." You can find this requirement at https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions. If you use a select to reorder the columns so that the partition columns come last, I think it should work. Also, since the table is an existing Hive table, you do not need to specify the format; we will use the format of the existing table. Btw, please feel free to open a JIRA about removing this requirement for inserting into an existing Hive table.
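For example, here is a minimal sketch of that reordering, using the schema and table name from the report quoted at the end of this thread (the explicit column list and the use of DataFrameWriter.insertInto are illustrative assumptions, not tested code):

---
// Sketch: reorder the columns with select() so that the dynamic partition
// columns (zone, z, year, month) come last, in the same order as in the
// table's PARTITION() clause, before inserting into the existing table.
val reordered = partitionedTestDF2.select(
  "date", "hh", "x", "y", "height", "u", "v", "w", "ph", "phb", "t",
  "p", "pb", "qvapor", "qgraup", "qnice", "qnrain", "tke_pbl", "el_pbl",
  "zone", "z", "year", "month")  // partition columns last

reordered.write
  .mode(org.apache.spark.sql.SaveMode.Append)
  .insertInto("test4DimBySpark")
---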
Thanks,

Yin

On Thu, Jun 18, 2015 at 9:39 PM, Yin Huai wrote:

> Are you writing to an existing hive orc table?
>
> On Wed, Jun 17, 2015 at 3:25 PM, Cheng Lian wrote:
>
>> Thanks for reporting this. Would you mind helping create a JIRA for this?
>>
>> On 6/16/15 2:25 AM, patcharee wrote:
>>
>>> I found if I move the partitioned columns in schemaString and in Row to
>>> the end of the sequence, then it works correctly...
>>>
>>> On 16. juni 2015 11:14, patcharee wrote:
>>>
>>>> [original report snipped; quoted in full in the last message of this thread]
Re: HiveContext saveAsTable create wrong partition
Are you writing to an existing hive orc table?

On Wed, Jun 17, 2015 at 3:25 PM, Cheng Lian wrote:

> Thanks for reporting this. Would you mind helping create a JIRA for this?
>
> On 6/16/15 2:25 AM, patcharee wrote:
>
>> I found if I move the partitioned columns in schemaString and in Row to
>> the end of the sequence, then it works correctly...
>>
>> On 16. juni 2015 11:14, patcharee wrote:
>>
>>> [original report snipped; quoted in full in the last message of this thread]
Re: HiveContext saveAsTable create wrong partition
Thanks for reporting this. Would you mind helping create a JIRA for this?

On 6/16/15 2:25 AM, patcharee wrote:

> I found if I move the partitioned columns in schemaString and in Row to
> the end of the sequence, then it works correctly...
>
> On 16. juni 2015 11:14, patcharee wrote:
>
>> [original report snipped; quoted in full in the last message of this thread]
Re: HiveContext saveAsTable create wrong partition
I found if I move the partitioned columns in schemaString and in Row to the end of the sequence, then it works correctly... (a sketch of that reordering follows the quoted message below)

On 16. juni 2015 11:14, patcharee wrote:

> Hi,
>
> I am using spark 1.4 and HiveContext to append data into a partitioned
> hive table. I found that the data inserted into the table is correct, but
> the partition (folder) created is totally wrong. Below is my code snippet:
>
> ---
> val schemaString = "zone z year month date hh x y height u v w ph phb t p pb qvapor qgraup qnice qnrain tke_pbl el_pbl"
>
> val schema =
>   StructType(
>     schemaString.split(" ").map(fieldName =>
>       if (fieldName.equals("zone") || fieldName.equals("z") ||
>           fieldName.equals("year") || fieldName.equals("month") ||
>           fieldName.equals("date") || fieldName.equals("hh") ||
>           fieldName.equals("x") || fieldName.equals("y"))
>         StructField(fieldName, IntegerType, true)
>       else
>         StructField(fieldName, FloatType, true)
>     ))
>
> val pairVarRDD =
>   sc.parallelize(Seq((Row(2, 42, 2009, 3, 1, 0, 218, 365,
>     9989.497.floatValue(), 29.627113.floatValue(), 19.071793.floatValue(),
>     0.11982734.floatValue(), 3174.6812.floatValue(), 97735.2.floatValue(),
>     16.389032.floatValue(), -96.62891.floatValue(), 25135.365.floatValue(),
>     2.6476808E-5.floatValue(), 0.0.floatValue(), 13195.351.floatValue(),
>     0.0.floatValue(), 0.1.floatValue(), 0.0.floatValue()))
>   ))
>
> val partitionedTestDF2 = sqlContext.createDataFrame(pairVarRDD, schema)
>
> partitionedTestDF2.write.format("org.apache.spark.sql.hive.orc.DefaultSource")
>   .mode(org.apache.spark.sql.SaveMode.Append)
>   .partitionBy("zone", "z", "year", "month")
>   .saveAsTable("test4DimBySpark")
> ---
>
> The table contains 23 columns (longer than the maximum Tuple length, 22),
> so I use a Row object to store the raw data, not a Tuple.
> Here are some messages from spark when it saved the data:
>
> 15/06/16 10:39:22 INFO metadata.Hive: Renaming src:hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0/part-1;dest: hdfs://service-10-0.local:8020/apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1;Status:true
> 15/06/16 10:39:22 INFO metadata.Hive: New loading path = hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0 with partSpec {zone=13195, z=0, year=0, month=0}
>
> From the raw data (pairVarRDD) zone = 2, z = 42, year = 2009, month = 3.
> But spark created the partition {zone=13195, z=0, year=0, month=0}.
>
> When I queried from hive:
>
> hive> select * from test4dimBySpark;
> OK
> 2  42  2009  3  1.0  0.0  218.0  365.0  9989.497  29.627113  19.071793  0.11982734  -3174.6812  97735.2  16.389032  -96.62891  25135.365  2.6476808E-5  0.0  13195  0  0  0
> hive> select zone, z, year, month from test4dimBySpark;
> OK
> 13195  0  0  0
> hive> dfs -ls /apps/hive/warehouse/test4dimBySpark/*/*/*/*;
> Found 2 items
> -rw-r--r--  3 patcharee hdfs  1411 2015-06-16 10:39 /apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1
>
> The data stored in the table is correct (zone = 2, z = 42, year = 2009,
> month = 3), but the partition created was wrong:
> "zone=13195/z=0/year=0/month=0".
>
> Is this a bug, or what could be wrong? Any suggestion is appreciated.
>
> BR,
> Patcharee
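To make the workaround above concrete, here is an untested sketch (my reconstruction, not patcharee's posted code): the four partition columns are moved to the end of both schemaString and the Row, in the same order as in partitionBy.

---
// Workaround sketch (hypothetical reconstruction): the partition columns
// zone, z, year, month are now the LAST four entries of both the schema
// string and the Row, matching the order passed to partitionBy.
val schemaString = "date hh x y height u v w ph phb t p pb qvapor " +
  "qgraup qnice qnrain tke_pbl el_pbl zone z year month"

val row = Row(
  1, 0, 218, 365,                                  // date, hh, x, y
  9989.497f, 29.627113f, 19.071793f, 0.11982734f,  // height, u, v, w
  3174.6812f, 97735.2f, 16.389032f, -96.62891f,    // ph, phb, t, p
  25135.365f, 2.6476808E-5f, 0.0f, 13195.351f,     // pb, qvapor, qgraup, qnice
  0.0f, 0.1f, 0.0f,                                // qnrain, tke_pbl, el_pbl
  2, 42, 2009, 3)                                  // zone, z, year, month

// The schema is built from schemaString exactly as in the original snippet,
// and the write call is unchanged:
// .partitionBy("zone", "z", "year", "month").saveAsTable("test4DimBySpark")
---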