lamber-ken commented on issue #1105: [WIP] [HUDI-405] Fix sync no hive partition at first time
URL: https://github.com/apache/incubator-hudi/pull/1105#issuecomment-569981982
 
 
   > > So, modify assumeDatePartitioning to ! assumeDatePartitioning is the best way to fix this issue.
   > 
   > This is not going to fix this issue.. and we should not be changing this code.. It will cause side effects, like I mentioned before.
   > 
   > Can we reproduce this issue in a unit test or sample code first?
   
   hi, you can reproduce this issue with the steps below
   
   1. Define a custom PartitionValueExtractor
   ```
   package org.apache.hudi.hive;
   
   import org.joda.time.DateTime;
   import org.joda.time.format.DateTimeFormat;
   import org.joda.time.format.DateTimeFormatter;
   
   import java.util.Collections;
   import java.util.List;
   
    /**
     * Extracts the Hive partition value from a single-level "yyyy-MM-dd" partition path.
     */
    public class DayPartitionValueExtractor implements PartitionValueExtractor {
    
        // DateTimeFormatter is not Serializable, so keep it transient
        // and rebuild it lazily after deserialization
        private transient DateTimeFormatter dtfOut;
    
        public DayPartitionValueExtractor() {
            this.dtfOut = DateTimeFormat.forPattern("yyyy-MM-dd");
        }
    
        private DateTimeFormatter getDtfOut() {
            if (dtfOut == null) {
                dtfOut = DateTimeFormat.forPattern("yyyy-MM-dd");
            }
            return dtfOut;
        }
    
        @Override
        public List<String> extractPartitionValuesInPath(String partitionPath) {
            // expect a single-level partition path such as "2019-11-12"
            String[] splits = partitionPath.split("-");
            if (splits.length != 3) {
                throw new IllegalArgumentException(
                        "Partition path " + partitionPath + " is not in the form yyyy-MM-dd");
            }
            int year = Integer.parseInt(splits[0]);
            int mm = Integer.parseInt(splits[1]);
            int dd = Integer.parseInt(splits[2]);
            DateTime dateTime = new DateTime(year, mm, dd, 0, 0);
            return Collections.singletonList(getDtfOut().print(dateTime));
        }
    }
   ```
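    
    As a quick sanity check of the extractor (a sketch, assuming the class above is compiled into a jar on the spark-shell classpath):
    ```
    // should return a single partition value for a single-level "yyyy-MM-dd" path
    val extractor = new org.apache.hudi.hive.DayPartitionValueExtractor()
    extractor.extractPartitionValuesInPath("2019-11-12")  // res: [2019-11-12]
    ```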
   
   2. Write data via spark-shell
   ```
    export SPARK_HOME=/work/BigData/install/spark/spark-2.3.3-bin-hadoop2.6
    ${SPARK_HOME}/bin/spark-shell --packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   import org.apache.spark.sql.SaveMode
   
   val basePath = "/tmp/hoodie_test"
    val datas = List("""{ "key": "uuid", "event_time": 1574297893836, "part_date": "2019-11-12"}""")
   val df = spark.read.json(spark.sparkContext.parallelize(datas, 2))
   
   df.write.format("hudi").
       option("hoodie.insert.shuffle.parallelism", "10").
       option("hoodie.upsert.shuffle.parallelism", "10").
       option("hoodie.delete.shuffle.parallelism", "10").
       option("hoodie.bulkinsert.shuffle.parallelism", "10").
   
       option("hoodie.datasource.hive_sync.enable", true).
       option("hoodie.datasource.hive_sync.jdbcurl", 
"jdbc:hive2://0.0.0.0:12326").
       option("hoodie.datasource.hive_sync.username", "dcadmin").
       option("hoodie.datasource.hive_sync.password", "dcadmin").
       option("hoodie.datasource.hive_sync.database", "default").
       option("hoodie.datasource.hive_sync.table", "hoodie_test").
       option("hoodie.datasource.hive_sync.partition_fields", "part_date").
   
       option("hoodie.datasource.hive_sync.assume_date_partitioning", true).
       option("hoodie.datasource.hive_sync.partition_extractor_class", 
"org.apache.hudi.hive.DayPartitionValueExtractor").
   
       option("hoodie.datasource.write.precombine.field", "event_time").
       option("hoodie.datasource.write.recordkey.field", "key").
       option("hoodie.datasource.write.partitionpath.field", "part_date").
   
       option("hoodie.table.name", "hoodie_test").
       mode(SaveMode.Overwrite).
       save(basePath);
   ```
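    
    To confirm the write itself succeeds, the data can be read back straight from the base path (a sketch; the single `/*` glob assumes the one-level `part_date` partition layout):
    ```
    // reads directly from the Hudi base path, bypassing Hive,
    // so it works even when the partition was not synced
    val readBack = spark.read.format("hudi").load(basePath + "/*")
    readBack.select("key", "event_time", "part_date").show(false)
    ```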
   
   3. Query data from Hive
   ```
   no data
   ```
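    
    For example, checking the partitions directly over JDBC with the settings from step 2 (hypothetical snippet; assumes the Hive JDBC driver is on the classpath):
    ```
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = java.sql.DriverManager.getConnection("jdbc:hive2://0.0.0.0:12326/default", "dcadmin", "dcadmin")
    val rs = conn.createStatement().executeQuery("SHOW PARTITIONS hoodie_test")
    // expected "part_date=2019-11-12", but nothing is printed on the first sync
    while (rs.next()) println(rs.getString(1))
    conn.close()
    ```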
   
