[ https://issues.apache.org/jira/browse/HUDI-4784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Guo updated HUDI-4784:
----------------------------
    Status: In Progress  (was: Open)

> Full-record bootstrap does not generate correct partition path
> --------------------------------------------------------------
>
>                 Key: HUDI-4784
>                 URL: https://issues.apache.org/jira/browse/HUDI-4784
>             Project: Apache Hudi
>          Issue Type: Bug
>   Affects Versions: 0.12.0
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>             Fix For: 0.13.0
>
>
> The source partitioned parquet table is structured based on year/month/day.
> The bootstrap operation performs both metadata_only and full_record bootstrap
> using Spark datasource. The partitions with full_record bootstrap do not
> have the correct partition path generated in the target Hudi table.
> {code:java}
> val srcPath = "<>/bootstrap-testing/partitioned-parquet-table-date"
> val basePath = "<>/bootstrap-testing/bootstrap-hudi-table-2"
> val bootstrapDF = spark.emptyDataFrame
> bootstrapDF.write
>   .format("hudi")
>   .option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
>     DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "key")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition")
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts")
>   .option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP, srcPath)
>   .option(HoodieBootstrapConfig.BOOTSTRAP_KEYGEN_CLASS,
>     classOf[SimpleKeyGenerator].getName)
>   .option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR,
>     classOf[BootstrapRegexModeSelector].getName)
>   .option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR_REGEX,
>     "2022/1/2[4-8]")
>   .option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR_REGEX_MODE,
>     "METADATA_ONLY")
>   .option(HoodieBootstrapConfig.FULL_BOOTSTRAP_INPUT_PROVIDER,
>     classOf[SparkParquetBootstrapDataProvider].getName)
>   .mode(SaveMode.Overwrite)
>   .save(basePath) {code}
> {code:java}
> scala> spark.sql("select _hoodie_partition_path, count(*) from test_table group by _hoodie_partition_path ").show
> +----------------------+--------+
> |_hoodie_partition_path|count(1)|
> +----------------------+--------+
> |  __HIVE_DEFAULT_PA...|  249540|
> |             2022/1/28|   49730|
> |             2022/1/27|   50150|
> |             2022/1/24|   49735|
> |             2022/1/26|   51005|
> |             2022/1/25|   49845|
> +----------------------+--------+ {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)