[jira] [Updated] (HUDI-2214) residual temporary files after clustering are not cleaned up
[ https://issues.apache.org/jira/browse/HUDI-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Udit Mehrotra updated HUDI-2214:
    Fix Version/s: 0.9.0 (was: 0.10.0)

> residual temporary files after clustering are not cleaned up
>
> Key: HUDI-2214
> URL: https://issues.apache.org/jira/browse/HUDI-2214
> Project: Apache Hudi
> Issue Type: Bug
> Components: Cleaner
> Affects Versions: 0.8.0
> Environment: Spark 3.1.1, Hadoop 3.1.1
> Reporter: tao meng
> Assignee: tao meng
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> Residual temporary files after clustering are not cleaned up.
>
> Test steps:
>
> Step 1: run an inline clustering job.
>
> val records1 = recordsToStrings(dataGen.generateInserts("001", 1000)).toList
> val inputDF1: Dataset[Row] =
>   spark.read.json(spark.sparkContext.parallelize(records1, 2))
> inputDF1.write.format("org.apache.hudi")
>   .options(commonOpts)
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY.key(),
>     DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY.key(),
>     DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
>   // options for clustering
>   .option("hoodie.parquet.small.file.limit", "0")
>   .option("hoodie.clustering.inline", "true")
>   .option("hoodie.clustering.inline.max.commits", "1")
>   // note: target.file.max.bytes is set twice below; the later 12 MB value
>   // is the one that takes effect
>   .option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824")
>   .option("hoodie.clustering.plan.strategy.small.file.limit", "629145600")
>   .option("hoodie.clustering.plan.strategy.max.bytes.per.group",
>     Long.MaxValue.toString)
>   .option("hoodie.clustering.plan.strategy.target.file.max.bytes",
>     String.valueOf(12 * 1024 * 1024L))
>   .option("hoodie.clustering.plan.strategy.sort.columns", "begin_lat, begin_lon")
>   .mode(SaveMode.Overwrite)
>   .save(basePath)
>
> Step 2: check the temp dir. We find that
> /tmp/junit1835474867260509758/dataset/.hoodie/.temp/ is not empty:
> /tmp/junit1835474867260509758/dataset/.hoodie/.temp/20210723171208
> is not cleaned up.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
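The check in step 2 can be mechanized. Below is a minimal sketch, assuming the directory layout from the report (`<basePath>/.hoodie/.temp/<instant>`); `check_hudi_temp` is a hypothetical helper name, not part of Hudi:

```shell
# check_hudi_temp <basePath>
# Prints any leftover instant directories under <basePath>/.hoodie/.temp
# and returns 1 if residue is found, 0 if the temp dir is clean or absent.
check_hudi_temp() {
  tmp_dir="$1/.hoodie/.temp"
  if [ -d "$tmp_dir" ] && [ -n "$(ls -A "$tmp_dir" 2>/dev/null)" ]; then
    echo "residual temp dirs:"
    ls "$tmp_dir"
    return 1
  fi
  echo "clean"
  return 0
}
```

Running it against the reporter's base path would list the leftover `20210723171208` directory after the clustering commit completes.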
[jira] [Updated] (HUDI-2214) residual temporary files after clustering are not cleaned up
[ https://issues.apache.org/jira/browse/HUDI-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Udit Mehrotra updated HUDI-2214:
    Status: In Progress (was: Open)
[jira] [Updated] (HUDI-2214) residual temporary files after clustering are not cleaned up
[ https://issues.apache.org/jira/browse/HUDI-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-2214:
    Labels: pull-request-available (was: )
[jira] [Updated] (HUDI-2214) residual temporary files after clustering are not cleaned up
[ https://issues.apache.org/jira/browse/HUDI-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

tao meng updated HUDI-2214:
    Description: expanded with the clustering reproduction steps and the residual temp-dir paths (was: residual temporary files after clustering are not cleaned up)