[jira] [Updated] (HUDI-2214) residual temporary files after clustering are not cleaned up

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2214:

Fix Version/s: (was: 0.10.0)
   0.9.0

> residual temporary files after clustering are not cleaned up
> 
>
> Key: HUDI-2214
> URL: https://issues.apache.org/jira/browse/HUDI-2214
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Cleaner
>Affects Versions: 0.8.0
> Environment: spark3.1.1
> hadoop3.1.1
>Reporter: tao meng
>Assignee: tao meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> residual temporary files after clustering are not cleaned up
> // test steps
> step 1: run clustering
> val records1 = recordsToStrings(dataGen.generateInserts("001", 1000)).toList
> val inputDF1: Dataset[Row] = spark.read.json(spark.sparkContext.parallelize(records1, 2))
> inputDF1.write.format("org.apache.hudi")
>   .options(commonOpts)
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY.key(), DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY.key(), DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
>   // options for clustering
>   .option("hoodie.parquet.small.file.limit", "0")
>   .option("hoodie.clustering.inline", "true")
>   .option("hoodie.clustering.inline.max.commits", "1")
>   .option("hoodie.clustering.plan.strategy.small.file.limit", "629145600")
>   .option("hoodie.clustering.plan.strategy.max.bytes.per.group", Long.MaxValue.toString)
>   .option("hoodie.clustering.plan.strategy.target.file.max.bytes", String.valueOf(12 * 1024 * 1024L))
>   .option("hoodie.clustering.plan.strategy.sort.columns", "begin_lat, begin_lon")
>   .mode(SaveMode.Overwrite)
>   .save(basePath)
> step 2: check the temp dir; /tmp/junit1835474867260509758/dataset/.hoodie/.temp/ is not empty:
> /tmp/junit1835474867260509758/dataset/.hoodie/.temp/20210723171208 is not cleaned up.
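Step 2 above can be automated. The sketch below is not from the report: it is a minimal check, under the assumption that the repro writes to the local file system (as the `/tmp/junit...` path suggests), that `.hoodie/.temp` under the table base path contains no leftover instant-named folders such as `20210723171208`. The object name `TempDirCheck` and the helper `residualFiles` are hypothetical.

```scala
import java.nio.file.Paths

object TempDirCheck {
  // List entries left under <basePath>/.hoodie/.temp; an empty list means
  // the clustering temp folder was cleaned up (or never created).
  def residualFiles(basePath: String): List[String] = {
    val tempDir = Paths.get(basePath, ".hoodie", ".temp").toFile
    if (!tempDir.exists()) Nil
    else Option(tempDir.list()).map(_.toList).getOrElse(Nil)
  }

  def main(args: Array[String]): Unit = {
    // On the affected version this still lists an instant-named subfolder.
    val leftover = residualFiles("/tmp/junit1835474867260509758/dataset")
    assert(leftover.isEmpty, s"residual temp files after clustering: ${leftover.mkString(", ")}")
  }
}
```

For a table on HDFS or object storage, the same check would go through Hadoop's `FileSystem` API instead of `java.nio`.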



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2214) residual temporary files after clustering are not cleaned up

2021-07-30 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2214:

Status: In Progress  (was: Open)

> residual temporary files after clustering are not cleaned up





[jira] [Updated] (HUDI-2214) residual temporary files after clustering are not cleaned up

2021-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-2214:
-
Labels: pull-request-available  (was: )

> residual temporary files after clustering are not cleaned up





[jira] [Updated] (HUDI-2214) residual temporary files after clustering are not cleaned up

2021-07-23 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-2214:
---
Description: 
residual temporary files after clustering are not cleaned up

// test steps

step 1: run clustering

val records1 = recordsToStrings(dataGen.generateInserts("001", 1000)).toList
val inputDF1: Dataset[Row] = spark.read.json(spark.sparkContext.parallelize(records1, 2))
inputDF1.write.format("org.apache.hudi")
  .options(commonOpts)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY.key(), DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY.key(), DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
  // options for clustering
  .option("hoodie.parquet.small.file.limit", "0")
  .option("hoodie.clustering.inline", "true")
  .option("hoodie.clustering.inline.max.commits", "1")
  .option("hoodie.clustering.plan.strategy.small.file.limit", "629145600")
  .option("hoodie.clustering.plan.strategy.max.bytes.per.group", Long.MaxValue.toString)
  .option("hoodie.clustering.plan.strategy.target.file.max.bytes", String.valueOf(12 * 1024 * 1024L))
  .option("hoodie.clustering.plan.strategy.sort.columns", "begin_lat, begin_lon")
  .mode(SaveMode.Overwrite)
  .save(basePath)

step 2: check the temp dir; /tmp/junit1835474867260509758/dataset/.hoodie/.temp/ is not empty:

/tmp/junit1835474867260509758/dataset/.hoodie/.temp/20210723171208 is not cleaned up.

  was:
residual temporary files after clustering are not cleaned up

> residual temporary files after clustering are not cleaned up


