[jira] [Updated] (HUDI-5250) Using the default value of estimate record size at the averageBytesPerRecord() when estimation threshold is less than 0
[ https://issues.apache.org/jira/browse/HUDI-5250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-5250:
---------------------------------
    Labels: pull-request-available  (was: )

> Using the default value of estimate record size at the
> averageBytesPerRecord() when estimation threshold is less than 0
> ----------------------------------------------------------------
>
>                 Key: HUDI-5250
>                 URL: https://issues.apache.org/jira/browse/HUDI-5250
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: XixiHua
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, Hudi obtains the average record size from the records written
> during previous commits; it is used to estimate how many records to pack
> into one file (see UpsertPartitioner.averageBytesPerRecord()).
> However, we found that a single data file could grow to 600~700 MB while
> most other files stayed under 200 MB.
> * Reason
> ** The result of totalBytesWritten/totalRecordsWritten from the last
> commit can be very small; if the records in the next commit are much
> larger, the resulting data files become very large.
> * Proposed solutions
> ** Plan 1: Calculate avgSize over the past several commits instead of
> only the last one. However, getCommitMetadata is expensive, so this
> function might become slow; we did not choose this.
> ** Plan 2: Use the estimated record size, since our data size is roughly
> fixed. Moreover, the Hudi community discouraged adding another boolean
> variable to control whether to use the last commit's avgSize, so we reuse
> the estimation threshold: when it is less than 0, we use the default
> estimated record size.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
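The Plan 2 fallback described above can be sketched as follows. This is a minimal illustration, not the actual Hudi implementation: the class name `RecordSizeEstimator` and the default of 1024 bytes are assumptions for the example; the real logic lives in UpsertPartitioner.averageBytesPerRecord().

```java
// Hypothetical sketch of the Plan 2 fallback (names and default are assumed,
// not taken from the actual Hudi code).
public class RecordSizeEstimator {
    // Assumed default estimated record size in bytes.
    static final long DEFAULT_ESTIMATED_RECORD_SIZE = 1024;

    static long averageBytesPerRecord(long totalBytesWritten,
                                      long totalRecordsWritten,
                                      double estimationThreshold) {
        // Plan 2: a negative threshold disables commit-based estimation
        // and falls back to the default estimate.
        if (estimationThreshold < 0) {
            return DEFAULT_ESTIMATED_RECORD_SIZE;
        }
        // Otherwise use the average observed in the previous commit,
        // never returning less than 1 byte per record.
        if (totalRecordsWritten > 0) {
            return Math.max(totalBytesWritten / totalRecordsWritten, 1);
        }
        // No records written yet: fall back to the default as well.
        return DEFAULT_ESTIMATED_RECORD_SIZE;
    }
}
```

With this shape, no extra boolean flag is needed: setting the existing threshold below 0 selects the default estimate, while any non-negative threshold keeps the current commit-based behavior.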