[jira] [Assigned] (HUDI-2003) Auto Compute Compression ratio for input data to output parquet/orc file size
[ https://issues.apache.org/jira/browse/HUDI-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu reassigned HUDI-2003:

    Assignee:     (was: Raymond Xu)

> Auto Compute Compression ratio for input data to output parquet/orc file size
> -----------------------------------------------------------------------------
>
>                 Key: HUDI-2003
>                 URL: https://issues.apache.org/jira/browse/HUDI-2003
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: writer-core
>            Reporter: Vinay
>            Priority: Major
>              Labels: user-support-issues
>             Fix For: 0.12.0
>
>
> Context:
> Submitted a Spark job to read 3-4B ORC records and write them to Hudi format.
> The following table lists all the runs I carried out with different options:
>
> ||CONFIG||Number of Files Created||Size of each file||
> |PARQUET_FILE_MAX_BYTES=DEFAULT|30K|21MB|
> |PARQUET_FILE_MAX_BYTES=1GB|3700|178MB|
> |PARQUET_FILE_MAX_BYTES=1GB
> COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE=110|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=1GB
> BULKINSERT_PARALLELISM=100|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=4GB|1600|675MB|
> |PARQUET_FILE_MAX_BYTES=6GB|669|1012MB|
>
> Based on these runs, it feels that the compression ratio is off.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
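The reporter's remark that the compression ratio "is off" can be made concrete with a little arithmetic on the table above. The sketch below is only an illustration, not anything taken from Hudi itself: it divides each observed file size by the configured PARQUET_FILE_MAX_BYTES for the three rows that state an explicit limit. The helper name `fill_ratio` is hypothetical, and 1GB = 1024MB is an assumption, since the issue does not say which convention was used.

```python
# Rows from the issue's table that state an explicit limit:
# (configured PARQUET_FILE_MAX_BYTES in MB, observed file size in MB).
# Assumption: 1GB = 1024MB; the issue does not specify the convention.
RUNS = [
    (1 * 1024, 178),
    (4 * 1024, 675),
    (6 * 1024, 1012),
]


def fill_ratio(configured_mb, observed_mb):
    """Fraction of the configured size limit that the written file reached.

    Hypothetical helper for illustration only; not part of any Hudi API.
    """
    return observed_mb / configured_mb


ratios = [fill_ratio(configured, observed) for configured, observed in RUNS]

for (configured, observed), ratio in zip(RUNS, ratios):
    # 1 / ratio is how many times larger the limit is than the actual file.
    print(
        f"limit={configured}MB observed={observed}MB "
        f"fill={ratio:.3f} limit/observed={1 / ratio:.1f}x"
    )
```

Under these assumptions every run lands at roughly 16-17% of the configured limit, so the files are about six times smaller than requested. That is consistent with the writer's size estimate being off by a roughly constant factor, which is what the issue title asks to have computed automatically.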
[jira] [Assigned] (HUDI-2003) Auto Compute Compression ratio for input data to output parquet/orc file size
[ https://issues.apache.org/jira/browse/HUDI-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu reassigned HUDI-2003:

    Assignee: Raymond Xu  (was: Forward Xu)

> Auto Compute Compression ratio for input data to output parquet/orc file size
> -----------------------------------------------------------------------------
>
>                 Key: HUDI-2003
>                 URL: https://issues.apache.org/jira/browse/HUDI-2003
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: writer-core
>            Reporter: Vinay
>            Assignee: Raymond Xu
>            Priority: Blocker
>              Labels: user-support-issues
>             Fix For: 0.11.0

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Assigned] (HUDI-2003) Auto Compute Compression ratio for input data to output parquet/orc file size
[ https://issues.apache.org/jira/browse/HUDI-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu reassigned HUDI-2003:

    Assignee: Forward Xu  (was: Alexey Kudinkin)

> Auto Compute Compression ratio for input data to output parquet/orc file size
> -----------------------------------------------------------------------------
>
>                 Key: HUDI-2003
>                 URL: https://issues.apache.org/jira/browse/HUDI-2003
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: writer-core
>            Reporter: Vinay
>            Assignee: Forward Xu
>            Priority: Blocker
>              Labels: sev:high, user-support-issues
>             Fix For: 0.11.0

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Assigned] (HUDI-2003) Auto Compute Compression ratio for input data to output parquet/orc file size
[ https://issues.apache.org/jira/browse/HUDI-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar reassigned HUDI-2003:

    Assignee: Alexey Kudinkin

> Auto Compute Compression ratio for input data to output parquet/orc file size
> -----------------------------------------------------------------------------
>
>                 Key: HUDI-2003
>                 URL: https://issues.apache.org/jira/browse/HUDI-2003
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Writer Core
>            Reporter: Vinay
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: user-support-issues
>             Fix For: 0.11.0

--
This message was sent by Atlassian Jira
(v8.20.1#820001)