[jira] [Updated] (HUDI-7877) Add record position to record index metadata payload
[ https://issues.apache.org/jira/browse/HUDI-7877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated HUDI-7877: -- Status: Patch Available (was: In Progress) > Add record position to record index metadata payload > > > Key: HUDI-7877 > URL: https://issues.apache.org/jira/browse/HUDI-7877 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Lokesh Jain >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > RLI should save the record position so that it can be used in the index lookup. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7841) RLI and secondary index should consider only pruned partitions for file skipping
[ https://issues.apache.org/jira/browse/HUDI-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated HUDI-7841: -- Status: Patch Available (was: In Progress) > RLI and secondary index should consider only pruned partitions for file > skipping > > > Key: HUDI-7841 > URL: https://issues.apache.org/jira/browse/HUDI-7841 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Sagar Sumit >Assignee: Lokesh Jain >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Even though RLI scans only matching files, it tries to get those candidate > files by iterating over all files from file index. See - > [https://github.com/apache/hudi/blob/f4be74c29471fbd6afff472f8db292e6b1f16f05/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala#L47] > Instead, it can use the `prunedPartitionsAndFileSlices` to only consider > pruned partitions whenever there is a partition predicate. -- This message was sent by Atlassian Jira (v8.20.10#820010)
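The idea in HUDI-7841 can be sketched in plain Java (for illustration only — the real code in RecordLevelIndexSupport.scala is Scala, and the names below are hypothetical stand-ins, not Hudi's actual API): instead of iterating all files from the file index, intersect the RLI-matched files with the files that survive partition pruning.

```java
import java.util.*;
import java.util.stream.Collectors;

public class PrunedPartitionFileSkipping {

    // Hypothetical stand-in for prunedPartitionsAndFileSlices:
    // partition path -> file names under that partition.
    static Set<String> candidateFiles(Map<String, List<String>> prunedPartitionsAndFiles,
                                      Set<String> rliMatchedFiles) {
        // Only files that are both under a pruned partition AND matched by the
        // record-level index lookup remain candidates, so files outside the
        // partition predicate are never even considered.
        return prunedPartitionsAndFiles.values().stream()
                .flatMap(List::stream)
                .filter(rliMatchedFiles::contains)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        Map<String, List<String>> pruned = new HashMap<>();
        pruned.put("2024/06/17", Arrays.asList("f1.parquet", "f2.parquet"));
        // f9.parquet matches the RLI but lives outside the pruned partitions,
        // so it is skipped.
        Set<String> rli = new HashSet<>(Arrays.asList("f2.parquet", "f9.parquet"));
        System.out.println(candidateFiles(pruned, rli));
    }
}
```

Whenever there is no partition predicate, the map would simply contain all partitions, degenerating to the current behavior.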
[jira] [Updated] (HUDI-7877) Add record position to record index metadata payload
[ https://issues.apache.org/jira/browse/HUDI-7877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated HUDI-7877: -- Status: In Progress (was: Open) > Add record position to record index metadata payload > > > Key: HUDI-7877 > URL: https://issues.apache.org/jira/browse/HUDI-7877 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Lokesh Jain >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > RLI should save the record position so that it can be used in the index lookup. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7905) New Action for Clustering
[ https://issues.apache.org/jira/browse/HUDI-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated HUDI-7905: -- Status: In Progress (was: Open) > New Action for Clustering > - > > Key: HUDI-7905 > URL: https://issues.apache.org/jira/browse/HUDI-7905 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Lokesh Jain >Priority: Major > Fix For: 1.0.0 > > > Currently, we use replacecommit for clustering, insert overwrite and delete > partition. Clustering should be a separate action. This simplifies a few > things, such as not needing to scan the replacecommit.requested to determine > whether we are looking at a clustering plan. This also standardizes the usage > of replacecommit to some extent (related to HUDI-1739). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7841) RLI and secondary index should consider only pruned partitions for file skipping
[ https://issues.apache.org/jira/browse/HUDI-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated HUDI-7841: -- Status: In Progress (was: Open) > RLI and secondary index should consider only pruned partitions for file > skipping > > > Key: HUDI-7841 > URL: https://issues.apache.org/jira/browse/HUDI-7841 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Sagar Sumit >Assignee: Lokesh Jain >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Even though RLI scans only matching files, it tries to get those candidate > files by iterating over all files from file index. See - > [https://github.com/apache/hudi/blob/f4be74c29471fbd6afff472f8db292e6b1f16f05/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala#L47] > Instead, it can use the `prunedPartitionsAndFileSlices` to only consider > pruned partitions whenever there is a partition predicate. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7907) Validate new file slicing on table with mix of older and new log files
[ https://issues.apache.org/jira/browse/HUDI-7907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856090#comment-17856090 ] Danny Chen commented on HUDI-7907: -- The file slicing is based on log file completion time, but the log file naming convention has changed. As we discussed, we should do a full compaction before the upgrade, right? It is not a wise choice to keep compatibility for log file name resolving, because that is a hot-spot code path. > Validate new file slicing on table with mix of older and new log files > -- > > Key: HUDI-7907 > URL: https://issues.apache.org/jira/browse/HUDI-7907 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Danny Chen >Priority: Major > Fix For: 1.0.0 > > > Log files naming has changed i.e. now we have deltacommit time instead of > base commit time in the log file name. Could there be an edge case that file > slicing could be incorrect if we have a mix of older and new log files within > the same filegroup. Because the `HoodieLogFile#getDeltaCommitTime` will point > to base commit time for older log files, while for newer ones it will point > to deltacommit times. Writes are still serialized because new deltacommit > times must be > base commit time, but we need to test the scenario fully. -- This message was sent by Atlassian Jira (v8.20.10#820010)
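The serialization argument in HUDI-7907 can be illustrated with a small sketch (hypothetical model, not Hudi's actual `HoodieLogFile` API): each log file carries an instant time embedded in its name — the base commit time for pre-1.0 files, the deltacommit time for new files. Because Hudi instant times are sortable timestamp strings and every new deltacommit time is strictly greater than the base commit time, ordering by the embedded time remains consistent even in a mixed file group.

```java
import java.util.*;

public class MixedLogFileOrdering {

    // Hypothetical stand-in for HoodieLogFile: the instant time is what
    // getDeltaCommitTime() would return when parsed from the file name.
    record LogFile(String name, String instantTime) {}

    static List<LogFile> sortForSlicing(List<LogFile> logFiles) {
        List<LogFile> sorted = new ArrayList<>(logFiles);
        // Stable lexicographic sort on instant time: old files (base commit
        // time) always sort before new files (deltacommit time) in the same
        // file group, since deltacommit times must be > base commit time.
        sorted.sort(Comparator.comparing(LogFile::instantTime));
        return sorted;
    }

    public static void main(String[] args) {
        List<LogFile> mixed = List.of(
            new LogFile(".f1.log.2", "20240601101010"),  // old naming: base commit time
            new LogFile(".f1.log.3", "20240618121212"),  // new naming: deltacommit time
            new LogFile(".f1.log.1", "20240601101010")); // old naming
        System.out.println(sortForSlicing(mixed));
    }
}
```

The edge case the ticket asks to validate is exactly the tie between old files sharing the same base commit time, where a secondary ordering (e.g. the log file version) would be needed.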
[jira] [Created] (HUDI-7908) HoodieFileGroupReader fails if precombine and partition fields are same
Sagar Sumit created HUDI-7908:
-
Summary: HoodieFileGroupReader fails if precombine and partition fields are same
Key: HUDI-7908
URL: https://issues.apache.org/jira/browse/HUDI-7908
Project: Apache Hudi
Issue Type: Bug
Reporter: Sagar Sumit
Fix For: 1.0.0

{code:java}
test(s"Test INSERT INTO with upsert operation type") {
  if (HoodieSparkUtils.gteqSpark3_2) {
    withTempDir { tmp =>
      Seq("mor").foreach { tableType =>
        val tableName = generateTableName
        spark.sql(
          s"""
             |create table $tableName (
             |  id int,
             |  name string,
             |  ts long,
             |  price int
             |) using hudi
             |partitioned by (ts)
             |tblproperties (
             |  type = '$tableType',
             |  primaryKey = 'id',
             |  preCombineField = 'ts'
             |)
             |location '${tmp.getCanonicalPath}/$tableName'
             |""".stripMargin
        )
        // Test insert into with upsert operation type
        spark.sql(
          s"""
             | insert into $tableName
             | values (1, 'a1', 1000, 10), (2, 'a2', 2000, 20), (3, 'a3', 3000, 30), (4, 'a4', 2000, 10), (5, 'a5', 3000, 20), (6, 'a6', 4000, 30)
             | """.stripMargin
        )
        checkAnswer(s"select id, name, price, ts from $tableName where price>3000")(
          Seq(6, "a6", 4000, 30)
        )
        // Test update
        spark.sql(s"update $tableName set price = price + 1 where id = 6")
        checkAnswer(s"select id, name, price, ts from $tableName where price>3000")(
          Seq(6, "a6", 4001, 30)
        )
      }
    }
  }
}
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0
[ https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7882: -- Description:
We wanted to support reading 1.x tables in 0.16.0 release. So, creating this umbrella ticket to track all of them.

Changes required to be ported:
0. Creating 0.16.0 branch
0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed.
1. Timeline
1.a Hoodie instant parsing should be able to read 1.x instants. https://issues.apache.org/jira/browse/HUDI-7883 Sagar.
1.b Commit metadata parsing is able to handle both json and avro formats. Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 Siva.
1.c HoodieDefaultTimeline able to read both timelines based on table version. https://issues.apache.org/jira/browse/HUDI-7884 Siva.
1.d Reading LSM timeline using 0.16.0 https://issues.apache.org/jira/browse/HUDI-7890 Siva.
1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901
2. Table property changes
2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885 https://issues.apache.org/jira/browse/HUDI-7865 LJ
3. MDT table changes
3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ
3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 LJ
4. Log format changes
4.a All metadata header types porting https://issues.apache.org/jira/browse/HUDI-7887 Jon
4.b Meaningful error for incompatible features from 1.x https://issues.apache.org/jira/browse/HUDI-7888 Jon
5. Log file slice or grouping detection compatibility
5. Tests
5.a Tests to validate that 1.x tables can be read w/ 0.16.0 https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar.
6 Doc changes
6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. https://issues.apache.org/jira/browse/HUDI-7889

was:
We wanted to support reading 1.x tables in 0.16.0 release. So, creating this umbrella ticket to track all of them.

Changes required to be ported:
0. Creating 0.16.0 branch
0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed.
1. Timeline
1.a Hoodie instant parsing should be able to read 1.x instants. https://issues.apache.org/jira/browse/HUDI-7883 Sagar.
1.b Commit metadata parsing is able to handle both json and avro formats. Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 Siva.
1.c HoodieDefaultTimeline able to read both timelines based on table version. https://issues.apache.org/jira/browse/HUDI-7884 Siva.
1.d Reading LSM timeline using 0.16.0 https://issues.apache.org/jira/browse/HUDI-7890 Siva.
1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901
2. Table property changes
2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885 https://issues.apache.org/jira/browse/HUDI-7865 LJ
3. MDT table changes
3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ
3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 LJ
4. Log format changes
4.a All metadata header types porting https://issues.apache.org/jira/browse/HUDI-7887 Jon
4.b Meaningful error for incompatible features from 1.x https://issues.apache.org/jira/browse/HUDI-7888 Jon
5. Tests
5.a Tests to validate that 1.x tables can be read w/ 0.16.0 https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar.
6 Doc changes
6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. https://issues.apache.org/jira/browse/HUDI-7889

> Umbrella ticket to track all changes required to support reading 1.x tables
> with 0.16.0
>
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
> Issue Type: Improvement
> Components: reader-core
> Reporter: sivabalan narayanan
> Priority: Major
>
> We wanted to support reading 1.x tables in 0.16.0 release. So, creating this
> umbrella ticket to track all of them.
>
> Changes required to be ported:
> 0. Creating 0.16.0 branch
> 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed.
>
> 1. Timeline
> 1.a Hoodie instant parsing should be able to read 1.x instants.
> https://issues.apache.org/jira/browse/HUDI-7883 Sagar.
> 1.b Commit metadata parsing is able to handle both json and avro formats.
> Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866
> Siva.
> 1.c HoodieDefaultTimeline able to read both timelines based on table version.
> https://issues.apache.org/jira/browse/HUDI-7884 Siva.
> 1.d Reading LSM timeline using 0.16.0
> https://issues.apache.org/jira/browse/HUDI-7890 Siva.
> 1.e Ensure 1.0 MDT timeline is readable by
[jira] [Updated] (HUDI-7420) Parallelize the process of constructing `logFilesMarkerPath` in CommitMetadatautils#reconcileMetadataForMissingFiles
[ https://issues.apache.org/jira/browse/HUDI-7420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7420: -- Sprint: Sprint 2024-03-25, Sprint 2024-04-26, 2024/06/03-16 (was: Sprint 2024-03-25, Sprint 2024-04-26, 2024/06/17-30, 2024/06/03-16)
> Parallelize the process of constructing `logFilesMarkerPath` in
> CommitMetadatautils#reconcileMetadataForMissingFiles
>
>
> Key: HUDI-7420
> URL: https://issues.apache.org/jira/browse/HUDI-7420
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Sagar Sumit
> Assignee: Sagar Sumit
> Priority: Major
> Fix For: 0.16.0, 1.0.0
>
>
> This is related to HUDI-1517.
> Current logic is:
> {code:java}
> Set<String> logFilesMarkerPath = new HashSet<>();
> allLogFilesMarkerPath.stream().filter(logFilePath -> !logFilePath.endsWith("cdc")).forEach(logFilesMarkerPath::add);
> // remove valid log files
> // TODO: refactor based on HoodieData
> for (Map.Entry<String, List<HoodieWriteStat>> partitionAndWriteStats : commitMetadata.getPartitionToWriteStats().entrySet()) {
>   for (HoodieWriteStat hoodieWriteStat : partitionAndWriteStats.getValue()) {
>     logFilesMarkerPath.remove(hoodieWriteStat.getPath());
>   }
> } {code}
> The for loop can be achieved via context.parallelize as below, but need to
> check for thread-safety.
> {code:java}
> Set<String> logFilesMarkerPath = new HashSet<>();
> allLogFilesMarkerPath.stream().filter(logFilePath -> !logFilePath.endsWith("cdc")).forEach(logFilesMarkerPath::add);
> // Convert the partition and write stats to a list of log file paths to be removed
> List<String> validLogFilePaths = context.parallelize(new ArrayList<>(commitMetadata.getPartitionToWriteStats().entrySet()))
>     .flatMapToPair((SerializablePairFunction<Map.Entry<String, List<HoodieWriteStat>>, String, Void>) entry -> {
>       List<Pair<String, Void>> pathsToRemove = new ArrayList<>();
>       entry.getValue().forEach(hoodieWriteStat -> pathsToRemove.add(Pair.of(hoodieWriteStat.getPath(), null)));
>       return pathsToRemove.iterator();
>     })
>     .map(t -> t.getLeft())
>     .collect();
> // Remove the valid log file paths from logFilesMarkerPath in a parallel manner
> // Depending on the specifics of your environment and HoodieEngineContext, this might need to be adapted.
> // For a straightforward approach without parallelization of the remove operation:
> validLogFilePaths.forEach(logFilesMarkerPath::remove); {code}
> -- This message was sent by Atlassian Jira (v8.20.10#820010)
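An engine-free sketch of the same idea, using Java parallel streams in place of `HoodieEngineContext#parallelize` (assumption: `WriteStat` below is a minimal stand-in for `HoodieWriteStat`, modeling only the path accessor):

```java
import java.util.*;
import java.util.stream.Collectors;

public class ReconcileMarkerPaths {

    // Minimal stand-in for HoodieWriteStat; only the path is needed here.
    record WriteStat(String path) {}

    static Set<String> reconcile(Set<String> allLogFilesMarkerPath,
                                 Map<String, List<WriteStat>> partitionToWriteStats) {
        // Drop CDC log files, as in the ticket's current logic.
        Set<String> logFilesMarkerPath = allLogFilesMarkerPath.stream()
            .filter(p -> !p.endsWith("cdc"))
            .collect(Collectors.toCollection(HashSet::new));

        // Parallel flat-map over partitions, mirroring the proposed
        // context.parallelize(...).flatMapToPair(...).map(...).collect() chain.
        // Collecting into a new set avoids mutating shared state from worker threads.
        Set<String> validLogFilePaths = partitionToWriteStats.values().parallelStream()
            .flatMap(List::stream)
            .map(WriteStat::path)
            .collect(Collectors.toSet());

        // The removal runs single-threaded on the driver, so the plain HashSet
        // is safe — this sidesteps the thread-safety concern raised in the ticket.
        logFilesMarkerPath.removeAll(validLogFilePaths);
        return logFilesMarkerPath;
    }
}
```

The design point: parallelize only the read-side flat-map and keep the mutation of `logFilesMarkerPath` sequential, so no concurrent writes touch the `HashSet`.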
Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]
VitoMakarevich commented on PR #11461: URL: https://github.com/apache/hudi/pull/11461#issuecomment-2177158089 @yihua you can merge if you plan so -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]
VitoMakarevich closed pull request #11465: [HUDI-7874] Avro fix read 2 level and 3 level files URL: https://github.com/apache/hudi/pull/11465 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0
[ https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7882: -- Description:
We wanted to support reading 1.x tables in 0.16.0 release. So, creating this umbrella ticket to track all of them.

Changes required to be ported:
0. Creating 0.16.0 branch
0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed.
1. Timeline
1.a Hoodie instant parsing should be able to read 1.x instants. https://issues.apache.org/jira/browse/HUDI-7883 Sagar.
1.b Commit metadata parsing is able to handle both json and avro formats. Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 Siva.
1.c HoodieDefaultTimeline able to read both timelines based on table version. https://issues.apache.org/jira/browse/HUDI-7884 Siva.
1.d Reading LSM timeline using 0.16.0 https://issues.apache.org/jira/browse/HUDI-7890 Siva.
1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901
2. Table property changes
2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885 https://issues.apache.org/jira/browse/HUDI-7865 LJ
3. MDT table changes
3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ
3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 LJ
4. Log format changes
4.a All metadata header types porting https://issues.apache.org/jira/browse/HUDI-7887 Jon
4.b Meaningful error for incompatible features from 1.x https://issues.apache.org/jira/browse/HUDI-7888 Jon
5. Tests
5.a Tests to validate that 1.x tables can be read w/ 0.16.0 https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar.
6 Doc changes
6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. https://issues.apache.org/jira/browse/HUDI-7889

was:
We wanted to support reading 1.x tables in 0.16.0 release. So, creating this umbrella ticket to track all of them.

Changes required to be ported:
0. Creating 0.16.0 branch
0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed.
1. Timeline
1.a Commit instant parsing should be able to read 1.x instants. https://issues.apache.org/jira/browse/HUDI-7883 Sagar.
1.b Commit metadata parsing is able to handle both json and avro formats. Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 Siva.
1.c HoodieDefaultTimeline able to read both timelines based on table version. https://issues.apache.org/jira/browse/HUDI-7884 Siva.
1.d Reading LSM timeline using 0.16.0 https://issues.apache.org/jira/browse/HUDI-7890 Siva.
1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901
2. Table property changes
2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885 https://issues.apache.org/jira/browse/HUDI-7865 LJ
3. MDT table changes
3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ
3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 LJ
4. Log format changes
4.a All metadata header types porting https://issues.apache.org/jira/browse/HUDI-7887 Jon
4.b Meaningful error for incompatible features from 1.x https://issues.apache.org/jira/browse/HUDI-7888 Jon
5. Tests
5.a Tests to validate that 1.x tables can be read w/ 0.16.0 https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar.
6 Doc changes
6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. https://issues.apache.org/jira/browse/HUDI-7889

> Umbrella ticket to track all changes required to support reading 1.x tables
> with 0.16.0
>
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
> Issue Type: Improvement
> Components: reader-core
> Reporter: sivabalan narayanan
> Priority: Major
>
> We wanted to support reading 1.x tables in 0.16.0 release. So, creating this
> umbrella ticket to track all of them.
>
> Changes required to be ported:
> 0. Creating 0.16.0 branch
> 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed.
>
> 1. Timeline
> 1.a Hoodie instant parsing should be able to read 1.x instants.
> https://issues.apache.org/jira/browse/HUDI-7883 Sagar.
> 1.b Commit metadata parsing is able to handle both json and avro formats.
> Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866
> Siva.
> 1.c HoodieDefaultTimeline able to read both timelines based on table version.
> https://issues.apache.org/jira/browse/HUDI-7884 Siva.
> 1.d Reading LSM timeline using 0.16.0
> https://issues.apache.org/jira/browse/HUDI-7890 Siva.
> 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901
>
> 2. Table property changes
> 2.a Ta
[jira] [Updated] (HUDI-7904) RLI not skipping data files
[ https://issues.apache.org/jira/browse/HUDI-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7904: -- Sprint: 2024/06/17-30 > RLI not skipping data files > --- > > Key: HUDI-7904 > URL: https://issues.apache.org/jira/browse/HUDI-7904 > Project: Apache Hudi > Issue Type: Bug >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Blocker > Fix For: 1.0.0-beta2, 1.0.0 > > Attachments: image (9).png > > > Enable RLI on record key field and do a query with record key field equals > predicate: > SELECT id, rider, driver FROM hudi_table WHERE id = 'trip1'; > Spark UI still shows 4 files scanned, however as per the predicate only 1 > file qualifies. > !image (9).png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7905) New Action for Clustering
[ https://issues.apache.org/jira/browse/HUDI-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7905: -- Sprint: 2024/06/17-30 > New Action for Clustering > - > > Key: HUDI-7905 > URL: https://issues.apache.org/jira/browse/HUDI-7905 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Lokesh Jain >Priority: Major > Fix For: 1.0.0 > > > Currently, we use replacecommit for clustering, insert overwrite and delete > partition. Clustering should be a separate action. This simplifies a few > things such as we do not need to scan the replacecommit.requested to > determine whether we are looking at clustering plan or not. This also > standardizes the usage of replacecommit to some extent (related to HUDI-1739). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7907) Validate new file slicing on table with mix of older and new log files
[ https://issues.apache.org/jira/browse/HUDI-7907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7907: -- Sprint: 2024/06/17-30 > Validate new file slicing on table with mix of older and new log files > -- > > Key: HUDI-7907 > URL: https://issues.apache.org/jira/browse/HUDI-7907 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Danny Chen >Priority: Major > Fix For: 1.0.0 > > > Log files naming has changed i.e. now we have deltacommit time instead of > base commit time in the log file name. Could there be an edge case that file > slicing could be incorrect if we have a mix of older and new log files within > the same filegroup. Because the `HoodieLogFile#getDeltaCommitTime` will point > to base commit time for older log files, while for newer ones it will point > to deltacommit times. Writes are still serialized because new deltacommit > times must be > base commit time, but we need to test the scenario fully. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7903) Partition Stats Index not getting created with SQL
[ https://issues.apache.org/jira/browse/HUDI-7903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7903: -- Sprint: 2024/06/17-30
> Partition Stats Index not getting created with SQL
> --
>
> Key: HUDI-7903
> URL: https://issues.apache.org/jira/browse/HUDI-7903
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Sagar Sumit
> Assignee: Sagar Sumit
> Priority: Blocker
> Fix For: 1.0.0-beta2, 1.0.0
>
>
> {code:java}
> spark.sql(
>   s"""
>      | create table $tableName using hudi
>      | partitioned by (dt)
>      | tblproperties(
>      |   primaryKey = 'id',
>      |   preCombineField = 'ts',
>      |   'hoodie.metadata.index.partition.stats.enable' = 'true'
>      | )
>      | location '$tablePath'
>      | AS
>      | select 1 as id, 'a1' as name, 10 as price, 1000 as ts, cast('2021-05-06' as date) as dt
>      """.stripMargin
> ) {code}
> Even when partition stats is enabled, index is not created with SQL. Works
> for datasource. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7907) Validate new file slicing on table with mix of older and new log files
[ https://issues.apache.org/jira/browse/HUDI-7907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7907: -- Fix Version/s: 1.0.0 > Validate new file slicing on table with mix of older and new log files > -- > > Key: HUDI-7907 > URL: https://issues.apache.org/jira/browse/HUDI-7907 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Danny Chen >Priority: Major > Fix For: 1.0.0 > > > Log files naming has changed i.e. now we have deltacommit time instead of > base commit time in the log file name. Could there be an edge case that file > slicing could be incorrect if we have a mix of older and new log files within > the same filegroup. Because the `HoodieLogFile#getDeltaCommitTime` will point > to base commit time for older log files, while for newer ones it will point > to deltacommit times. Writes are still serialized because new deltacommit > times must be > base commit time, but we need to test the scenario fully. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7907) Validate new file slicing on table with mix of older and new log files
Sagar Sumit created HUDI-7907: - Summary: Validate new file slicing on table with mix of older and new log files Key: HUDI-7907 URL: https://issues.apache.org/jira/browse/HUDI-7907 Project: Apache Hudi Issue Type: Task Reporter: Sagar Sumit Assignee: Danny Chen Log file naming has changed, i.e., we now have the deltacommit time instead of the base commit time in the log file name. Could there be an edge case where file slicing is incorrect if we have a mix of older and new log files within the same filegroup? Because `HoodieLogFile#getDeltaCommitTime` will point to the base commit time for older log files, while for newer ones it will point to deltacommit times. Writes are still serialized because new deltacommit times must be > base commit time, but we need to test the scenario fully. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-7877] Add record position to record index metadata payload [hudi]
codope commented on code in PR #11467: URL: https://github.com/apache/hudi/pull/11467#discussion_r1645029522 ## hudi-common/src/main/avro/HoodieMetadata.avsc: ## @@ -427,6 +427,15 @@ "type": "int", "default": 0, "doc": "Represents fileId encoding. Possible values are 0 and 1. O represents UUID based fileID, and 1 represents raw string format of the fileId. \nWhen the encoding is 0, reader can deduce fileID from fileIdLowBits, fileIdHighBits and fileIndex." +}, +{ +"name": "position", +"type": [ +"null", +"long" +], +"default": null, Review Comment: Should we instead have a default of `-1L`? cc @yihua @nsivabalan That's what we use in `HoodieRecordLocation` - https://github.com/apache/hudi/blob/9f0130442a502bff6d6f7a649a7808a03d51da41/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordLocation.java#L46 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java: ## @@ -156,6 +156,7 @@ public class HoodieMetadataPayload implements HoodieRecordPayload createRecordIndexUpdate(String fileIndex, "", instantTimeMillis, - 0)); + 0, + null)); Review Comment: note: might change if we decide to use -1 as default -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
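The trade-off being discussed — a nullable Avro union default versus a `-1L` sentinel — can be sketched as follows (illustrative only; `INVALID_POSITION = -1L` mirrors the convention the review comment attributes to `HoodieRecordLocation`, and `hasValidPosition` is a hypothetical helper, not Hudi code):

```java
public class RecordPositionDefault {

    // Sentinel for "no position recorded", matching the -1L convention
    // referenced in the review comment.
    static final long INVALID_POSITION = -1L;

    // With a ["null","long"] union defaulting to null, readers must branch on
    // null; with a -1L default, a single sentinel check suffices and stays
    // consistent with the existing record-location convention.
    static boolean hasValidPosition(Long position) {
        return position != null && position != INVALID_POSITION;
    }
}
```

Either way, readers end up guarding the lookup path; the sentinel just keeps one convention across the payload and `HoodieRecordLocation`.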
Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]
hudi-bot commented on PR #11465: URL: https://github.com/apache/hudi/pull/11465#issuecomment-2176896279 ## CI report: * 753044eb444c4f022f7bd5045a2e6df8d7dda4b0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24452) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]
hudi-bot commented on PR #11461: URL: https://github.com/apache/hudi/pull/11461#issuecomment-2176806759 ## CI report: * 16afb3d821f1fd35beff26f697016826bcf55491 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24451)
Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]
hudi-bot commented on PR #11465: URL: https://github.com/apache/hudi/pull/11465#issuecomment-2176794046 ## CI report: * 753044eb444c4f022f7bd5045a2e6df8d7dda4b0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24452)
Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]
hudi-bot commented on PR #11461: URL: https://github.com/apache/hudi/pull/11461#issuecomment-2176793952 ## CI report: * 16afb3d821f1fd35beff26f697016826bcf55491 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24451)
Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]
hudi-bot commented on PR #11465: URL: https://github.com/apache/hudi/pull/11465#issuecomment-2176780305 ## CI report: * 753044eb444c4f022f7bd5045a2e6df8d7dda4b0 UNKNOWN
Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]
hudi-bot commented on PR #11461: URL: https://github.com/apache/hudi/pull/11461#issuecomment-2176780178

## CI report:
* 16afb3d821f1fd35beff26f697016826bcf55491 UNKNOWN
(hudi) branch master updated: [HUDI-7874] Fix Hudi being able to read 2-level structure (#11450)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 9f0130442a5  [HUDI-7874] Fix Hudi being able to read 2-level structure (#11450)

9f0130442a5 is described below

commit 9f0130442a502bff6d6f7a649a7808a03d51da41
Author: Vitali Makarevich
AuthorDate: Tue Jun 18 20:32:35 2024 +0200

    [HUDI-7874] Fix Hudi being able to read 2-level structure (#11450)

    Co-authored-by: vmakarevich
---
 .../hudi/io/hadoop/HoodieAvroParquetReader.java    |   2 +-
 .../apache/parquet/avro/HoodieAvroReadSupport.java |  23 +-
 .../hudi/TestParquetReaderCompatibility.scala      | 325 +
 3 files changed, 344 insertions(+), 6 deletions(-)

diff --git a/hudi-hadoop-common/src/main/java/org/apache/hudi/io/hadoop/HoodieAvroParquetReader.java b/hudi-hadoop-common/src/main/java/org/apache/hudi/io/hadoop/HoodieAvroParquetReader.java
index dfbf4801687..bf1e4218364 100644
--- a/hudi-hadoop-common/src/main/java/org/apache/hudi/io/hadoop/HoodieAvroParquetReader.java
+++ b/hudi-hadoop-common/src/main/java/org/apache/hudi/io/hadoop/HoodieAvroParquetReader.java
@@ -166,7 +166,7 @@ public class HoodieAvroParquetReader extends HoodieAvroFileReader {
     // NOTE: We have to set both Avro read-schema and projection schema to make
     // sure that in case the file-schema is not equal to read-schema we'd still
     // be able to read that file (in case projection is a proper one)
-    Configuration hadoopConf = storage.getConf().unwrapAs(Configuration.class);
+    Configuration hadoopConf = storage.getConf().unwrapCopyAs(Configuration.class);
     if (!requestedSchema.isPresent()) {
       AvroReadSupport.setAvroReadSchema(hadoopConf, schema);
       AvroReadSupport.setRequestedProjection(hadoopConf, schema);
diff --git a/hudi-hadoop-common/src/main/java/org/apache/parquet/avro/HoodieAvroReadSupport.java b/hudi-hadoop-common/src/main/java/org/apache/parquet/avro/HoodieAvroReadSupport.java
index 326accb66b2..07015209435 100644
--- a/hudi-hadoop-common/src/main/java/org/apache/parquet/avro/HoodieAvroReadSupport.java
+++ b/hudi-hadoop-common/src/main/java/org/apache/parquet/avro/HoodieAvroReadSupport.java
@@ -46,11 +46,7 @@ public class HoodieAvroReadSupport extends AvroReadSupport {
   @Override
   public ReadContext init(Configuration configuration, Map keyValueMetaData, MessageType fileSchema) {
     boolean legacyMode = checkLegacyMode(fileSchema.getFields());
-    // support non-legacy list
-    if (!legacyMode && configuration.get(AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE) == null) {
-      configuration.set(AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE,
-          "false", "support reading avro from non-legacy map/list in parquet file");
-    }
+    adjustConfToReadWithFileProduceMode(legacyMode, configuration);
     ReadContext readContext = super.init(configuration, keyValueMetaData, fileSchema);
     MessageType requestedSchema = readContext.getRequestedSchema();
     // support non-legacy map. Convert non-legacy map to legacy map
@@ -62,6 +58,23 @@ public class HoodieAvroReadSupport extends AvroReadSupport {
     return new ReadContext(requestedSchema, readContext.getReadSupportMetadata());
   }

+  /**
+   * Here we want set config with which file has been written.
+   * Even though user may have overwritten {@link AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE},
+   * it's only applicable to how to produce new files(here is a read path).
+   * Later the config value {@link AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE} will still be used
+   * to write new file according to the user preferences.
+   **/
+  private void adjustConfToReadWithFileProduceMode(boolean isLegacyModeWrittenFile, Configuration configuration) {
+    if (isLegacyModeWrittenFile) {
+      configuration.set(AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE,
+          "true", "support reading avro from legacy map/list in parquet file");
+    } else {
+      configuration.set(AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE,
+          "false", "support reading avro from non-legacy map/list in parquet file");
+    }
+  }
+
   /**
    * Check whether write map/list with legacy mode.
    * legacy:
diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestParquetReaderCompatibility.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestParquetReaderCompatibility.scala
new file mode 100644
index 000..c5f91657f12
--- /dev/null
+++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestParquetReaderCompatibility.scala
@@ -0,0 +1,325 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses th
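The commit above hinges on Parquet's two list encodings: legacy 2-level files (written with `WRITE_OLD_LIST_STRUCTURE=true`) and standard 3-level files. As a deliberately simplified sketch (not the real parquet-mr `MessageType` API; it only models the naming convention from the Parquet LogicalTypes backward-compatibility rules, where the legacy repeated field is named `array` and the standard encoding nests a `list` group holding `element`), the read-path decision can be illustrated as:

```java
public class ListEncodingCheck {
  // Legacy 2-level encoding (parquet-avro, WRITE_OLD_LIST_STRUCTURE=true):
  //   optional group tags (LIST) { repeated binary array; }
  // Standard 3-level encoding (WRITE_OLD_LIST_STRUCTURE=false):
  //   optional group tags (LIST) { repeated group list { optional binary element; } }
  // Simplified check: the name of the repeated field tells the two apart.
  static boolean isLegacyList(String repeatedFieldName) {
    return !"list".equals(repeatedFieldName);
  }

  // Mirrors the idea of adjustConfToReadWithFileProduceMode: the reader must
  // be configured to match how the *file* was produced, regardless of how the
  // user wants *new* files written.
  static String readModeFor(String repeatedFieldName) {
    return isLegacyList(repeatedFieldName) ? "true" : "false"; // value for WRITE_OLD_LIST_STRUCTURE on the read path
  }

  public static void main(String[] args) {
    System.out.println(isLegacyList("array")); // legacy 2-level file
    System.out.println(isLegacyList("list"));  // standard 3-level file
  }
}
```

The point of the fix is exactly this decoupling: the reader derives the list structure from the file's own schema, while the user-facing `WRITE_OLD_LIST_STRUCTURE` preference still governs newly written files.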
Re: [PR] [HUDI-7874] Fix Hudi being able to read 2-level structure [hudi]
yihua merged PR #11450: URL: https://github.com/apache/hudi/pull/11450
Re: [PR] add show create table command [hudi]
hudi-bot commented on PR #11471: URL: https://github.com/apache/hudi/pull/11471#issuecomment-2176612950

## CI report:
* c472e2ad91d62204b38aa15d92fe60ca528b6275 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24455)
Re: [PR] add show create table command [hudi]
hudi-bot commented on PR #11471: URL: https://github.com/apache/hudi/pull/11471#issuecomment-2176599395

## CI report:
* c472e2ad91d62204b38aa15d92fe60ca528b6275 UNKNOWN
Re: [PR] [HUDI-7906] improve the parallelism deduce in rdd write [hudi]
hudi-bot commented on PR #11470: URL: https://github.com/apache/hudi/pull/11470#issuecomment-2176584962

## CI report:
* c4bf9390e01ce6cca6627d5d8c592413121386c2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24453)
[PR] add show create table command [hudi]
houyuting opened a new pull request, #11471: URL: https://github.com/apache/hudi/pull/11471

### Change Logs

add show create table command feature

### Impact

_Describe any public API or user-facing feature change or any performance impact._

### Risk level (write none, low medium or high below)

_If medium or high, explain what verification was done to mitigate the risks._

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none"._

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
(hudi) branch master updated: [HUDI-7876] use properties to store log spill map configs for fg reader (#11455)
This is an automated email from the ASF dual-hosted git repository.

jonvex pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new bf1df335442  [HUDI-7876] use properties to store log spill map configs for fg reader (#11455)

bf1df335442 is described below

commit bf1df335442d38932cf7f8c6ef4228c316278569
Author: Jon Vexler
AuthorDate: Tue Jun 18 12:30:56 2024 -0400

    [HUDI-7876] use properties to store log spill map configs for fg reader (#11455)

    * use properties to store log spill map configs for fg reader
    * use constant for the max buffer size
    * rename payloadProps to props

    Co-authored-by: Jonathan Vexler <=>
---
 .../read/HoodieBaseFileGroupRecordBuffer.java      | 46 +-
 .../common/table/read/HoodieFileGroupReader.java   | 13 ++
 .../read/HoodieKeyBasedFileGroupRecordBuffer.java  | 10 +
 .../HoodiePositionBasedFileGroupRecordBuffer.java  | 10 +
 .../read/HoodieUnmergedFileGroupRecordBuffer.java  | 10 +
 .../table/read/TestHoodieFileGroupReaderBase.java  |  9 +++--
 .../reader/HoodieFileGroupReaderTestUtils.java     | 12 +++---
 .../HoodieFileGroupReaderBasedRecordReader.java    | 24 +--
 ...odieFileGroupReaderBasedParquetFileFormat.scala | 15 +++
 ...stHoodiePositionBasedFileGroupRecordBuffer.java | 11 +++---
 10 files changed, 68 insertions(+), 92 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java
index 88ec42673ac..aea50e44fbe 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java
@@ -29,6 +29,7 @@ import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.common.table.log.KeySpec;
 import org.apache.hudi.common.table.log.block.HoodieDataBlock;
 import org.apache.hudi.common.util.DefaultSizeEstimator;
+import org.apache.hudi.common.util.FileIOUtils;
 import org.apache.hudi.common.util.HoodieRecordSizeEstimator;
 import org.apache.hudi.common.util.InternalSchemaCache;
 import org.apache.hudi.common.util.Option;
@@ -50,9 +51,14 @@ import java.io.IOException;
 import java.io.Serializable;
 import java.util.Collections;
 import java.util.Iterator;
+import java.util.Locale;
 import java.util.Map;
 import java.util.function.Function;

+import static org.apache.hudi.common.config.HoodieCommonConfig.DISK_MAP_BITCASK_COMPRESSION_ENABLED;
+import static org.apache.hudi.common.config.HoodieCommonConfig.SPILLABLE_DISK_MAP_TYPE;
+import static org.apache.hudi.common.config.HoodieMemoryConfig.MAX_MEMORY_FOR_MERGE;
+import static org.apache.hudi.common.config.HoodieMemoryConfig.SPILLABLE_MAP_BASE_PATH;
 import static org.apache.hudi.common.engine.HoodieReaderContext.INTERNAL_META_SCHEMA;
 import static org.apache.hudi.common.table.log.block.HoodieLogBlock.HeaderMetadataType.INSTANT_TIME;
 import static org.apache.hudi.common.table.read.HoodieFileGroupReader.getRecordMergeMode;
@@ -64,7 +70,7 @@ public abstract class HoodieBaseFileGroupRecordBuffer implements HoodieFileGr
   protected final Option partitionPathFieldOpt;
   protected final RecordMergeMode recordMergeMode;
   protected final HoodieRecordMerger recordMerger;
-  protected final TypedProperties payloadProps;
+  protected final TypedProperties props;
   protected final ExternalSpillableMap, Map>> records;
   protected ClosableIterator baseFileIterator;
   protected Iterator, Map>> logRecordIterator;
@@ -78,24 +84,26 @@ public abstract class HoodieBaseFileGroupRecordBuffer implements HoodieFileGr
                                          Option partitionNameOverrideOpt,
                                          Option partitionPathFieldOpt,
                                          HoodieRecordMerger recordMerger,
-                                         TypedProperties payloadProps,
-                                         long maxMemorySizeInBytes,
-                                         String spillableMapBasePath,
-                                         ExternalSpillableMap.DiskMapType diskMapType,
-                                         boolean isBitCaskDiskMapCompressionEnabled) {
+                                         TypedProperties props) {
     this.readerContext = readerContext;
     this.readerSchema = readerContext.getSchemaHandler().getRequiredSchema();
     this.partitionNameOverrideOpt = partitionNameOverrideOpt;
     this.partitionPathFieldOpt = partitionPathFieldOpt;
-    this.recordMergeMode = getRecordMergeMode(payloadProps);
+    this.recordMergeMode = getRecordMergeMode(props);
     this.recordMerger = recordMerger;
     //Custom merge
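The commit above collapses several spillable-map constructor parameters into a single properties bag, resolving each value with a default at the point of use. A rough sketch of that pattern using plain `java.util.Properties` (the key names and defaults here are illustrative assumptions; Hudi's real definitions live in `HoodieMemoryConfig` and `HoodieCommonConfig`):

```java
import java.util.Properties;

public class SpillConfigSketch {
  // Assumed key names for illustration only.
  static final String MAX_MEMORY_KEY = "hoodie.memory.merge.max.size";
  static final String BASE_PATH_KEY = "hoodie.memory.spillable.map.path";

  // Resolve the merge-memory budget from the bag, falling back to 1 GiB.
  static long maxMemoryForMerge(Properties props) {
    return Long.parseLong(props.getProperty(MAX_MEMORY_KEY, String.valueOf(1024L * 1024 * 1024)));
  }

  // Resolve the spill directory, falling back to /tmp/.
  static String spillableMapBasePath(Properties props) {
    return props.getProperty(BASE_PATH_KEY, "/tmp/");
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty(MAX_MEMORY_KEY, "1048576"); // caller overrides only one knob
    System.out.println(maxMemoryForMerge(props));    // overridden value
    System.out.println(spillableMapBasePath(props)); // default applies
  }
}
```

The design win is the one visible in the diff: adding a new spill-map setting no longer means changing the signature of every record-buffer subclass, only reading one more key from the shared bag.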
Re: [PR] [HUDI-7876] use properties to store log spill map configs for fg reader [hudi]
jonvex merged PR #11455: URL: https://github.com/apache/hudi/pull/11455
Re: [PR] [HUDI-7906] improve the parallelism deduce in rdd write [hudi]
hudi-bot commented on PR #11470: URL: https://github.com/apache/hudi/pull/11470#issuecomment-2176505908

## CI report:
* c4bf9390e01ce6cca6627d5d8c592413121386c2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24453)
Re: [PR] [HUDI-7906] improve the parallelism deduce in rdd write [hudi]
hudi-bot commented on PR #11470: URL: https://github.com/apache/hudi/pull/11470#issuecomment-2176490891

## CI report:
* c4bf9390e01ce6cca6627d5d8c592413121386c2 UNKNOWN
Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]
hudi-bot commented on PR #11465: URL: https://github.com/apache/hudi/pull/11465#issuecomment-2176490732

## CI report:
* 753044eb444c4f022f7bd5045a2e6df8d7dda4b0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24452)
Re: [I] [SUPPORT] Caused by: java.lang.ClassNotFoundException: org.apache.hudi.DefaultSource after hudi upgraded to 6.15 [hudi]
soumilshah1995 commented on issue #11469: URL: https://github.com/apache/hudi/issues/11469#issuecomment-2176470646

The error indicates that Spark cannot find the Hudi data source (org.apache.hudi.DefaultSource), which typically means the required Hudi jar is not properly included or recognized by Spark. Please ensure you are using the right jar files for your version of Spark.
[jira] [Updated] (HUDI-7906) improve the parallelism deduce in rdd write
[ https://issues.apache.org/jira/browse/HUDI-7906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7906: - Labels: pull-request-available (was: )

> improve the parallelism deduce in rdd write
> Key: HUDI-7906
> URL: https://issues.apache.org/jira/browse/HUDI-7906
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: KnightChess
> Assignee: KnightChess
> Priority: Major
> Labels: pull-request-available
>
> As [https://github.com/apache/hudi/issues/11274] and [https://github.com/apache/hudi/pull/11463] describe, there are two problems:
> # if the RDD is an input RDD without a shuffle, the partition number can be too big or too small
> # users cannot easily control it:
> ## in some cases a user can set `spark.default.parallelism` to change it
> ## in other cases it cannot be changed because the value is hard-coded
> ## in Spark, the preferred way is for `spark.default.parallelism` or `spark.sql.shuffle.partitions` to control it, with anything else treated as an advanced Hudi-specific setting

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7906] improve the parallelism deduce in rdd write [hudi]
KnightChess opened a new pull request, #11470: URL: https://github.com/apache/hudi/pull/11470

### Change Logs

As https://github.com/apache/hudi/issues/11274 and https://github.com/apache/hudi/pull/11463 describe, there are two problems:
- if the RDD is an input RDD without a shuffle, the partition number can be too big or too small
- users cannot easily control it:
  - in some cases a user can set `spark.default.parallelism` to change it
  - in other cases it cannot be changed because the value is hard-coded
  - in Spark, the preferred way is for `spark.default.parallelism` or `spark.sql.shuffle.partitions` to control it, with anything else treated as an advanced Hudi-specific setting

### Impact

Like dedup, which uses the new deduction logic, users can use `spark.sql.shuffle.partitions` or `spark.default.parallelism` to control the parallelism. For special scenarios, the advanced params can still be used.

### Risk level (write none, low medium or high below)

low

### Documentation Update

None

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[jira] [Created] (HUDI-7906) improve the parallelism deduce in rdd write
KnightChess created HUDI-7906: - Summary: improve the parallelism deduce in rdd write Key: HUDI-7906 URL: https://issues.apache.org/jira/browse/HUDI-7906 Project: Apache Hudi Issue Type: Improvement Reporter: KnightChess Assignee: KnightChess

As [https://github.com/apache/hudi/issues/11274] and [https://github.com/apache/hudi/pull/11463] describe, there are two problems:
# if the RDD is an input RDD without a shuffle, the partition number can be too big or too small
# users cannot easily control it:
## in some cases a user can set `spark.default.parallelism` to change it
## in other cases it cannot be changed because the value is hard-coded
## in Spark, the preferred way is for `spark.default.parallelism` or `spark.sql.shuffle.partitions` to control it, with anything else treated as an advanced Hudi-specific setting
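The deduction order proposed in the ticket can be sketched as follows. This is a hypothetical helper, not actual Hudi code, and the Hudi-level key name is an invented placeholder: an explicit override wins, then Spark's standard knobs, then the incoming RDD's own partition count as the fallback.

```java
import java.util.Map;

public class ParallelismDeduce {
  static int deduce(Map<String, String> conf, int inputPartitions) {
    String[] keys = {
        "hoodie.write.shuffle.parallelism", // hypothetical Hudi-level override
        "spark.sql.shuffle.partitions",     // standard Spark SQL knob
        "spark.default.parallelism"         // standard Spark core knob
    };
    for (String key : keys) {
      String v = conf.get(key);
      if (v != null) {
        return Integer.parseInt(v);
      }
    }
    // No shuffle parallelism configured: keep the input RDD's partitioning.
    return inputPartitions;
  }

  public static void main(String[] args) {
    System.out.println(deduce(Map.of("spark.default.parallelism", "200"), 8)); // 200
    System.out.println(deduce(Map.of(), 8)); // 8
  }
}
```

This matches the ticket's intent: the Spark-standard configs become the primary control surface, and a Hudi-specific setting remains only as an advanced escape hatch instead of a hard-coded value.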
Re: [PR] [HUDI-7876] use properties to store log spill map configs for fg reader [hudi]
hudi-bot commented on PR #11455: URL: https://github.com/apache/hudi/pull/11455#issuecomment-2176358198

## CI report:
* 17dfb2b314e57e251901611d90d35231c701f167 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24436)
Re: [I] [SUPPORT] Caused by: org.apache.hudi.exception.HoodieException: Executor executes action [commits the instant 20240618064120870] error [hudi]
ankit0811 commented on issue #11466: URL: https://github.com/apache/hudi/issues/11466#issuecomment-2176336217

Hmm, I didn't find any relevant errors in the TM logs. I changed IGNORE_KEY to true and it seems to be working, but I don't see any data in the parquet files; they are all empty. Any idea how I should debug this further?
Re: [PR] [HUDI-7877] Add record position to record index metadata payload [hudi]
hudi-bot commented on PR #11467: URL: https://github.com/apache/hudi/pull/11467#issuecomment-2176339072

## CI report:
* a7f2ee34e98381ad9afa7c6dfa634aace8b3546b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24450)
Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]
hudi-bot commented on PR #11461: URL: https://github.com/apache/hudi/pull/11461#issuecomment-2176338861

## CI report:
* 16afb3d821f1fd35beff26f697016826bcf55491 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24451)
Re: [PR] [HUDI-7876] use properties to store log spill map configs for fg reader [hudi]
hudi-bot commented on PR #11455: URL: https://github.com/apache/hudi/pull/11455#issuecomment-2176338743

## CI report:
* 17dfb2b314e57e251901611d90d35231c701f167 UNKNOWN
Re: [PR] [HUDI-7876] use properties to store log spill map configs for fg reader [hudi]
jonvex commented on PR #11455: URL: https://github.com/apache/hudi/pull/11455#issuecomment-2176309813

![image](https://github.com/apache/hudi/assets/26940621/bd4f9a9f-6fce-4c5f-872c-ce08db9654a6)

CI passing
Re: [I] [SUPPORT] Caused by: java.lang.ClassNotFoundException: org.apache.hudi.DefaultSource after hudi upgraded to 6.15 [hudi]
ROOBALJINDAL commented on issue #11469: URL: https://github.com/apache/hudi/issues/11469#issuecomment-2176309379

@soumilshah1995 thanks for replying. I know how to use the streamer on EMR Serverless and don't need a tutorial. Can you please help me with this particular exception?
Re: [PR] [HUDI-7874] Fix Hudi being able to read 2-level structure [hudi]
VitoMakarevich commented on PR #11450: URL: https://github.com/apache/hudi/pull/11450#issuecomment-2176266302

> Also I see that this block
>
> ```
> if (!legacyMode) {
>   requestedSchema = new MessageType(requestedSchema.getName(), convertLegacyMap(requestedSchema.getFields()));
> }
> ```
>
> is redundant - since in all the cases `requestedSchema` fetched after tuning AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE equals the schema coming from this block - meaning AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE does the correct conversion. So this code block should be removed to not cause confusion.

Sorry - this is incorrect; it looks like Spark 3.1 fails without this block.
Re: [PR] [HUDI-7874] Fix Hudi being able to read 2-level structure [hudi]
hudi-bot commented on PR #11450: URL: https://github.com/apache/hudi/pull/11450#issuecomment-2176234693

## CI report:
* 5f876303cc2ed8f203f8db9f3dea972e3a28f0b7 UNKNOWN
* c814458b48d8a33a2b5ebbb0355183e129be89f4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24449)
Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]
hudi-bot commented on PR #11465: URL: https://github.com/apache/hudi/pull/11465#issuecomment-2176195199

## CI report:
* 529cf1aad669ec04f018d0ad0f176e7aebd42bf7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24447)
* 753044eb444c4f022f7bd5045a2e6df8d7dda4b0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24452)
Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]
hudi-bot commented on PR #11461: URL: https://github.com/apache/hudi/pull/11461#issuecomment-2176195058

## CI report:
* d538bb2c8d4ba5a8da23034338b080f33d132888 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24446)
* 16afb3d821f1fd35beff26f697016826bcf55491 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24451)
Re: [PR] [HUDI-7874] Fix Hudi being able to read 2-level structure [hudi]
hudi-bot commented on PR #11450: URL: https://github.com/apache/hudi/pull/11450#issuecomment-2176194891

## CI report:
* 5f876303cc2ed8f203f8db9f3dea972e3a28f0b7 UNKNOWN
* 0dbffbf92f2cb18861621be3e216e65a03129cf6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24445)
* c814458b48d8a33a2b5ebbb0355183e129be89f4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24449)
Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]
hudi-bot commented on PR #11465: URL: https://github.com/apache/hudi/pull/11465#issuecomment-2176093947

## CI report:
* 529cf1aad669ec04f018d0ad0f176e7aebd42bf7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24447)
* 753044eb444c4f022f7bd5045a2e6df8d7dda4b0 UNKNOWN
Re: [PR] [HUDI-7874] Fix Hudi being able to read 2-level structure [hudi]
hudi-bot commented on PR #11450: URL: https://github.com/apache/hudi/pull/11450#issuecomment-2176093552
## CI report:
* 5f876303cc2ed8f203f8db9f3dea972e3a28f0b7 UNKNOWN
* 0dbffbf92f2cb18861621be3e216e65a03129cf6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24445)
* c814458b48d8a33a2b5ebbb0355183e129be89f4 UNKNOWN
Re: [PR] [HUDI-7877] Add record position to record index metadata payload [hudi]
hudi-bot commented on PR #11467: URL: https://github.com/apache/hudi/pull/11467#issuecomment-2176094037
## CI report:
* df180b77664cb45e434dec4982d31ff70e3dac3c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24448)
* a7f2ee34e98381ad9afa7c6dfa634aace8b3546b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24450)
Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]
hudi-bot commented on PR #11461: URL: https://github.com/apache/hudi/pull/11461#issuecomment-2176093779
## CI report:
* d538bb2c8d4ba5a8da23034338b080f33d132888 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24446)
* 16afb3d821f1fd35beff26f697016826bcf55491 UNKNOWN
Re: [PR] [HUDI-7877] Add record position to record index metadata payload [hudi]
hudi-bot commented on PR #11467: URL: https://github.com/apache/hudi/pull/11467#issuecomment-2176075195
## CI report:
* df180b77664cb45e434dec4982d31ff70e3dac3c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24448)
* a7f2ee34e98381ad9afa7c6dfa634aace8b3546b UNKNOWN
Re: [I] [SUPPORT] Caused by: java.lang.ClassNotFoundException: org.apache.hudi.DefaultSource after hudi upgraded to 6.15 [hudi]
soumilshah1995 commented on issue #11469: URL: https://github.com/apache/hudi/issues/11469#issuecomment-2176057015 For EMR, here is a guide to follow: https://youtu.be/jvbHUl9A4tQ?si=l7AdUR4vmr_5sDIq Running Apache Hudi Delta Streamer On EMR Serverless — hands-on lab, step-by-step guide for beginners ![1](https://user-images.githubusercontent.com/39345855/229940404-f3efeaae-6e5b-446b-a229-b1fb86e4ea2b.JPG)

## Video based guide
* https://www.youtube.com/watch?v=jvbHUl9A4tQ&feature=youtu.be

# Steps

## Step 1: Download the sample Parquet files from the links
* https://drive.google.com/drive/folders/1BwNEK649hErbsWcYLZhqCWnaXFX3mIsg?usp=share_link

Upload to the S3 folder as shown in the diagram ![image](https://user-images.githubusercontent.com/39345855/229939875-6f2f22ae-c792-4904-bcf8-b1e53ce1e122.png)

## Step 2: Start EMR Serverless Cluster
![image](https://user-images.githubusercontent.com/39345855/229940052-29f6e2a8-9568-4100-8a1b-e988c405f505.png) ![image](https://user-images.githubusercontent.com/39345855/229940099-cf002f04-18f8-4d26-8d89-d512e96bef76.png) ![image](https://user-images.githubusercontent.com/39345855/229940131-836414cf-a85f-4b9f-b1d6-c36115d335c2.png)

# Step 3: Run Python code to submit the job
* Please change and edit the variables

```
try:
    import json
    import uuid
    import os
    import boto3
    from dotenv import load_dotenv

    load_dotenv(".env")
except Exception as e:
    pass

global AWS_ACCESS_KEY
global AWS_SECRET_KEY
global AWS_REGION_NAME

AWS_ACCESS_KEY = os.getenv("DEV_ACCESS_KEY")
AWS_SECRET_KEY = os.getenv("DEV_SECRET_KEY")
AWS_REGION_NAME = os.getenv("DEV_REGION")

client = boto3.client(
    "emr-serverless",
    aws_access_key_id=AWS_ACCESS_KEY,
    aws_secret_access_key=AWS_SECRET_KEY,
    region_name=AWS_REGION_NAME,
)


def lambda_handler_test_emr(event, context):
    # -- Hudi settings --
    glue_db = "hudi_db"
    table_name = "invoice"
    op = "UPSERT"
    table_type = "COPY_ON_WRITE"
    record_key = 'invoiceid'
    precombine = "replicadmstimestamp"
    partition_feild = 'destinationstate'
    source_ordering_field = 'replicadmstimestamp'
    delta_streamer_source = 's3:///raw'
    hudi_target_path = 's3://X/hudi'

    # -- EMR --
    ApplicationId = "XXX"
    ExecutionTime = 600
    ExecutionArn = "XX"
    JobName = 'delta_streamer_{}'.format(table_name)

    spark_submit_parameters = ' --conf spark.jars=/usr/lib/hudi/hudi-utilities-bundle.jar'
    spark_submit_parameters += ' --conf spark.serializer=org.apache.spark.serializer.KryoSerializer'
    spark_submit_parameters += ' --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
    spark_submit_parameters += ' --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
    spark_submit_parameters += ' --conf spark.sql.hive.convertMetastoreParquet=false'
    spark_submit_parameters += ' --conf mapreduce.fileoutputcommitter.marksuccessfuljobs=false'
    spark_submit_parameters += ' --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'
    spark_submit_parameters += ' --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer'

    arguments = [
        "--table-type", table_type,
        "--op", op,
        "--enable-sync",
        "--source-ordering-field", source_ordering_field,
        "--source-class", "org.apache.hudi.utilities.sources.ParquetDFSSource",
        "--target-table", table_name,
        "--target-base-path", hudi_target_path,
        "--payload-class", "org.apache.hudi.common.model.AWSDmsAvroPayload",
        "--hoodie-conf", "hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator",
        "--hoodie-conf", "hoodie.datasource.write.recordkey.field={}".format(record_key),
        "--hoodie-conf", "hoodie.datasource.write.partitionpath.field={}".format(partition_feild),
        "--hoodie-conf", "hoodie.deltastreamer.source.dfs.root={}".format(delta_streamer_source),
        "--hoodie-conf", "hoodie.datasource.write.precombine.field={}".format(precombine),
        "--hoodie-conf", "hoodie.database.name={}".format(glue_db),
        "--
```
Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]
hudi-bot commented on PR #11465: URL: https://github.com/apache/hudi/pull/11465#issuecomment-2176056872
## CI report:
* 529cf1aad669ec04f018d0ad0f176e7aebd42bf7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24447)
Re: [I] [SUPPORT] Data deduplication caused by drawback in the delete invalid files before commit [hudi]
nsivabalan commented on issue #11419: URL: https://github.com/apache/hudi/issues/11419#issuecomment-2176054381 Is the main reason that different file system schemes treat file-not-found differently during `fs.delete()`? And are you proposing `HoodieStorage#deleteFile` to unify that?
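The unification being asked about above can be illustrated with a small sketch (names and semantics here are hypothetical, not Hudi's actual API): a delete that absorbs file-not-found and reports it through the return value, so callers see identical behavior on every storage scheme.

```python
import os


def delete_file_idempotent(path: str) -> bool:
    """Delete `path`, returning True if it was removed and False if it
    was already absent. Because FileNotFoundError is absorbed here,
    callers behave the same whether the underlying file system raises
    on a missing file or silently returns false."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False
```

A second delete of the same path is then a harmless no-op rather than a scheme-dependent failure.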
Re: [PR] [HUDI-7877] Add record position to record index metadata payload [hudi]
hudi-bot commented on PR #11467: URL: https://github.com/apache/hudi/pull/11467#issuecomment-2175969354
## CI report:
* df180b77664cb45e434dec4982d31ff70e3dac3c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24448)
[I] [SUPPORT] Caused by: java.lang.ClassNotFoundException: org.apache.hudi.DefaultSource after hudi upgraded to 6.15 [hudi]
ROOBALJINDAL opened a new issue, #11469: URL: https://github.com/apache/hudi/issues/11469 **Describe the problem you faced** We are creating empty Hudi tables from Java as follows:
```
Dataset<Row> emptyDF = spark.createDataFrame(new ArrayList<Row>(), schemaStruct);
emptyDF.write()
    .format("org.apache.hudi")
    .options(tableConf.getHudiOptions())
    .mode(SaveMode.Append)
    .save();
```
Spark conf:
```
entryPoint: /hudi/hudi-addon-edfx.jar
sparkParamsArguments = ["--class com.edifecs.em.cloud.hudi.setup.PreCreateEmptyTablesInHudi", "--conf spark.jars=/usr/lib/hudi/hudi-utilities-bundle.jar", "--conf spark.executor.instances=0", "--conf spark.executor.memory=4g", "--conf spark.driver.memory=4g", "--conf spark.driver.cores=4", "--conf spark.dynamicAllocation.initialExecutors=1"
```
This used to work fine but suddenly stopped working after Hudi was upgraded from 0.13.1 to 0.14.0 (EMR upgraded from 6.12 to 6.15). I referred to a similar issue: https://github.com/apache/hudi/issues/2997 I also added hudi-spark3-bundle_2.12-0.14.0.jar to spark.jars but it didn't work. I don't know why it is not able to find this class. **Environment Description** * Hudi version : 0.14.0 * AWS EMR version : 6.15 **Stacktrace** ```24/06/18 12:02:18 ERROR PreCreateEmptyTablesInHudi: Exception encountered while generating table ehcpencountererror : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: org.apache.hudi. Please find packages at `https://spark.apache.org/third-party-projects.html`.
at org.apache.spark.sql.errors.QueryExecutionErrors$.dataSourceNotFoundError(QueryExecutionErrors.scala:739) ~[spark-catalyst_2.12-3.4.1-amzn-2.jar:3.4.1-amzn-2]
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:647) ~[spark-sql_2.12-3.4.1-amzn-2.jar:3.4.1-amzn-2]
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:697) ~[spark-sql_2.12-3.4.1-amzn-2.jar:3.4.1-amzn-2]
at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:860) ~[spark-sql_2.12-3.4.1-amzn-2.jar:3.4.1-amzn-2]
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256) ~[spark-sql_2.12-3.4.1-amzn-2.jar:3.4.1-amzn-2]
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:247) ~[spark-sql_2.12-3.4.1-amzn-2.jar:3.4.1-amzn-2]
at com.edifecs.em.cloud.hudi.setup.PreCreateEmptyTablesInHudi.lambda$main$0(PreCreateEmptyTablesInHudi.java:170) ~[?:?]
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) ~[?:?]
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625) ~[?:?]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290) ~[?:?]
at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:754) ~[?:?]
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373) ~[?:?]
at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182) ~[?:?]
at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655) ~[?:?]
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622) ~[?:?]
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165) ~[?:?]
Caused by: java.lang.ClassNotFoundException: org.apache.hudi.DefaultSource
at jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641) ~[?:?]
at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) ~[?:?]
at java.lang.ClassLoader.loadClass(ClassLoader.java:525) ~[?:?]
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:633) ~[spark-sql_2.12-3.4.1-amzn-2.jar:3.4.1-amzn-2]
at scala.util.Try$.apply(Try.scala:213) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:633) ~[spark-sql_2.12-3.4.1-amzn-2.jar:3.4.1-amzn-2]
at scala.util.Failure.orElse(Try.scala:224) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:633) ~[spark-sql_2.12-3.4.1-amzn-2.jar:3.4.1-amzn-2]
... 15 more```
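For context on the stack trace above: when `.format("org.apache.hudi")` is used, Spark's `DataSource.lookupDataSource` tries to load the provider class by the given name and then falls back to appending `.DefaultSource`, which is why the error names `org.apache.hudi.DefaultSource`. A simplified sketch of that name resolution (the real logic also consults `ServiceLoader` registrations and short aliases like `hudi`):

```python
def candidate_class_names(provider: str) -> list:
    """Class names Spark will attempt to load for a format string.
    If none of these is on the driver/executor classpath (e.g. the Hudi
    Spark bundle jar is missing), the ClassNotFoundException surfaces
    as DATA_SOURCE_NOT_FOUND."""
    return [provider, provider + ".DefaultSource"]
```

So the fix is to get the matching bundle jar onto the classpath of both driver and executors, not to change the format string.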
Re: [PR] [HUDI-7877] Add record position to record index metadata payload [hudi]
hudi-bot commented on PR #11467: URL: https://github.com/apache/hudi/pull/11467#issuecomment-2175953221
## CI report:
* df180b77664cb45e434dec4982d31ff70e3dac3c UNKNOWN
Re: [I] [SUPPORT] SqlQueryBasedTransformer new field issue with PostgresDebeziumSource [hudi]
soumilshah1995 commented on issue #11468: URL: https://github.com/apache/hudi/issues/11468#issuecomment-2175922997 Slack thread: https://apache-hudi.slack.com/archives/C4D716NPQ/p1718691646054409 This issue arises when attempting to use a transformer with `--source-class org.apache.hudi.utilities.sources.debezium.PostgresDebeziumSource --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer --hoodie-conf 'hoodie.deltastreamer.transformer.sql=SELECT * FROM '`. It throws an error even when using `SELECT *`.
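For reference, `SqlQueryBasedTransformer` registers the incoming batch as a temp view and substitutes its name for a `<SRC>` token in the configured SQL, so the query is expected to read `FROM <SRC>` rather than a literal table name. A rough sketch of that substitution (simplified; the view-name prefix here is illustrative):

```python
import uuid


def render_transformer_sql(configured_sql: str) -> str:
    """Replace the <SRC> placeholder with the name of the temp view
    registered for the current micro-batch. Queries that name a literal
    table instead of <SRC> resolve against nothing and fail."""
    tmp_view = "HOODIE_SRC_TMP_TABLE_" + uuid.uuid4().hex
    return configured_sql.replace("<SRC>", tmp_view)
```

With this in mind, a query like `SELECT *, extract(year from created_at) AS year FROM <SRC>` is the expected shape for the config above.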
[I] [SUPPORT] SqlQueryBasedTransformer new field issue with PostgresDebeziumSource [hudi]
ashwinagalcha-ps opened a new issue, #11468: URL: https://github.com/apache/hudi/issues/11468 When using Kafka + Debezium + Streamer, we are able to write data and the job works fine, but when using the SqlQueryBasedTransformer, it is able to write data on S3 with the new field but ultimately the job fails. Below are the Hudi Deltastreamer job configs: ```"--table-type", "COPY_ON_WRITE", "--source-class", "org.apache.hudi.utilities.sources.debezium.PostgresDebeziumSource", "--transformer-class", "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer", "--hoodie-conf", "hoodie.streamer.transformer.sql=SELECT *, extract(year from a.created_at) as year FROM a", "--source-ordering-field", output["source_ordering_field"], "--target-base-path", f"s3a://{env_params['deltastreamer_bucket']}/{db_name}/{schema}/{output['table_name']}/", "--target-table", output["table_name"], "--auto.offset.reset=earliest "--props", properties_file, "--payload-class", "org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload", "--enable-hive-sync", "--hoodie-conf", "hoodie.datasource.hive_sync.mode=hms", "--hoodie-conf", "hoodie.datasource.write.schema.allow.auto.evolution.column.drop=true", "--hoodie-conf", f"hoodie.deltastreamer.source.kafka.topic={connector_name}.{schema}.{output['table_name']}", "--hoodie-conf", f"schema.registry.url={env_params['schema_registry_url']}", "--hoodie-conf", f"hoodie.deltastreamer.schemaprovider.registry.url={env_params['schema_registry_url']}/subjects/{connector_name}.{schema}.{output['table_name']}-value/versions/latest", "--hoodie-conf", "hoodie.deltastreamer.source.kafka.value.deserializer.class=io.confluent.kafka.serializers.KafkaAvroDeserializer", "--hoodie-conf", "hoodie.datasource.hive_sync.use_jdbc=false", "--hoodie-conf", f"hoodie.datasource.hive_sync.database={output['hive_database']}", "--hoodie-conf", f"hoodie.datasource.hive_sync.table={output['table_name']}", "--hoodie-conf", "hoodie.datasource.hive_sync.metastore.uris=", 
"--hoodie-conf", "hoodie.datasource.hive_sync.enable=true", "--hoodie-conf", "hoodie.datasource.hive_sync.support_timestamp=true", "--hoodie-conf", "hoodie.deltastreamer.source.kafka.maxEvents=10", "--hoodie-conf", f"hoodie.datasource.write.recordkey.field={output['record_key']}", "--hoodie-conf", f"hoodie.datasource.write.precombine.field={output['precombine_field']}", "--hoodie-conf", f"hoodie.datasource.hive_sync.glue_database={output['hive_database']}", "--continuous"``` Properties file: ```bootstrap.servers= auto.offset.reset=earliest schema.registry.url=http://host:8081``` **Expected behavior**: To be able to extract a new field (year) in the target hudi table with the help of SqlQueryBasedTransformer. **Environment Description** * Hudi version : 0.14.0 * Spark version : 3.4.1 * Hadoop version : 3.3.4 * Storage (HDFS/S3/GCS..) : S3 * Running on Docker? (yes/no) : no * Base image & jars: `public.ecr.aws/ocean-spark/spark:platform-3.4.1-hadoop-3.3.4-java-11-scala-2.12-python-3.10-gen21` `https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.4-bundle_2.12/0.14.0/hudi-spark3.4-bundle_2.12-0.14.0.jar https://repo1.maven.org/maven2/org/apache/hudi/hudi-utilities-bundle_2.12/0.14.0/hudi-utilities-bundle_2.12-0.14.0.jar` **Stacktrace** ```2024-06-14T14:16:17.562738557Z 24/06/14 14:16:17 ERROR HoodieStreamer: Shutting down delta-sync due to exception 2024-06-14T14:16:17.562785897Z org.apache.hudi.utilities.exception.HoodieTransformExecutionException: Failed to apply sql query based transformer 2024-06-14T14:16:17.562797467Z at org.apache.hudi.utilities.transform.SqlQueryBasedTransformer.apply(SqlQueryBasedTransformer.java:68) 2024-06-14T14:16:17.562805097Z at org.apache.hudi.utilities.transform.ChainedTransformer.apply(ChainedTransformer.java:105) 2024-06-14T14:16:17.562812197Z at org.apache.hudi.utilities.streamer.StreamSync.lambda$fetchFromSource$0(StreamSync.java:530) 2024-06-14T14:16:17.562819517Z at org.apache.hudi.common.util.Option.map(Option.java:108) 
2024-06-14T14:16:17.562826327Z at org.apache.hudi.utilities.streamer.StreamSync.fetchFromSource(StreamSync.java:530) 2024-06-14T14:16:17.562836838Z at org.apache.hudi.utilities.streamer.StreamSync.readFromSource(StreamSync.java:495) 2024-06-14T14:16:17.562844648Z at org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:405) 2024-06-14T14:16:17.562852958Z at org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.lambda$startService$1(HoodieStreamer.java:757) 2024-06-14T14:16:17.562860358Z at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) 2024-06-14T14:16:17.562868059Z at j
[jira] [Updated] (HUDI-7877) Add record position to record index metadata payload
[ https://issues.apache.org/jira/browse/HUDI-7877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7877: - Labels: pull-request-available (was: ) > Add record position to record index metadata payload > > > Key: HUDI-7877 > URL: https://issues.apache.org/jira/browse/HUDI-7877 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Lokesh Jain >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > RLI should save the record position so that can be used in the index lookup. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7877] Add record position to record index metadata payload [hudi]
lokeshj1703 opened a new pull request, #11467: URL: https://github.com/apache/hudi/pull/11467 ### Change Logs RLI should save the record position so that it can be used in the index lookup. This PR adds a position field in the RLI metadata to track it. ### Impact NA ### Risk level (write none, low medium or high below) low ### Documentation Update NA ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]
hudi-bot commented on PR #11461: URL: https://github.com/apache/hudi/pull/11461#issuecomment-2175829345
## CI report:
* d538bb2c8d4ba5a8da23034338b080f33d132888 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24446)
Re: [PR] [HUDI-7874] Fix Hudi being able to read 2-level structure [hudi]
hudi-bot commented on PR #11450: URL: https://github.com/apache/hudi/pull/11450#issuecomment-2175829244
## CI report:
* 5f876303cc2ed8f203f8db9f3dea972e3a28f0b7 UNKNOWN
* 0dbffbf92f2cb18861621be3e216e65a03129cf6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24445)
Re: [PR] [HUDI-7874] Fix Hudi being able to read 2-level structure [hudi]
hudi-bot commented on PR #11450: URL: https://github.com/apache/hudi/pull/11450#issuecomment-2175726334
## CI report:
* 5f876303cc2ed8f203f8db9f3dea972e3a28f0b7 UNKNOWN
* 93ddf4151fedb0698e0c5c56d69b4b866626d393 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24440)
* 0dbffbf92f2cb18861621be3e216e65a03129cf6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24445)
Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]
hudi-bot commented on PR #11465: URL: https://github.com/apache/hudi/pull/11465#issuecomment-2175726559
## CI report:
* 6d56ddcb8eae62dcd16b180616269a59afa9df28 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24441)
* 4e0ba2e3bc68df9bb6b0de2a50f60ba86fa68508 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2)
* 529cf1aad669ec04f018d0ad0f176e7aebd42bf7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24447)
Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]
hudi-bot commented on PR #11461: URL: https://github.com/apache/hudi/pull/11461#issuecomment-2175726457
## CI report:
* d538bb2c8d4ba5a8da23034338b080f33d132888 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24446)
Re: [PR] [HUDI-7874] Fix Hudi being able to read 2-level structure [hudi]
hudi-bot commented on PR #11450: URL: https://github.com/apache/hudi/pull/11450#issuecomment-2175709698
## CI report:
* 5f876303cc2ed8f203f8db9f3dea972e3a28f0b7 UNKNOWN
* 93ddf4151fedb0698e0c5c56d69b4b866626d393 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24440)
* 0dbffbf92f2cb18861621be3e216e65a03129cf6 UNKNOWN
Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]
hudi-bot commented on PR #11465: URL: https://github.com/apache/hudi/pull/11465#issuecomment-2175710030
## CI report:
* 6d56ddcb8eae62dcd16b180616269a59afa9df28 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24441)
* 4e0ba2e3bc68df9bb6b0de2a50f60ba86fa68508 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2)
* 529cf1aad669ec04f018d0ad0f176e7aebd42bf7 UNKNOWN
Re: [PR] [HUDI-7874] Ensure Parquet can interoperate different level structures [hudi]
hudi-bot commented on PR #11461: URL: https://github.com/apache/hudi/pull/11461#issuecomment-2175709905
## CI report:
* d538bb2c8d4ba5a8da23034338b080f33d132888 UNKNOWN
[jira] [Commented] (HUDI-4096) Sync timeline from embedded timeline server in flink pipline
[ https://issues.apache.org/jira/browse/HUDI-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855845#comment-17855845 ] Qijun Fu commented on HUDI-4096: I think this [pr|https://github.com/apache/hudi/pull/9651] has already fixed this?
> Sync timeline from embedded timeline server in flink pipeline
>
> Key: HUDI-4096
> URL: https://issues.apache.org/jira/browse/HUDI-4096
> Project: Apache Hudi
> Issue Type: New Feature
> Components: flink
> Reporter: sivabalan narayanan
> Assignee: Danny Chen
> Priority: Major
>
> At present, in the Flink-Hudi pipeline, each task will scan the meta directory to obtain the latest timeline, which will cause frequent listing operations on HDFS and cause a lot of pressure.
> A proposal is that we can modify the way to get the timeline in the Flink-Hudi pipeline and pull the active timeline through the embedded timeline server.
Re: [PR] [HUDI-7892] Building workload support set parallelism [hudi]
danny0405 commented on code in PR #11463: URL: https://github.com/apache/hudi/pull/11463#discussion_r1644107676 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java: ## @@ -157,6 +157,9 @@ public HoodieWriteMetadata<HoodieData<WriteStatus>> execute(HoodieData<HoodieRecord<T>> inputRecords) { HoodieData<HoodieRecord<T>> inputRecordsWithClusteringUpdate = clusteringHandleUpdate(inputRecords); + if (config.getBuildWorkloadParallelism() > 0) { + inputRecordsWithClusteringUpdate = inputRecordsWithClusteringUpdate.repartition(config.getBuildWorkloadParallelism()); Review Comment: > and this is an existing question in other logic; adding a new param is not friendly to the user. I have a PR optimizing the partition inference problem, and it can also solve your problem This would be nice; the fix looks promising.
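The guard in the diff above boils down to: repartition only when the user configured a positive parallelism, otherwise leave the input's existing partitioning alone. As a sketch of that decision logic (function name illustrative, not from the PR):

```python
def effective_parallelism(configured: int, current_partitions: int) -> int:
    """Mirror of the proposed guard: a configured value <= 0 means
    'not set', so the input's existing partition count is kept and no
    repartition (and thus no shuffle) is triggered."""
    return configured if configured > 0 else current_partitions
```

This is also why the reviewers prefer inferring the parallelism automatically: a shuffle is only worth paying for when the configured value actually differs from a sensible default.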
Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]
hudi-bot commented on PR #11465: URL: https://github.com/apache/hudi/pull/11465#issuecomment-2175617946
## CI report:
* 6d56ddcb8eae62dcd16b180616269a59afa9df28 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24441)
* 4e0ba2e3bc68df9bb6b0de2a50f60ba86fa68508 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2)
Re: [I] Fail to add default partition [hudi]
danny0405 commented on issue #10154: URL: https://github.com/apache/hudi/issues/10154#issuecomment-2175570868

Thanks for the feedback @CaesarWangX. Did you try HMS as the sync mode? The 1st issue is unexpected and should be a bug: the motive was to keep the default partition name in sync with Hive, but it now causes the problems reported by Hive. For the 2nd, there may be no easy way to stay compatible with historical data, because the partition path is a hotspot code path and we cannot weigh the ramifications for historical values on every record. If you use Flink for ingestion, there is a config option named `partition.default_name` to switch to another default value as needed.
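For reference, the `partition.default_name` option mentioned above is set in the table's `WITH` clause in Flink SQL. A hedged illustration (the table name, columns, and path are made up; only the option key comes from the comment):

```sql
-- Illustrative only: table/columns/path are hypothetical. The option overrides
-- the name used for the default partition (i.e. null/empty partition values)
-- when writing with the Flink connector.
CREATE TABLE hudi_sink (
  id BIGINT,
  ts TIMESTAMP(3),
  dt STRING
) PARTITIONED BY (dt) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_sink',
  'partition.default_name' = '__DEFAULT_PARTITION__'
);
```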
Re: [PR] [HUDI-7874] Fix Hudi being able to read 2-level structure [hudi]
VitoMakarevich commented on PR #11450: URL: https://github.com/apache/hudi/pull/11450#issuecomment-2175605541

Also, I see that this block

```
if (!legacyMode) {
  requestedSchema = new MessageType(requestedSchema.getName(), convertLegacyMap(requestedSchema.getFields()));
}
```

is redundant: in all cases, the `requestedSchema` fetched after setting `AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE` equals the schema produced by this block, meaning `AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE` already does the correct conversion. So this code block should be removed to avoid confusion.
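For context on the "2-level" vs "3-level" terminology in this PR thread: Parquet has a legacy two-level list encoding (what `WRITE_OLD_LIST_STRUCTURE=true` produces) and the standard three-level encoding from the parquet-format LIST spec. A sketch of both, where the `tags` field is an invented example but the `array`/`list`/`element` names follow the spec:

```
-- legacy 2-level encoding (old Avro writer behavior): the repeated field
-- directly carries the element type, so elements cannot be null
optional group tags (LIST) {
  repeated binary array (STRING);
}

-- standard 3-level encoding: a repeated "list" group wraps an optional
-- "element", so both the list and its elements can be null
optional group tags (LIST) {
  repeated group list {
    optional binary element (STRING);
  }
}
```

A reader must recognize both shapes to read files written under either setting, which is what this fix is about.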
Re: [PR] [HUDI-7874] Avro fix read 2 level and 3 level files [hudi]
hudi-bot commented on PR #11465: URL: https://github.com/apache/hudi/pull/11465#issuecomment-2175600718

## CI report:

* 6d56ddcb8eae62dcd16b180616269a59afa9df28 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24441)
* 4e0ba2e3bc68df9bb6b0de2a50f60ba86fa68508 UNKNOWN
Re: [I] [SUPPORT] Caused by: org.apache.hudi.exception.HoodieException: Executor executes action [commits the instant 20240618064120870] error [hudi]
danny0405 commented on issue #11466: URL: https://github.com/apache/hudi/issues/11466#issuecomment-2175579240

I see you set the option `options.put(FlinkOptions.IGNORE_FAILED.key(), "false");`. It looks like there is an error from the parquet writers that is collected back to the coordinator, so the error is reported at commit time.
Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() // to branch-0.x [hudi]
hudi-bot commented on PR #11437: URL: https://github.com/apache/hudi/pull/11437#issuecomment-2175474243

## CI report:

* 520894319e26fca9b1b28a513be1273aba13edb9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24443)
[jira] [Commented] (HUDI-6286) Overwrite mode should not delete old data
[ https://issues.apache.org/jira/browse/HUDI-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855820#comment-17855820 ]

Geser Dugarov commented on HUDI-6286:

Note that in HoodieWriteUtils.validateTableConfig() we skip all conflict checks between new and existing table configurations when the save mode is Overwrite.

> Overwrite mode should not delete old data
>
> Key: HUDI-6286
> URL: https://issues.apache.org/jira/browse/HUDI-6286
> Project: Apache Hudi
> Issue Type: Bug
> Components: spark, writer-core
> Reporter: Hui An
> Assignee: Hui An
> Priority: Major
> Fix For: 1.1.0
>
> https://github.com/apache/hudi/pull/8076/files#r1127283648
> For *Overwrite* mode, we should not delete the basePath. Just overwrite the existing data.

-- This message was sent by Atlassian Jira (v8.20.10#820010)