[GitHub] [hudi] codecov-io commented on pull request #2634: [HUDI-1662] Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
codecov-io commented on pull request #2634: URL: https://github.com/apache/hudi/pull/2634#issuecomment-791203944

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2634?src=pr=h1) Report
> Merging [#2634](https://codecov.io/gh/apache/hudi/pull/2634?src=pr=desc) (ccf9a8f) into [master](https://codecov.io/gh/apache/hudi/commit/899ae70fdb70c1511c099a64230fd91b2fe8d4ee?el=desc) (899ae70) will **increase** coverage by `10.05%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2634/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2634?src=pr=tree)

```diff
@@              Coverage Diff              @@
##             master    #2634       +/-   ##
=============================================
+ Coverage     51.58%   61.64%   +10.05%
+ Complexity     3285      325     -2960
=============================================
  Files           446       53      -393
  Lines         20409     1963    -18446
  Branches       2116      235     -1881
=============================================
- Hits          10528     1210     -9318
+ Misses         9003      630     -8373
+ Partials        878      123      -755
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `61.64% <ø> (-7.81%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2634?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...ies/exception/HoodieSnapshotExporterException.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2V4Y2VwdGlvbi9Ib29kaWVTbmFwc2hvdEV4cG9ydGVyRXhjZXB0aW9uLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==) | `5.17% <0.00%> (-83.63%)` | `0.00% <0.00%> (-28.00%)` | | | [...hudi/utilities/schema/JdbcbasedSchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9KZGJjYmFzZWRTY2hlbWFQcm92aWRlci5qYXZh) | `0.00% <0.00%> (-72.23%)` | `0.00% <0.00%> (-2.00%)` | | | [...he/hudi/utilities/transform/AWSDmsTransformer.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3RyYW5zZm9ybS9BV1NEbXNUcmFuc2Zvcm1lci5qYXZh) | `0.00% <0.00%> (-66.67%)` | `0.00% <0.00%> (-2.00%)` | | | [...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=) | `41.86% <0.00%> (-22.68%)` | `27.00% <0.00%> (-6.00%)` | | | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `70.34% <0.00%> (-0.37%)` | `53.00% <0.00%> (+1.00%)` | :arrow_down: | | 
[...nal/HoodieBulkInsertDataInternalWriterFactory.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3BhcmsyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2ludGVybmFsL0hvb2RpZUJ1bGtJbnNlcnREYXRhSW50ZXJuYWxXcml0ZXJGYWN0b3J5LmphdmE=) | | | | | [.../apache/hudi/common/fs/TimedFSDataInputStream.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL1RpbWVkRlNEYXRhSW5wdXRTdHJlYW0uamF2YQ==) | | | | | [...g/apache/hudi/exception/HoodieRemoteException.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZVJlbW90ZUV4Y2VwdGlvbi5qYXZh) | | | | | [...a/org/apache/hudi/common/util/ValidationUtils.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvVmFsaWRhdGlvblV0aWxzLmphdmE=) | | | | | ... and [390
[GitHub] [hudi] garyli1019 commented on pull request #2634: [HUDI-1662] Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
garyli1019 commented on pull request #2634: URL: https://github.com/apache/hudi/pull/2634#issuecomment-791202125 @xiarixiaoyao your force push already triggered the CI. Do you mean JIRA contributor access? If so, would you send an email to the dev mailing list with your JIRA ID? That's how we usually grant access to new contributors. Thanks This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xiarixiaoyao commented on pull request #2634: [HUDI-1662] Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
xiarixiaoyao commented on pull request #2634: URL: https://github.com/apache/hudi/pull/2634#issuecomment-791188663

Thank you @garyli1019, but I don't have permission to trigger the CI; could you help? BTW, could you grant me contributor permission? Thank you.
[GitHub] [hudi] satishkotha commented on a change in pull request #2635: [WIP] [DO NOT MERGE] sample code for measuring hfile performance with column ranges
satishkotha commented on a change in pull request #2635: URL: https://github.com/apache/hudi/pull/2635#discussion_r588059204

## File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/io/storage/TestHoodieFileWriterFactory.java

## @@ -64,4 +80,246 @@ public void testGetFileWriter() throws IOException {
     }, "should fail since log storage writer is not supported yet.");
     assertTrue(thrown.getMessage().contains("format not supported yet."));
   }
+
+  // key: full file path (/tmp/.../partition/file-000.parquet, value: column range
+  @Test
+  public void testPerformanceRangeKeyPartitionFile() throws IOException {
+    final String instantTime = "100";
+    final HoodieWriteConfig cfg = getConfig();
+    final Path hfilePath = new Path(basePath + "/hfile_partition/f1_1-0-1_000.hfile");
+    HoodieTable table = HoodieSparkTable.create(cfg, context, metaClient);
+    HoodieHFileWriter hfileWriter = (HoodieHFileWriter) HoodieFileWriterFactory.getFileWriter(instantTime,
+        hfilePath, table, cfg, HoodieMetadataRecord.SCHEMA$, supplier);
+    Random random = new Random();
+
+    int numPartitions = 1000;
+    int avgFilesPerPartition = 10;
+
+    long startTime = System.currentTimeMillis();
+    List partitions = new ArrayList<>();
+    for (int i = 0; i < numPartitions; i++) {
+      String partitionPath = "partition-" + String.format("%010d", i);
+      partitions.add(partitionPath);
+      for (int j = 0; j < avgFilesPerPartition; j++) {
+        String filePath = "file-" + String.format("%010d", j) + "_1-0-1_000.parquet";
+        int max = random.nextInt();
+        if (max < 0) {
+          max = -max;
+        }
+        int min = random.nextInt(max);
+
+        HoodieKey key = new HoodieKey(partitionPath + filePath, partitionPath);
+        GenericRecord rec = new GenericData.Record(HoodieMetadataRecord.SCHEMA$);
+        rec.put("key", key.getRecordKey());
+        rec.put("type", 2);
+        rec.put("rangeIndexMetadata", HoodieRangeIndexInfo.newBuilder().setMax("" + max).setMin("" + min).setIsDeleted(false).build());
+        hfileWriter.writeAvro(key.getRecordKey(), rec);
+      }
+    }
+
+    hfileWriter.close();
+    long durationInMs = System.currentTimeMillis() - startTime;
+    System.out.println("Time taken to generate & write: " + durationInMs + " ms. file path: " + hfilePath
+        + " File size: " + FSUtils.getFileSize(metaClient.getFs(), hfilePath));
+
+    CacheConfig cacheConfig = new CacheConfig(hadoopConf);
+    cacheConfig.setCacheDataInL1(false);
+    HoodieHFileReader reader = new HoodieHFileReader(hadoopConf, hfilePath, cacheConfig);
+    long duration = 0;
+    int numRuns = 1000;
+    long numRecordsInRange = 0;
+    for (int i = 0; i < numRuns; i++) {
+      int partitionPicked = Math.max(0, partitions.size() - 30);
+      long start = System.currentTimeMillis();
+      Map records = reader.getRecordsInRange(partitions.get(partitionPicked), partitions.get(partitions.size() - 1));
+      duration += (System.currentTimeMillis() - start);
+      numRecordsInRange += records.size();
+    }
+    double avgDuration = duration / (double) numRuns;
+    double avgRecordsFetched = numRecordsInRange / (double) numRuns;
+    System.out.println("Average time taken to lookup a range: " + avgDuration + "ms. Avg number records: " + avgRecordsFetched);
+  }
+
+  // key: partition (partition), value: map (filePath -> column range)
+  @Test
+  public void testPerformanceRangeKeyPartition() throws IOException {
+    final String instantTime = "100";
+    final HoodieWriteConfig cfg = getConfig();
+    final Path hfilePath = new Path(basePath + "/hfile_partition/f1_1-0-1_000.hfile");
+    HoodieTable table = HoodieSparkTable.create(cfg, context, metaClient);
+    HoodieHFileWriter hfileWriter = (HoodieHFileWriter) HoodieFileWriterFactory.getFileWriter(instantTime,
+        hfilePath, table, cfg, HoodieMetadataRecord.SCHEMA$, supplier);
+    Random random = new Random();
+
+    int numPartitions = 1;
+    int avgFilesPerPartition = 10;
+
+    long startTime = System.currentTimeMillis();
+    List partitions = new ArrayList<>();
+    for (int i = 0; i < numPartitions; i++) {
+      String partitionPath = "partition-" + String.format("%010d", i);
+      partitions.add(partitionPath);
+      Map fileToRangeInfo = new HashMap<>();
+      for (int j = 0; j < avgFilesPerPartition; j++) {
+        String filePath = "file-" + String.format("%010d", j) + "_1-0-1_000.parquet";
+        int max = random.nextInt();
+        if (max < 0) {
+          max = -max;
+        }
+        int min = random.nextInt(max);
+        fileToRangeInfo.put(filePath, HoodieRangeIndexInfo.newBuilder().setMax("" + max).setMin("" + min).setIsDeleted(false).build());
+      }
+
+      HoodieKey key = new HoodieKey(partitionPath, partitionPath);
+      GenericRecord rec = new GenericData.Record(HoodieMetadataRecord.SCHEMA$);
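A side note on the random range generation in the benchmark above: `int max = random.nextInt(); if (max < 0) max = -max;` can still produce a non-positive bound, because negating `Integer.MIN_VALUE` overflows back to `Integer.MIN_VALUE` and zero is also possible, and `Random.nextInt(bound)` throws `IllegalArgumentException` for any non-positive bound. A hedged sketch of a safer variant (illustrative only, not part of the PR; the class and method names are made up for the example):

```java
import java.util.Random;

public class RangePair {
    // Draw a (min, max) pair with 0 <= min < max.
    // random.nextInt(Integer.MAX_VALUE - 1) yields a value in [0, MAX_VALUE - 2];
    // adding 1 gives a strictly positive bound, so nextInt(max) never throws.
    static int[] randomRange(Random random) {
        int max = random.nextInt(Integer.MAX_VALUE - 1) + 1; // in [1, MAX_VALUE - 1]
        int min = random.nextInt(max);                       // in [0, max)
        return new int[] {min, max};
    }
}
```

This keeps the benchmark's intent (a random column range per file) while avoiding the rare crash on an unlucky draw.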
[GitHub] [hudi] satishkotha opened a new pull request #2635: [WIP] [DO NOT MERGE] sample code for measuring hfile performance with column ranges
satishkotha opened a new pull request #2635: URL: https://github.com/apache/hudi/pull/2635

## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contributing.html before opening a pull request.*

## What is the purpose of the pull request

*(For example: This pull request adds quick-start document.)*

## Brief change log

*(for example:)*
- *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

## Verify this pull request

*(Please pick either of the following options)*

This pull request is a trivial rework / code cleanup without any test coverage.

*(or)*

This pull request is already covered by existing tests, such as *(please describe tests)*.

(or)

This change added tests and can be verified as follows:

*(example:)*
- *Added integration tests for end-to-end.*
- *Added HoodieClientWriteTest to verify the change.*
- *Manually verified the change by running a job locally.*

## Committer checklist

 - [ ] Has a corresponding JIRA in PR title & commit
 - [ ] Commit message is descriptive of the change
 - [ ] CI is green
 - [ ] Necessary doc changes done or have another open PR
 - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[jira] [Updated] (HUDI-1636) Support Builder Pattern To Build Table Properties For HoodieTableConfig
[ https://issues.apache.org/jira/browse/HUDI-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

vinoyang updated HUDI-1636:
---------------------------
    Fix Version/s: 0.8.0

> Support Builder Pattern To Build Table Properties For HoodieTableConfig
> -----------------------------------------------------------------------
>
>                 Key: HUDI-1636
>                 URL: https://issues.apache.org/jira/browse/HUDI-1636
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Common Core
>            Reporter: pengzhiwei
>            Assignee: pengzhiwei
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
> Currently we use HoodieTableMetaClient#initTableType to init the table properties. As the number of table properties increases, it is hard to maintain the code of the initTableType method, especially when we add a new table property. Supporting the builder pattern to build the table config properties may solve this problem.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
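As a rough illustration of what the issue asks for, here is a minimal, self-contained builder sketch. This is not the actual HoodieTableConfig API; the class name and property keys are placeholders chosen for the example.

```java
import java.util.Properties;

// Illustrative only: a fluent builder that accumulates table properties and
// materializes them in one step. Adding a new table property then means adding
// one setter, instead of threading another positional argument through
// initTableType and every one of its call sites.
public class TablePropertyBuilder {
    private final Properties props = new Properties();

    public TablePropertyBuilder setTableType(String tableType) {
        props.setProperty("hoodie.table.type", tableType);
        return this;
    }

    public TablePropertyBuilder setTableName(String tableName) {
        props.setProperty("hoodie.table.name", tableName);
        return this;
    }

    public TablePropertyBuilder setPayloadClassName(String payloadClass) {
        props.setProperty("hoodie.payload.class", payloadClass);
        return this;
    }

    public Properties build() {
        return props;
    }
}
```

A call site then reads as a self-describing chain, e.g. `new TablePropertyBuilder().setTableType("MERGE_ON_READ").setTableName("trips").build()`, which stays readable as the number of properties grows.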
[hudi] branch master updated: [HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig (#2596)
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new bc883db  [HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig (#2596)

bc883db is described below

commit bc883db5de5832fa429bbb04a35d3606fdacdb2a
Author: pengzhiwei
AuthorDate: Fri Mar 5 14:10:27 2021 +0800

    [HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig (#2596)
---
 .../org/apache/hudi/cli/commands/TableCommand.java |  12 +-
 .../metadata/HoodieBackedTableMetadataWriter.java  |  11 +-
 .../TestHoodieClientOnCopyOnWriteStorage.java      |  23 +-
 .../java/org/apache/hudi/client/TestMultiFS.java   |  14 +-
 .../hudi/client/TestTableSchemaEvolution.java      |  15 +-
 .../hudi/testutils/FunctionalTestHarness.java      |  12 +-
 .../hudi/common/table/HoodieTableMetaClient.java   | 276 +
 .../table/timeline/TestHoodieActiveTimeline.java   |   8 +-
 .../hudi/common/testutils/HoodieTestUtils.java     |   9 +-
 .../java/HoodieJavaWriteClientExample.java         |   7 +-
 .../examples/spark/HoodieWriteClientExample.java   |   7 +-
 .../java/org/apache/hudi/util/StreamerUtil.java    |  16 +-
 .../hudi/integ/testsuite/HoodieTestSuiteJob.java   |   8 +-
 .../org/apache/hudi/HoodieSparkSqlWriter.scala     |  26 +-
 .../hudi/functional/TestStreamingSource.scala      |  14 +-
 .../apache/hudi/hive/testutils/HiveTestUtil.java   |  22 +-
 .../apache/hudi/utilities/HDFSParquetImporter.java |   8 +-
 .../utilities/deltastreamer/BootstrapExecutor.java |  14 +-
 .../hudi/utilities/deltastreamer/DeltaSync.java    |  20 +-
 .../functional/TestHoodieSnapshotExporter.java     |   9 +-
 .../utilities/testutils/UtilitiesTestBase.java     |   7 +-
 21 files changed, 341 insertions(+), 197 deletions(-)

diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/TableCommand.java b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/TableCommand.java
index 168de26..d25e0c8 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/TableCommand.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/TableCommand.java
@@ -22,7 +22,6 @@
 import org.apache.hudi.cli.HoodieCLI;
 import org.apache.hudi.cli.HoodiePrintHelper;
 import org.apache.hudi.cli.TableHeader;
 import org.apache.hudi.common.fs.ConsistencyGuardConfig;
-import org.apache.hudi.common.model.HoodieTableType;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.exception.TableNotFoundException;
@@ -106,10 +105,13 @@ public class TableCommand implements CommandMarker {
       throw new IllegalStateException("Table already existing in path : " + path);
     }
-    final HoodieTableType tableType = HoodieTableType.valueOf(tableTypeStr);
-    HoodieTableMetaClient.initTableType(HoodieCLI.conf, path, tableType, name, archiveFolder,
-        payloadClass, layoutVersion);
-
+    HoodieTableMetaClient.withPropertyBuilder()
+      .setTableType(tableTypeStr)
+      .setTableName(name)
+      .setArchiveLogFolder(archiveFolder)
+      .setPayloadClassName(payloadClass)
+      .setTimelineLayoutVersion(layoutVersion)
+      .initTable(HoodieCLI.conf, path);
     // Now connect to ensure loading works
     return connect(path, layoutVersion, false, 0, 0, 0);
   }
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index 6629410..dbd678f 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -288,9 +288,14 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableMeta
     String createInstantTime = latestInstant.map(HoodieInstant::getTimestamp).orElse(SOLO_COMMIT_TIMESTAMP);
     LOG.info("Creating a new metadata table in " + metadataWriteConfig.getBasePath() + " at instant " + createInstantTime);
-    HoodieTableMetaClient.initTableType(hadoopConf.get(), metadataWriteConfig.getBasePath(),
-    HoodieTableType.MERGE_ON_READ, tableName, "archived", HoodieMetadataPayload.class.getName(),
-    HoodieFileFormat.HFILE.toString());
+    HoodieTableMetaClient.withPropertyBuilder()
+      .setTableType(HoodieTableType.MERGE_ON_READ)
+      .setTableName(tableName)
+      .setArchiveLogFolder("archived")
+      .setPayloadClassName(HoodieMetadataPayload.class.getName())
+      .setBaseFileFormat(HoodieFileFormat.HFILE.toString())
+      .initTable(hadoopConf.get(), metadataWriteConfig.getBasePath());
+
     initTableMetadata();

     // List all partitions in the basePath of the containing dataset
diff --git
[jira] [Closed] (HUDI-1636) Support Builder Pattern To Build Table Properties For HoodieTableConfig
[ https://issues.apache.org/jira/browse/HUDI-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

vinoyang closed HUDI-1636.
--------------------------
    Resolution: Done

> Support Builder Pattern To Build Table Properties For HoodieTableConfig
> -----------------------------------------------------------------------
>
>                 Key: HUDI-1636
>                 URL: https://issues.apache.org/jira/browse/HUDI-1636
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Common Core
>            Reporter: pengzhiwei
>            Assignee: pengzhiwei
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.8.0
[GitHub] [hudi] yanghua merged pull request #2596: [HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig
yanghua merged pull request #2596: URL: https://github.com/apache/hudi/pull/2596
[GitHub] [hudi] yanghua commented on a change in pull request #2596: [HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig
yanghua commented on a change in pull request #2596: URL: https://github.com/apache/hudi/pull/2596#discussion_r588054717

## File path: hudi-examples/src/main/java/org/apache/hudi/examples/java/HoodieJavaWriteClientExample.java

## @@ -72,8 +72,11 @@ public static void main(String[] args) throws Exception {
     Path path = new Path(tablePath);
     FileSystem fs = FSUtils.getFs(tablePath, hadoopConf);
     if (!fs.exists(path)) {
-      HoodieTableMetaClient.initTableType(hadoopConf, tablePath, HoodieTableType.valueOf(tableType),
-          tableName, HoodieAvroPayload.class.getName());
+      HoodieTableConfig.propertyBuilder()
+        .setTableType(tableType)
+        .setTableName(tableName)
+        .setPayloadClassName(HoodieAvroPayload.class.getName())
+        .initTable(hadoopConf, tablePath);

Review comment: Agree to move forward and do some subsequent refactoring.
[hudi] branch master updated: [HUDI-1655] Support custom date format and fix unsupported exception in DatePartitionPathSelector (#2621)
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new f53bca4  [HUDI-1655] Support custom date format and fix unsupported exception in DatePartitionPathSelector (#2621)

f53bca4 is described below

commit f53bca404f1482e0e99ad683dd29bfaff8bfb8ab
Author: Raymond Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Thu Mar 4 21:01:51 2021 -0800

    [HUDI-1655] Support custom date format and fix unsupported exception in DatePartitionPathSelector (#2621)

    - Add a config to allow parsing a custom date format in `DatePartitionPathSelector`. Currently it assumes the date partition string is in the format `yyyy-MM-dd`.
    - Fix a bug where `UnsupportedOperationException` was thrown when sorting `eligibleFiles` in place. Changed to sort it and store the result in a new list.
---
 .../sources/helpers/DatePartitionPathSelector.java | 30 +++----
 .../helpers/TestDatePartitionPathSelector.java     | 38 ++----
 2 files changed, 41 insertions(+), 27 deletions(-)

diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DatePartitionPathSelector.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DatePartitionPathSelector.java
index 2cedb6c..c22657f 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DatePartitionPathSelector.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DatePartitionPathSelector.java
@@ -35,13 +35,16 @@
 import org.apache.log4j.Logger;
 import org.apache.spark.api.java.JavaSparkContext;

 import java.time.LocalDate;
+import java.time.format.DateTimeFormatter;
 import java.util.ArrayList;
 import java.util.Comparator;
 import java.util.List;
 import java.util.stream.Collectors;

 import static org.apache.hudi.utilities.sources.helpers.DFSPathSelector.Config.ROOT_INPUT_PATH_PROP;
+import static org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.Config.DATE_FORMAT;
 import static org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.Config.DATE_PARTITION_DEPTH;
+import static org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.Config.DEFAULT_DATE_FORMAT;
 import static org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.Config.DEFAULT_DATE_PARTITION_DEPTH;
 import static org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.Config.DEFAULT_LOOKBACK_DAYS;
 import static org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.Config.DEFAULT_PARTITIONS_LIST_PARALLELISM;
@@ -59,12 +62,16 @@
  * The date based partition is expected to be of the format '=yyyy-mm-dd' or
  * 'yyyy-mm-dd'. The date partition can be at any level. For ex. the partition path can be of the
  * form `` or
- * `/`
+ * `/`.
+ *
+ * The date based partition format can be configured via this property
+ * hoodie.deltastreamer.source.dfs.datepartitioned.date.format
  */
 public class DatePartitionPathSelector extends DFSPathSelector {

   private static volatile Logger LOG = LogManager.getLogger(DatePartitionPathSelector.class);

+  private final String dateFormat;
   private final int datePartitionDepth;
   private final int numPrevDaysToList;
   private final LocalDate fromDate;
@@ -73,6 +80,9 @@
   /** Configs supported. */
   public static class Config {
+    public static final String DATE_FORMAT = "hoodie.deltastreamer.source.dfs.datepartitioned.date.format";
+    public static final String DEFAULT_DATE_FORMAT = "yyyy-MM-dd";
+
     public static final String DATE_PARTITION_DEPTH = "hoodie.deltastreamer.source.dfs.datepartitioned.selector.depth";
     public static final int DEFAULT_DATE_PARTITION_DEPTH = 0; // Implies no (date) partition
@@ -84,7 +94,6 @@
     public static final String CURRENT_DATE = "hoodie.deltastreamer.source.dfs.datepartitioned.selector.currentdate";
-
     public static final String PARTITIONS_LIST_PARALLELISM = "hoodie.deltastreamer.source.dfs.datepartitioned.selector.parallelism";
     public static final int DEFAULT_PARTITIONS_LIST_PARALLELISM = 20;
@@ -96,6 +105,7 @@
      * datePartitionDepth = 0 is same as basepath and there is no partition. In which case
      * this path selector would be a no-op and lists all paths under the table basepath.
      */
+    dateFormat = props.getString(DATE_FORMAT, DEFAULT_DATE_FORMAT);
     datePartitionDepth = props.getInteger(DATE_PARTITION_DEPTH, DEFAULT_DATE_PARTITION_DEPTH);
     // If not specified the
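The two fixes in this commit can be sketched independently of the Hudi classes (the names below are illustrative, not the actual DatePartitionPathSelector API): parsing a partition string with a caller-supplied pattern instead of a hard-coded one, and sorting into a new list because the incoming list may be unmodifiable, in which case an in-place sort throws UnsupportedOperationException.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class DatePartitionSketch {
    // Parse a date-based partition with a configurable pattern, mirroring the
    // intent of the new hoodie.deltastreamer.source.dfs.datepartitioned.date.format config.
    static LocalDate parsePartition(String partition, String pattern) {
        return LocalDate.parse(partition, DateTimeFormatter.ofPattern(pattern));
    }

    // Sort into a new list rather than in place: an unmodifiable list
    // (e.g. List.of(...)) throws UnsupportedOperationException on sort().
    static List<String> sortedCopy(List<String> eligibleFiles) {
        return eligibleFiles.stream()
            .sorted(Comparator.naturalOrder())
            .collect(Collectors.toList());
    }
}
```

With this shape, a partition laid out as `2021/03/04` can be handled simply by passing the pattern `yyyy/MM/dd`, and the caller never mutates the list it was given.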
[GitHub] [hudi] xushiyan merged pull request #2621: [HUDI-1655] Support custom date format and fix unsupported exception in DatePartitionPathSelector
xushiyan merged pull request #2621: URL: https://github.com/apache/hudi/pull/2621
[GitHub] [hudi] codecov-io edited a comment on pull request #2625: [1568] Fixing spark3 bundles
codecov-io edited a comment on pull request #2625: URL: https://github.com/apache/hudi/pull/2625#issuecomment-789780238

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=h1) Report
> Merging [#2625](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=desc) (369d44f) into [master](https://codecov.io/gh/apache/hudi/commit/527175ab0be59d15f035a09a0466bfeca9abfb23?el=desc) (527175a) will **decrease** coverage by `4.81%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2625/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=tree)

```diff
@@             Coverage Diff              @@
##             master    #2625      +/-   ##
============================================
- Coverage     50.94%   46.13%   -4.82%
- Complexity     3169     3171      +2
============================================
  Files           433      462     +29
  Lines         19814    21782   +1968
  Branches       2034     2315    +281
============================================
- Hits          10095    10049     -46
- Misses         8900    10883   +1983
- Partials        819      850     +31
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `36.87% <ø> (-0.04%)` | `0.00 <ø> (ø)` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `51.45% <ø> (+0.01%)` | `0.00 <ø> (ø)` | |
| hudiflink | `50.31% <ø> (+7.09%)` | `0.00 <ø> (ø)` | |
| hudihadoopmr | `33.48% <ø> (+0.32%)` | `0.00 <ø> (ø)` | |
| hudisparkdatasource | `69.67% <ø> (-0.07%)` | `0.00 <ø> (ø)` | |
| hudisync | `49.62% <ø> (+1.00%)` | `0.00 <ø> (ø)` | |
| huditimelineservice | `64.36% <ø> (-2.13%)` | `0.00 <ø> (ø)` | |
| hudiutilities | `9.61% <ø> (-59.83%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | | |
[GitHub] [hudi] codecov-io edited a comment on pull request #2625: [1568] Fixing spark3 bundles
codecov-io edited a comment on pull request #2625: URL: https://github.com/apache/hudi/pull/2625#issuecomment-789780238

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=h1) Report
> Merging [#2625](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=desc) (369d44f) into [master](https://codecov.io/gh/apache/hudi/commit/527175ab0be59d15f035a09a0466bfeca9abfb23?el=desc) (527175a) will **decrease** coverage by `6.62%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2625/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=tree)

```diff
@@             Coverage Diff              @@
##             master    #2625      +/-   ##
============================================
- Coverage     50.94%   44.32%   -6.63%
+ Complexity     3169     2976    -193
============================================
  Files           433      426      -7
  Lines         19814    20229    +415
  Branches       2034     2114     +80
============================================
- Hits          10095     8967   -1128
- Misses         8900    10525   +1625
+ Partials        819      737     -82
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `36.87% <ø> (-0.04%)` | `0.00 <ø> (ø)` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `51.45% <ø> (+0.01%)` | `0.00 <ø> (ø)` | |
| hudiflink | `50.31% <ø> (+7.09%)` | `0.00 <ø> (ø)` | |
| hudihadoopmr | `33.48% <ø> (+0.32%)` | `0.00 <ø> (ø)` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `49.62% <ø> (+1.00%)` | `0.00 <ø> (ø)` | |
| huditimelineservice | `64.36% <ø> (-2.13%)` | `0.00 <ø> (ø)` | |
| hudiutilities | `9.61% <ø> (-59.83%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | | |
[GitHub] [hudi] codecov-io edited a comment on pull request #2625: [1568] Fixing spark3 bundles
codecov-io edited a comment on pull request #2625: URL: https://github.com/apache/hudi/pull/2625#issuecomment-789780238 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=h1) Report > Merging [#2625](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=desc) (369d44f) into [master](https://codecov.io/gh/apache/hudi/commit/527175ab0be59d15f035a09a0466bfeca9abfb23?el=desc) (527175a) will **decrease** coverage by `7.40%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2625/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=tree) ```diff @@ Coverage Diff @@ ## master #2625 +/- ## - Coverage 50.94% 43.54% -7.41% + Complexity 3169 2786 -383 Files 433 405 -28 Lines 19814 18718 -1096 Branches 2034 1969 -65 - Hits 10095 8151 -1944 - Misses 8900 9905 +1005 + Partials 819 662 -157 ``` | Flag | Coverage Δ | Complexity Δ | | |---|---|---|---| | hudicli | `36.87% <ø> (-0.04%)` | `0.00 <ø> (ø)` | | | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudicommon | `51.45% <ø> (+0.01%)` | `0.00 <ø> (ø)` | | | hudiflink | `50.31% <ø> (+7.09%)` | `0.00 <ø> (ø)` | | | hudihadoopmr | `33.48% <ø> (+0.32%)` | `0.00 <ø> (ø)` | | | hudisparkdatasource | `?` | `?` | | | hudisync | `?` | `?` | | | huditimelineservice | `?` | `?` | | | hudiutilities | `9.61% <ø> (-59.83%)` | `0.00 <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | | |
[GitHub] [hudi] nsivabalan edited a comment on pull request #2625: [1568] Fixing spark3 bundles
nsivabalan edited a comment on pull request #2625: URL: https://github.com/apache/hudi/pull/2625#issuecomment-789746642 CC @vinothchandar @garyli1019 @bvaradar This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan removed a comment on pull request #2625: [1568] Adding hudi-spark3-bundle
nsivabalan removed a comment on pull request #2625: URL: https://github.com/apache/hudi/pull/2625#issuecomment-789746287 @umehrot2 @zhedoubushishi : Can you folks help me out here? I see we have added support for spark3 [here](https://github.com/apache/hudi/pull/2208). Did we test the bundle? I do see both spark2 and spark3 are included in the pom of hudi-spark-bundle, but how do I generate the jars for spark3? ``` org.apache.hudi:hudi-spark_${scala.binary.version} org.apache.hudi:hudi-spark2_${scala.binary.version} org.apache.hudi:hudi-spark3_2.12 ``` Because if I run the command below, all I see is the regular spark-bundle, which fails for MOR. So I guess it's not spark3. ``` mvn clean package -DskipTests -Dscala-2.12 -Dspark3 ``` With this patch, I have added a new bundle altogether and it works. But I wanted to see if I am missing something here.
[jira] [Created] (HUDI-1666) Refactor BaseCleanActionExecutor to return List
Nishith Agarwal created HUDI-1666: - Summary: Refactor BaseCleanActionExecutor to return List Key: HUDI-1666 URL: https://issues.apache.org/jira/browse/HUDI-1666 Project: Apache Hudi Issue Type: Improvement Components: Cleaner Reporter: Nishith Agarwal Assignee: Nishith Agarwal -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] xiarixiaoyao edited a comment on pull request #2634: [HUDI-1662] Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
xiarixiaoyao edited a comment on pull request #2634: URL: https://github.com/apache/hudi/pull/2634#issuecomment-791119070 cc @garyli1019 , could you take a look
[GitHub] [hudi] xiarixiaoyao edited a comment on pull request #2634: [HUDI-1662] Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
xiarixiaoyao edited a comment on pull request #2634: URL: https://github.com/apache/hudi/pull/2634#issuecomment-791119070
[GitHub] [hudi] xiarixiaoyao commented on pull request #2634: [HUDI-1662] Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
xiarixiaoyao commented on pull request #2634: URL: https://github.com/apache/hudi/pull/2634#issuecomment-791119070 cc @garyli1019
[jira] [Created] (HUDI-1665) Remove autoCommit option from BaseCommitActionExecutor
Nishith Agarwal created HUDI-1665: - Summary: Remove autoCommit option from BaseCommitActionExecutor Key: HUDI-1665 URL: https://issues.apache.org/jira/browse/HUDI-1665 Project: Apache Hudi Issue Type: Bug Components: Writer Core Reporter: Nishith Agarwal Assignee: Nishith Agarwal Currently, the option to autoCommit inside the action-executor framework creates 2 different code paths for committing data in Hudi: a) from AbstractHoodieWriteClient, b) from BaseCommitActionExecutor. We should refactor and remove the commit code path from the ActionExecutor.
[jira] [Assigned] (HUDI-1486) Remove pending rollback and move to cleaner
[ https://issues.apache.org/jira/browse/HUDI-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal reassigned HUDI-1486: - Assignee: Nishith Agarwal > Remove pending rollback and move to cleaner > --- > > Key: HUDI-1486 > URL: https://issues.apache.org/jira/browse/HUDI-1486 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Labels: pull-request-available > Fix For: 0.8.0
[jira] [Updated] (HUDI-1662) Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
[ https://issues.apache.org/jira/browse/HUDI-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1662: - Labels: pull-request-available (was: ) > Failed to query real-time view use hive/spark-sql when hudi mor table > contains dateType > > > Key: HUDI-1662 > URL: https://issues.apache.org/jira/browse/HUDI-1662 > Project: Apache Hudi > Issue Type: Bug > Components: Hive Integration >Affects Versions: 0.7.0 > Environment: hive 3.1.1 > spark 2.4.5 > hadoop 3.1.1 > suse os >Reporter: tao meng >Priority: Major > Labels: pull-request-available > Original Estimate: 24h > Remaining Estimate: 24h > > step1: prepare a raw DataFrame with DateType, and insert it into the Hudi MOR table > df_raw.withColumn("date", lit(Date.valueOf("2020-11-10"))) > merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g") > step2: prepare an update DataFrame with DateType, and upsert it into the Hudi MOR table > df_update = sql("select * from > huditest.bulkinsert_mor_10g_rt").withColumn("date", > lit(Date.valueOf("2020-11-11"))) > merge(df_update, "upsert", "huditest.bulkinsert_mor_10g") > > step3: query the mor_rt table with hive-beeline / spark-sql > use beeline/spark-sql to execute the statement select * from > huditest.bulkinsert_mor_10g_rt where primary_key = 1000; > then the following error occurs: > _java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be > cast to org.apache.hadoop.hive.serde2.io.DateWritableV2_ > > > Root cause analysis: > Hudi uses the Avro format to store log files, and Avro stores DateType as INT (the > type is INT but the logicalType is date). > When Hudi reads a log file and converts an Avro INT record to a > Writable, the logicalType is not respected, so the DateType is cast to > IntWritable. > See: > [https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java#L169] > > Modification plan: when casting an Avro INT type to a Writable, the logicalType must > be considered > case INT:
> if (schema.getLogicalType() != null && > schema.getLogicalType().getName().equals("date")) { > return new DateWritable((Integer) value); > } else { > return new IntWritable((Integer) value); > }
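The fix proposed in the ticket hinges on Avro's date logical type being an int that counts days since the Unix epoch. Below is a minimal, self-contained sketch of that dispatch; `intToWritable`, the plain `String` logical-type name, and `java.time.LocalDate` (standing in for Hadoop's `DateWritable`) are illustrative stand-ins so the snippet runs without Avro or Hadoop on the classpath — the real patch branches on `schema.getLogicalType()` inside `HoodieRealtimeRecordReaderUtils`.

```java
import java.time.LocalDate;

public class AvroDateDemo {
    // Hypothetical stand-in for the added branch: an Avro INT whose logicalType
    // is "date" counts days since 1970-01-01 and must become a date value,
    // not a plain int.
    static Object intToWritable(int value, String logicalTypeName) {
        if ("date".equals(logicalTypeName)) {
            // the real code would return new DateWritable(value)
            return LocalDate.ofEpochDay(value);
        }
        // the real code would return new IntWritable(value)
        return value;
    }

    public static void main(String[] args) {
        // 18576 days after the epoch is 2020-11-10, the date used in the report
        System.out.println(intToWritable(18576, "date"));
        System.out.println(intToWritable(18576, null));
    }
}
```

Without the logical-type check, both calls would come back as a bare int, which is exactly the `IntWritable` vs `DateWritableV2` mismatch the ClassCastException reports.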
[jira] [Updated] (HUDI-1486) Remove pending rollback and move to cleaner
[ https://issues.apache.org/jira/browse/HUDI-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1486: -- Status: Patch Available (was: In Progress) > Remove pending rollback and move to cleaner > --- > > Key: HUDI-1486 > URL: https://issues.apache.org/jira/browse/HUDI-1486 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Labels: pull-request-available > Fix For: 0.8.0
[jira] [Updated] (HUDI-1486) Remove pending rollback and move to cleaner
[ https://issues.apache.org/jira/browse/HUDI-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1486: -- Status: Open (was: New) > Remove pending rollback and move to cleaner > --- > > Key: HUDI-1486 > URL: https://issues.apache.org/jira/browse/HUDI-1486 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Labels: pull-request-available > Fix For: 0.8.0
[jira] [Updated] (HUDI-1486) Remove pending rollback and move to cleaner
[ https://issues.apache.org/jira/browse/HUDI-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1486: -- Status: In Progress (was: Open) > Remove pending rollback and move to cleaner > --- > > Key: HUDI-1486 > URL: https://issues.apache.org/jira/browse/HUDI-1486 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Labels: pull-request-available > Fix For: 0.8.0
[GitHub] [hudi] xiarixiaoyao opened a new pull request #2634: [HUDI-1662] Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
xiarixiaoyao opened a new pull request #2634: URL: https://github.com/apache/hudi/pull/2634 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull request Fixes the bug that querying the real-time view with hive/spark-sql fails when a hudi mor table contains dateType. test steps: step1: prepare a raw DataFrame with DateType, and insert it into the Hudi MOR table df_raw.withColumn("date", lit(Date.valueOf("2020-11-10"))) merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g") step2: prepare an update DataFrame with DateType, and upsert it into the Hudi MOR table df_update = sql("select * from huditest.bulkinsert_mor_10g_rt").withColumn("date", lit(Date.valueOf("2020-11-11"))) merge(df_update, "upsert", "huditest.bulkinsert_mor_10g") step3: query the mor_rt table with hive-beeline / spark-sql use beeline/spark-sql to execute the statement select * from huditest.bulkinsert_mor_10g_rt where primary_key = 1000; then the following error occurs: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.hive.serde2.io.DateWritableV2 ## Brief change log When hudi reads a log file and converts an Avro INT record to a Writable, the logicalType is not respected, so the DateType is cast to IntWritable. Therefore, when casting an Avro INT type to a Writable, the logicalType must be considered. ## Verify this pull request Existing UT tests ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[jira] [Updated] (HUDI-1655) Support custom date format and fix unsupported exception in DatePartitionPathSelector
[ https://issues.apache.org/jira/browse/HUDI-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1655: - Description: Add a config to allow parsing custom date format in {{DatePartitionPathSelector}}. Currently it assumes date partition string in the format of {{yyyy-MM-dd}}. Also eligibleFiles.sort() throws this exception {quote}java.lang.UnsupportedOperationException at java.util.AbstractList.set(AbstractList.java:132) at java.util.AbstractList$ListItr.set(AbstractList.java:426) at java.util.List.sort(List.java:482) at org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.getNextFilePathsAndMaxModificationTime(DatePartitionPathSelector.java:141) at org.apache.hudi.utilities.sources.ParquetDFSSource.fetchNextBatch(ParquetDFSSource.java:48) at org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43) at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:75) at org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:94) at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:338) at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:255) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:587) {quote} {{org.apache.hudi.client.common.HoodieSparkEngineContext#flatMap}} returns a list that can't be sorted in-place. was: Add a config to allow parsing custom date format in {{DatePartitionPathSelector}}. Currently it assumes date partition string in the format of {{yyyy-MM-dd}}.
Also eligibleFiles.sort() throws this exception {quote}java.lang.UnsupportedOperationException at java.util.AbstractList.set(AbstractList.java:132) at java.util.AbstractList$ListItr.set(AbstractList.java:426) at java.util.List.sort(List.java:482) at org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.getNextFilePathsAndMaxModificationTime(DatePartitionPathSelector.java:141) at org.apache.hudi.utilities.sources.ParquetDFSSource.fetchNextBatch(ParquetDFSSource.java:48) at org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43) at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:75) at org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:94) at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:338) at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:255) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:587) {quote} > Support custom date format and fix unsupported exception in > DatePartitionPathSelector > - > > Key: HUDI-1655 > URL: https://issues.apache.org/jira/browse/HUDI-1655 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Minor > Labels: pull-request-available > Fix For: 0.8.0 > > > Add a config to allow parsing custom date format in > {{DatePartitionPathSelector}}. Currently it assumes date partition string in > the format of {{yyyy-MM-dd}}.
> > Also eligibleFiles.sort() throws this exception > {quote}java.lang.UnsupportedOperationException at > java.util.AbstractList.set(AbstractList.java:132) at > java.util.AbstractList$ListItr.set(AbstractList.java:426) at > java.util.List.sort(List.java:482) at > org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.getNextFilePathsAndMaxModificationTime(DatePartitionPathSelector.java:141) > at > org.apache.hudi.utilities.sources.ParquetDFSSource.fetchNextBatch(ParquetDFSSource.java:48) > at > org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43) > at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:75) at > org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:94) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:338) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:255) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:587) > {quote} > > {{org.apache.hudi.client.common.HoodieSparkEngineContext#flatMap}} returns a > list that can't be sorted in-place.
[jira] [Updated] (HUDI-1655) Support custom date format and fix unsupported exception in DatePartitionPathSelector
[ https://issues.apache.org/jira/browse/HUDI-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1655: - Summary: Support custom date format and fix unsupported exception in DatePartitionPathSelector (was: Allow custom date format in DatePartitionPathSelector) > Support custom date format and fix unsupported exception in > DatePartitionPathSelector > - > > Key: HUDI-1655 > URL: https://issues.apache.org/jira/browse/HUDI-1655 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Minor > Labels: pull-request-available > Fix For: 0.8.0 > > > Add a config to allow parsing custom date format in > {{DatePartitionPathSelector}}. Currently it assumes date partition string in > the format of {{yyyy-MM-dd}}. > > Also eligibleFiles.sort() throws this exception > {quote}java.lang.UnsupportedOperationException at > java.util.AbstractList.set(AbstractList.java:132) at > java.util.AbstractList$ListItr.set(AbstractList.java:426) at > java.util.List.sort(List.java:482) at > org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.getNextFilePathsAndMaxModificationTime(DatePartitionPathSelector.java:141) > at > org.apache.hudi.utilities.sources.ParquetDFSSource.fetchNextBatch(ParquetDFSSource.java:48) > at > org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43) > at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:75) at > org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:94) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:338) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:255) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:587) > {quote}
[GitHub] [hudi] xushiyan commented on pull request #2621: [HUDI-1655] Support custom date format and fix unsupported exception in DatePartitionPathSelector
xushiyan commented on pull request #2621: URL: https://github.com/apache/hudi/pull/2621#issuecomment-791115825 > @xushiyan It would be also good to change the title of the PR to add more information about fixing the bug? @yanghua ok done. Pls check. Thanks.
[jira] [Updated] (HUDI-1655) Allow custom date format in DatePartitionPathSelector
[ https://issues.apache.org/jira/browse/HUDI-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1655: - Description: Add a config to allow parsing custom date format in {{DatePartitionPathSelector}}. Currently it assumes date partition string in the format of {{yyyy-MM-dd}}. Also eligibleFiles.sort() throws this exception {quote}java.lang.UnsupportedOperationException at java.util.AbstractList.set(AbstractList.java:132) at java.util.AbstractList$ListItr.set(AbstractList.java:426) at java.util.List.sort(List.java:482) at org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.getNextFilePathsAndMaxModificationTime(DatePartitionPathSelector.java:141) at org.apache.hudi.utilities.sources.ParquetDFSSource.fetchNextBatch(ParquetDFSSource.java:48) at org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43) at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:75) at org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:94) at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:338) at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:255) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:587) {quote} was:Add a config to allow parsing custom date format in {{DatePartitionPathSelector}}. Currently it assumes date partition string in the format of {{yyyy-MM-dd}}. > Allow custom date format in DatePartitionPathSelector > - > > Key: HUDI-1655 > URL: https://issues.apache.org/jira/browse/HUDI-1655 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Minor > Labels: pull-request-available > Fix For: 0.8.0 > > > Add a config to allow parsing custom date format in > {{DatePartitionPathSelector}}.
Currently it assumes date partition string in > the format of {{yyyy-MM-dd}}. > > Also eligibleFiles.sort() throws this exception > {quote}java.lang.UnsupportedOperationException at > java.util.AbstractList.set(AbstractList.java:132) at > java.util.AbstractList$ListItr.set(AbstractList.java:426) at > java.util.List.sort(List.java:482) at > org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.getNextFilePathsAndMaxModificationTime(DatePartitionPathSelector.java:141) > at > org.apache.hudi.utilities.sources.ParquetDFSSource.fetchNextBatch(ParquetDFSSource.java:48) > at > org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43) > at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:75) at > org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:94) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:338) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:255) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:587) > {quote}
[GitHub] [hudi] yanghua commented on pull request #2621: [HUDI-1655] Allow custom date format in DatePartitionPathSelector
yanghua commented on pull request #2621: URL: https://github.com/apache/hudi/pull/2621#issuecomment-791113338 @xushiyan It would be also good to change the title of the PR to add more information about fixing the bug?
[GitHub] [hudi] yanghua commented on a change in pull request #2621: [HUDI-1655] Allow custom date format in DatePartitionPathSelector
yanghua commented on a change in pull request #2621: URL: https://github.com/apache/hudi/pull/2621#discussion_r587994448 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DatePartitionPathSelector.java ## @@ -130,20 +140,19 @@ public DatePartitionPathSelector(TypedProperties props, Configuration hadoopConf FileSystem fs = new Path(path).getFileSystem(serializedConf.get()); return listEligibleFiles(fs, new Path(path), lastCheckpointTime).stream(); }, partitionsListParallelism); -// sort them by modification time. - eligibleFiles.sort(Comparator.comparingLong(FileStatus::getModificationTime)); +// sort them by modification time ascending. +List<FileStatus> sortedEligibleFiles = eligibleFiles.stream() + .sorted(Comparator.comparingLong(FileStatus::getModificationTime)).collect(Collectors.toList()); Review comment: OK, would you please add it into the ticket's description for the archiving purpose?
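The diff under review swaps an in-place `List.sort` for a stream-based sort because, per the ticket, the list returned by the engine context does not support element mutation. A small self-contained sketch of both the failure mode and the workaround follows; the unmodifiable list of longs is a hypothetical stand-in for the `FileStatus` list that `HoodieSparkEngineContext#flatMap` returns.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class SortDemo {
    public static void main(String[] args) {
        // stand-in for the eligible files' modification times
        List<Long> times = Collections.unmodifiableList(Arrays.asList(3L, 1L, 2L));
        try {
            // in-place sort mutates the backing list, which rejects writes
            times.sort(Comparator.naturalOrder());
        } catch (UnsupportedOperationException e) {
            System.out.println("in-place sort failed: " + e.getClass().getSimpleName());
        }
        // the fix: sort into a fresh list via a stream, leaving the original untouched
        List<Long> sorted = times.stream().sorted().collect(Collectors.toList());
        System.out.println(sorted); // [1, 2, 3]
    }
}
```

Collecting into a new `ArrayList` via `Collectors.toList()` sidesteps the mutation entirely, which is why the patched code assigns the result to a new `sortedEligibleFiles` variable instead of reusing `eligibleFiles`.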
[GitHub] [hudi] root18039532923 commented on issue #2614: Caused by: com.fasterxml.jackson.core.JsonParseException: Unrecognized token
root18039532923 commented on issue #2614: URL: https://github.com/apache/hudi/issues/2614#issuecomment-791105284 I need the jar of the patch which you opened in order to test, but I am on an internal network. @satishkotha
[jira] [Updated] (HUDI-1662) Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
[ https://issues.apache.org/jira/browse/HUDI-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1662:

Description:

step 1: prepare a raw DataFrame with a DateType column and insert it into the Hudi MOR table

```
df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
```

step 2: prepare an update DataFrame with a DateType column and upsert it into the table

```
df_update = sql("select * from huditest.bulkinsert_mor_10g_rt").withColumn("date", lit(Date.valueOf("2020-11-11")))
merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
```

step 3: query the real-time (_rt) view with hive-beeline / spark-sql:

```
select * from huditest.bulkinsert_mor_10g_rt where primary_key = 1000;
```

The following error occurs:

_java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.hive.serde2.io.DateWritableV2_

Root cause analysis: Hudi stores log files in the Avro format, and Avro stores DateType as INT (the type is INT, but the logicalType is "date"). When Hudi reads a log file and converts an Avro INT record to a Writable, the logicalType is not respected, so the date value is cast to an IntWritable. See: [https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java#L169]

Modification plan: when casting an Avro INT to a Writable, the logicalType must be considered:

```java
case INT:
  if (schema.getLogicalType() != null && schema.getLogicalType().getName().equals("date")) {
    return new DateWritable((Integer) value);
  } else {
    return new IntWritable((Integer) value);
  }
```

was: (the previous revision of the description, which compared the logical type name with `==` instead of `.equals`)

> Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
> ----
> Key: HUDI-1662
> URL: https://issues.apache.org/jira/browse/HUDI-1662
> Project: Apache Hudi
> Issue Type: Bug
> Components: Hive Integration
> Affects Versions: 0.7.0
> Environment: hive 3.1.1, spark 2.4.5, hadoop 3.1.1, SUSE OS
> Reporter: tao meng
> Priority: Major
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> (description as above)
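As background for the root-cause analysis above: Avro's `date` logical type encodes a value as an `int` counting days since the Unix epoch (1970-01-01), which is also the days-since-epoch representation that Hive's `DateWritable` wraps. A minimal, dependency-free Java sketch of that decoding (the epoch-day value below is an illustrative assumption, chosen to match the `2020-11-10` date from the reproduction steps; no Hudi or Avro classes are used):

```java
import java.time.LocalDate;

public class AvroDateLogicalTypeDemo {
    public static void main(String[] args) {
        // Avro serializes a date-typed field as days since 1970-01-01.
        int rawAvroInt = 18576;

        // Treating the raw int as a plain integer (what an IntWritable exposes)
        // loses the date meaning; decoding it as an epoch day recovers it.
        LocalDate decoded = LocalDate.ofEpochDay(rawAvroInt);

        System.out.println(rawAvroInt + " -> " + decoded); // prints "18576 -> 2020-11-10"
    }
}
```

This is why the proposed fix must branch on the logical type: the same `int` is either a count or a calendar date depending on metadata that lives in the schema, not in the value.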
[jira] [Updated] (HUDI-1662) Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
[ https://issues.apache.org/jira/browse/HUDI-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1662:

Description:

step 1: prepare a raw DataFrame with a DateType column and insert it into the Hudi MOR table

```
df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
```

step 2: prepare an update DataFrame with a DateType column and upsert it into the table

```
df_update = sql("select * from huditest.bulkinsert_mor_10g_rt").withColumn("date", lit(Date.valueOf("2020-11-11")))
merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
```

step 3: query the real-time (_rt) view with hive-beeline / spark-sql:

```
select * from huditest.bulkinsert_mor_10g_rt where primary_key = 1000;
```

The following error occurs:

_java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.hive.serde2.io.DateWritableV2_

Root cause analysis: Hudi stores log files in the Avro format, and Avro stores DateType as INT (the type is INT, but the logicalType is "date"). When Hudi reads a log file and converts an Avro INT record to a Writable, the logicalType is not respected, so the date value is cast to an IntWritable. See: [https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java#L169]

Modification plan: when casting an Avro INT to a Writable, the logicalType must be considered:

```java
if (schema.getLogicalType() != null && schema.getLogicalType().getName() == "date") {
  return new DateWritable((Integer) value);
} else {
  return new IntWritable((Integer) value);
}
```

was: (the previous revision of the description contained only the reproduction steps and the attached screenshot !image-2021-03-05-10-06-11-949.png!)

> Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
> ----
> Key: HUDI-1662
> URL: https://issues.apache.org/jira/browse/HUDI-1662
> Project: Apache Hudi
> Issue Type: Bug
> Components: Hive Integration
> Affects Versions: 0.7.0
> Environment: hive 3.1.1, spark 2.4.5, hadoop 3.1.1, SUSE OS
> Reporter: tao meng
> Priority: Major
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> (description as above)

-- This message was sent by Atlassian Jira (v8.3.4#803005)
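One detail worth flagging in the modification plan above: this revision compares the logical type name with `==` (a later revision of the ticket switched to `.equals`). In Java, `==` on `String` compares object references, not contents, so it can return false for a runtime-constructed name such as the one returned by `getLogicalType().getName()`. A dependency-free illustration (the `StringBuilder` construction just stands in for any non-interned string):

```java
public class StringCompareDemo {
    public static void main(String[] args) {
        // A name built at runtime is a distinct object from the literal "date",
        // even though the character contents are equal.
        String logicalTypeName = new StringBuilder("da").append("te").toString();

        System.out.println(logicalTypeName == "date");      // prints "false": compares references
        System.out.println(logicalTypeName.equals("date")); // prints "true": compares contents
    }
}
```

With `==`, the date branch could silently never be taken and the original ClassCastException would persist, which is presumably why the description was later updated to use `.equals("date")`.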
[jira] [Closed] (HUDI-1664) Streaming read for Flink MOR table
[ https://issues.apache.org/jira/browse/HUDI-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-1664. Resolution: Duplicate

> Streaming read for Flink MOR table
> ----
> Key: HUDI-1664
> URL: https://issues.apache.org/jira/browse/HUDI-1664
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Flink Integration
> Reporter: Danny Chen
> Assignee: Danny Chen
> Priority: Major
> Fix For: 0.8.0
>
> Supports reading Flink MOR tables as a stream. The writer writes Avro logs, and the reader monitors the latest commits in the timeline and assigns split reading tasks for the incremental logs.
[jira] [Created] (HUDI-1663) Streaming read for Flink MOR table
Danny Chen created HUDI-1663:

Summary: Streaming read for Flink MOR table
Key: HUDI-1663
URL: https://issues.apache.org/jira/browse/HUDI-1663
Project: Apache Hudi
Issue Type: Sub-task
Components: Flink Integration
Reporter: Danny Chen
Assignee: Danny Chen
Fix For: 0.8.0

Supports reading Flink MOR tables as a stream. The writer writes Avro logs, and the reader monitors the latest commits in the timeline and assigns split reading tasks for the incremental logs.
[jira] [Created] (HUDI-1664) Streaming read for Flink MOR table
Danny Chen created HUDI-1664:

Summary: Streaming read for Flink MOR table
Key: HUDI-1664
URL: https://issues.apache.org/jira/browse/HUDI-1664
Project: Apache Hudi
Issue Type: Sub-task
Components: Flink Integration
Reporter: Danny Chen
Assignee: Danny Chen
Fix For: 0.8.0

Supports reading Flink MOR tables as a stream. The writer writes Avro logs, and the reader monitors the latest commits in the timeline and assigns split reading tasks for the incremental logs.
[jira] [Updated] (HUDI-1662) Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
[ https://issues.apache.org/jira/browse/HUDI-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1662:

Description:

step 1: prepare a raw DataFrame with a DateType column and insert it into the Hudi MOR table

```
df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
```

step 2: prepare an update DataFrame with a DateType column and upsert it into the table

```
df_update = sql("select * from huditest.bulkinsert_mor_10g_rt").withColumn("date", lit(Date.valueOf("2020-11-11")))
merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
```

step 3: query the mor _rt table with hive-beeline / spark-sql

!image-2021-03-05-10-06-11-949.png!

was: (the previous revision was identical except that the screenshot was a broken local `file:///` image link)

> Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
> ----
> Key: HUDI-1662
> URL: https://issues.apache.org/jira/browse/HUDI-1662
> Project: Apache Hudi
> Issue Type: Bug
> Components: Hive Integration
> Affects Versions: 0.7.0
> Environment: hive 3.1.1, spark 2.4.5, hadoop 3.1.1, SUSE OS
> Reporter: tao meng
> Priority: Major
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> (description as above)
[jira] [Created] (HUDI-1662) Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
tao meng created HUDI-1662:

Summary: Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
Key: HUDI-1662
URL: https://issues.apache.org/jira/browse/HUDI-1662
Project: Apache Hudi
Issue Type: Bug
Components: Hive Integration
Affects Versions: 0.7.0
Environment: hive 3.1.1, spark 2.4.5, hadoop 3.1.1, SUSE OS
Reporter: tao meng

step 1: prepare a raw DataFrame with a DateType column and insert it into the Hudi MOR table

```
df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
```

step 2: prepare an update DataFrame with a DateType column and upsert it into the table

```
df_update = sql("select * from huditest.bulkinsert_mor_10g_rt").withColumn("date", lit(Date.valueOf("2020-11-11")))
merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
```

step 3: query the mor _rt table with hive-beeline / spark-sql

(screenshot attachment: broken local `file:///` image link in the original message)
[hudi] branch master updated (89003bc -> 7cc75e0)
This is an automated email from the ASF dual-hosted git repository. nagarwal pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git.

from 89003bc [HUDI-1647] Supports snapshot read for Flink (#2613)
add  7cc75e0 [HUDI-1646] Provide mechanism to read uncommitted data through InputFormat (#2611)

No new revisions were added by this update.

Summary of changes:
```
 .../table/timeline/HoodieDefaultTimeline.java      |  7
 .../hudi/common/table/timeline/HoodieTimeline.java |  5 +++
 .../common/table/view/FileSystemViewManager.java   | 17 ++---
 .../apache/hudi/hadoop/utils/HoodieHiveUtils.java  | 22
 .../hudi/hadoop/utils/HoodieInputFormatUtils.java  |  3 +-
 .../hudi/hadoop/TestHoodieParquetInputFormat.java  | 41 ++
 .../hudi/hadoop/testutils/InputFormatTestUtil.java | 20 +++
 7 files changed, 109 insertions(+), 6 deletions(-)
```
[GitHub] [hudi] n3nash merged pull request #2611: [HUDI-1646] Provide mechanism to read uncommitted data through InputFormat
n3nash merged pull request #2611: URL: https://github.com/apache/hudi/pull/2611
[GitHub] [hudi] xushiyan commented on a change in pull request #2621: [HUDI-1655] Allow custom date format in DatePartitionPathSelector
xushiyan commented on a change in pull request #2621: URL: https://github.com/apache/hudi/pull/2621#discussion_r587949226

File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DatePartitionPathSelector.java

```diff
@@ -130,20 +140,19 @@ public DatePartitionPathSelector(TypedProperties props, Configuration hadoopConf
       FileSystem fs = new Path(path).getFileSystem(serializedConf.get());
       return listEligibleFiles(fs, new Path(path), lastCheckpointTime).stream();
     }, partitionsListParallelism);
-    // sort them by modification time.
-    eligibleFiles.sort(Comparator.comparingLong(FileStatus::getModificationTime));
+    // sort them by modification time ascending.
+    List<FileStatus> sortedEligibleFiles = eligibleFiles.stream()
+        .sorted(Comparator.comparingLong(FileStatus::getModificationTime)).collect(Collectors.toList());
```

Review comment: @yanghua yes, it resulted in this error:

```
java.lang.UnsupportedOperationException
	at java.util.AbstractList.set(AbstractList.java:132)
	at java.util.AbstractList$ListItr.set(AbstractList.java:426)
	at java.util.List.sort(List.java:482)
	at org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.getNextFilePathsAndMaxModificationTime(DatePartitionPathSelector.java:141)
	at org.apache.hudi.utilities.sources.ParquetDFSSource.fetchNextBatch(ParquetDFSSource.java:48)
	at org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43)
	at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:75)
	at org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:94)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:338)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:255)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:587)
```

`org.apache.hudi.client.common.HoodieSparkEngineContext#flatMap` returns a list that can't be sorted in-place.
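The failure in that trace is easy to reproduce without Hudi: `List.sort` mutates the list in place through `ListIterator.set`, which a list that rejects element mutation answers with `UnsupportedOperationException` — exactly the `AbstractList.set` frame above. Sorting a freshly collected copy, as the patch does, sidesteps the mutation. A stdlib-only sketch (the `Long` list is a stand-in for the `FileStatus` modification times; `HoodieSparkEngineContext#flatMap` itself is not modeled):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class SortCopyDemo {
    public static void main(String[] args) {
        // Stand-in for the engine-context result: a list that rejects mutation.
        List<Long> modTimes = Collections.unmodifiableList(Arrays.asList(30L, 10L, 20L));

        try {
            // In-place sort needs element mutation and therefore fails here.
            modTimes.sort(Comparator.naturalOrder());
        } catch (UnsupportedOperationException e) {
            System.out.println("in-place sort failed: " + e.getClass().getSimpleName());
        }

        // The approach taken in PR #2621: sort a stream-collected copy instead.
        List<Long> sorted = modTimes.stream().sorted().collect(Collectors.toList());
        System.out.println(sorted); // prints "[10, 20, 30]"
    }
}
```

The copy costs one extra pass over the list, which is negligible next to the file-listing work that produced it, and it makes no assumptions about the mutability of whatever list the engine context hands back.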
[jira] [Closed] (HUDI-1647) Supports snapshot read for Flink
[ https://issues.apache.org/jira/browse/HUDI-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang closed HUDI-1647. Resolution: Implemented. Implemented via master branch: 89003bc7801b035b5be31c76bfbf691bfcf9081a

> Supports snapshot read for Flink
> ----
> Key: HUDI-1647
> URL: https://issues.apache.org/jira/browse/HUDI-1647
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Flink Integration
> Reporter: Danny Chen
> Assignee: Danny Chen
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.8.0
>
> Support snapshot read for Flink for both MOR and COW tables:
> - COW: the parquet files for the latest file group slices
> - MOR: the parquet base file + log files for the latest file group slices
> Also implements the SQL connectors for both sink and source.
[jira] [Updated] (HUDI-1647) Supports snapshot read for Flink
[ https://issues.apache.org/jira/browse/HUDI-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang updated HUDI-1647: Fix Version/s: 0.8.0

> Supports snapshot read for Flink
> ----
> Key: HUDI-1647
> URL: https://issues.apache.org/jira/browse/HUDI-1647
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Flink Integration
> Reporter: Danny Chen
> Assignee: Danny Chen
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.8.0
>
> Support snapshot read for Flink for both MOR and COW tables:
> - COW: the parquet files for the latest file group slices
> - MOR: the parquet base file + log files for the latest file group slices
> Also implements the SQL connectors for both sink and source.
[GitHub] [hudi] yanghua merged pull request #2613: [HUDI-1647] Supports snapshot read for Flink
yanghua merged pull request #2613: URL: https://github.com/apache/hudi/pull/2613
[hudi] branch master updated (899ae70 -> 89003bc)
This is an automated email from the ASF dual-hosted git repository. vinoyang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git.

from 899ae70 [HUDI-1587] Add latency and freshness support (#2541)
add  89003bc [HUDI-1647] Supports snapshot read for Flink (#2613)

No new revisions were added by this update.

Summary of changes:
```
 hudi-flink/pom.xml                                 |  41 +-
 .../apache/hudi/factory/HoodieTableFactory.java    |  84
 .../org/apache/hudi/operator/FlinkOptions.java     |  67 ++-
 .../apache/hudi/operator/StreamWriteFunction.java  |   8 +
 .../apache/hudi/operator/StreamWriteOperator.java  |   8 +-
 .../operator/StreamWriteOperatorCoordinator.java   |  34 +-
 .../hudi/operator/StreamWriteOperatorFactory.java  |  10 +-
 .../operator/partitioner/BucketAssignFunction.java |   1 -
 .../java/org/apache/hudi/sink/HoodieTableSink.java | 126 +
 .../org/apache/hudi/source/HoodieTableSource.java  | 411 +
 .../apache/hudi/source/format/FilePathUtils.java   | 320 +
 .../org/apache/hudi/source/format/FormatUtils.java |  98
 .../source/format/cow/AbstractColumnReader.java    | 324 +
 .../source/format/cow/CopyOnWriteInputFormat.java  | 134 ++
 .../format/cow/Int64TimestampColumnReader.java     |  99
 .../format/cow/ParquetColumnarRowSplitReader.java  | 370 +++
 .../source/format/cow/ParquetDecimalVector.java    |  69 +++
 .../source/format/cow/ParquetSplitReaderUtil.java  | 398
 .../hudi/source/format/cow/RunLengthDecoder.java   | 304
 .../source/format/mor/MergeOnReadInputFormat.java  | 513 +
 .../source/format/mor/MergeOnReadInputSplit.java   |  88
 .../source/format/mor/MergeOnReadTableState.java   |  79
 .../org/apache/hudi/util/AvroSchemaConverter.java  | 180 +++-
 .../apache/hudi/util/AvroToRowDataConverters.java  | 316 +
 .../java/org/apache/hudi/util/StreamerUtil.java    |  30 ++
 .../org.apache.flink.table.factories.TableFactory  |   5 +-
 .../apache/hudi/operator/StreamWriteITCase.java    |  43 +-
 .../StreamWriteOperatorCoordinatorTest.java        |   2 +-
 .../operator/utils/StreamWriteFunctionWrapper.java |   2 +-
 .../hudi/operator/utils/TestConfigurations.java    |  52 ++-
 .../org/apache/hudi/operator/utils/TestData.java   |  66 ++-
 .../apache/hudi/source/HoodieDataSourceITCase.java | 162 +++
 .../apache/hudi/source/HoodieTableSourceTest.java  | 122 +
 .../apache/hudi/source/format/InputFormatTest.java | 197
 .../utils/factory/ContinuousFileSourceFactory.java |  62 +++
 .../hudi/utils/source/ContinuousFileSource.java    | 173 +++
 .../org.apache.flink.table.factories.TableFactory  |   7 +-
 style/checkstyle-suppressions.xml                  |   1 +
 38 files changed, 4920 insertions(+), 86 deletions(-)
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/factory/HoodieTableFactory.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/sink/HoodieTableSink.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/HoodieTableSource.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/FilePathUtils.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/FormatUtils.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/cow/AbstractColumnReader.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/cow/CopyOnWriteInputFormat.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/cow/Int64TimestampColumnReader.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/cow/ParquetColumnarRowSplitReader.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/cow/ParquetDecimalVector.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/cow/ParquetSplitReaderUtil.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/cow/RunLengthDecoder.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/mor/MergeOnReadInputFormat.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/mor/MergeOnReadInputSplit.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/mor/MergeOnReadTableState.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/util/AvroToRowDataConverters.java
 copy hudi-integ-test/src/test/resources/hoodie-docker.properties => hudi-flink/src/main/resources/META-INF/services/org.apache.flink.table.factories.TableFactory (94%)
 create mode 100644 hudi-flink/src/test/java/org/apache/hudi/source/HoodieDataSourceITCase.java
 create mode 100644 hudi-flink/src/test/java/org/apache/hudi/source/HoodieTableSourceTest.java
 create mode 100644
```
[hudi] branch asf-site updated: Travis CI build asf-site
This is an automated email from the ASF dual-hosted git repository. vinoth pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push: new a2efa69 Travis CI build asf-site

a2efa69 is described below

commit a2efa690d6b6847a7375caa50e9f96729455e539
Author: CI
AuthorDate: Fri Mar 5 00:39:50 2021 +

    Travis CI build asf-site

---
 content/cn/docs/docker_demo.html | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

```diff
diff --git a/content/cn/docs/docker_demo.html b/content/cn/docs/docker_demo.html
index 05f3395..60668ea 100644
--- a/content/cn/docs/docker_demo.html
+++ b/content/cn/docs/docker_demo.html
@@ -574,13 +574,13 @@
(remainder of the generated-HTML diff was truncated in the original message)
```
[hudi] branch asf-site updated: [DOCS] Update 0_4_docker_demo.cn.md (#2629)
This is an automated email from the ASF dual-hosted git repository. vinoth pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push: new 2619964 [DOCS] Update 0_4_docker_demo.cn.md (#2629)

2619964 is described below

commit 2619964b07abc78f345247d634bc729eb3880f4d
Author: Wei
AuthorDate: Fri Mar 5 08:37:37 2021 +0800

    [DOCS] Update 0_4_docker_demo.cn.md (#2629)

    Fix docs hive_sync path

---
 docs/_docs/0_4_docker_demo.cn.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

```diff
diff --git a/docs/_docs/0_4_docker_demo.cn.md b/docs/_docs/0_4_docker_demo.cn.md
index d0005b2..63bf33e 100644
--- a/docs/_docs/0_4_docker_demo.cn.md
+++ b/docs/_docs/0_4_docker_demo.cn.md
@@ -192,13 +192,13 @@ inorder to run Hive queries against those datasets.
 docker exec -it adhoc-2 /bin/bash

 # THis command takes in HIveServer URL and COW Hudi Dataset location in HDFS and sync the HDFS state to Hive
-/var/hoodie/ws/hudi-hive-sync/run_sync_tool.sh --jdbc-url jdbc:hive2://hiveserver:1 --user hive --pass hive --partitioned-by dt --base-path /user/hive/warehouse/stock_ticks_cow --database default --table stock_ticks_cow
+/var/hoodie/ws/hudi-sync/hudi-hive-sync/run_sync_tool.sh --jdbc-url jdbc:hive2://hiveserver:1 --user hive --pass hive --partitioned-by dt --base-path /user/hive/warehouse/stock_ticks_cow --database default --table stock_ticks_cow
 .
 2018-09-24 22:22:45,568 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_cow
 .
 # Now run hive-sync for the second data-set in HDFS using Merge-On-Read (MOR storage)
-/var/hoodie/ws/hudi-hive-sync/run_sync_tool.sh --jdbc-url jdbc:hive2://hiveserver:1 --user hive --pass hive --partitioned-by dt --base-path /user/hive/warehouse/stock_ticks_mor --database default --table stock_ticks_mor
+/var/hoodie/ws/hudi-sync/hudi-hive-sync/run_sync_tool.sh --jdbc-url jdbc:hive2://hiveserver:1 --user hive --pass hive --partitioned-by dt --base-path /user/hive/warehouse/stock_ticks_mor --database default --table stock_ticks_mor
 ...
 2018-09-24 22:23:09,171 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_mor
 ...
```
[GitHub] [hudi] vinothchandar merged pull request #2629: [DOCS] Fix docs hive_sync path
vinothchandar merged pull request #2629: URL: https://github.com/apache/hudi/pull/2629
[GitHub] [hudi] bvaradar commented on issue #2620: [SUPPORT] Performance Tuning: Slow stages (Building Workload Profile & Getting Small files from partitions) during Hudi Writes
bvaradar commented on issue #2620: URL: https://github.com/apache/hudi/issues/2620#issuecomment-790975178 @nsivabalan is looking into this.
[GitHub] [hudi] umehrot2 edited a comment on issue #2592: [SUPPORT] Does latest versions of Hudi (0.7.0, 0.6.0) work with Spark 2.3.0 when reading orc files?
umehrot2 edited a comment on issue #2592: URL: https://github.com/apache/hudi/issues/2592#issuecomment-790955939 @codejoyan I think the problem stems from using `org.apache.spark:spark-avro_2.11:2.4.4` in your packages with spark-submit. This is incompatible with `spark-sql 2.3.0`, which is why you get that `NoSuchMethod` exception. What happens if you use the `databricks-avro` package instead of `spark-avro`: https://mvnrepository.com/artifact/com.databricks/spark-avro ? Can you give that a shot? Also, @bvaradar is right. When you build Hudi to use the databricks-avro package, please build it with `-Pspark-shade-unbundle-avro` to avoid bundling/shading the spark-avro module in hudi-spark-bundle.
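For illustration, a rebuild along the lines described might look like the command below. Only the `-Pspark-shade-unbundle-avro` profile comes from the comment; the module selection and other flags are assumptions, not a verified recipe:

```sh
# Rebuild the Hudi Spark bundle without bundling/shading spark-avro,
# so the externally supplied avro package is used at runtime.
# (module path and -DskipTests are illustrative assumptions)
mvn clean package -DskipTests -Pspark-shade-unbundle-avro \
    -pl packaging/hudi-spark-bundle -am
```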
[GitHub] [hudi] umehrot2 commented on issue #2592: [SUPPORT] Does latest versions of Hudi (0.7.0, 0.6.0) work with Spark 2.3.0 when reading orc files?
umehrot2 commented on issue #2592: URL: https://github.com/apache/hudi/issues/2592#issuecomment-790955939 @codejoyan I think the problem stems from using `org.apache.spark:spark-avro_2.11:2.4.4` in your packages with spark-submit. This is incompatible with `spark-sql 2.3.0`, which is why you get that `NoSuchMethod` exception. Bundling/shading this package will not solve the problem either, because it is simply incompatible. What happens if you use the `databricks-avro` package instead of `spark-avro`: https://mvnrepository.com/artifact/com.databricks/spark-avro ? Can you give that a shot?
[GitHub] [hudi] satishkotha commented on issue #2589: [SUPPORT] Issue with adding column while running deltastreamer with kafka source.
satishkotha commented on issue #2589: URL: https://github.com/apache/hudi/issues/2589#issuecomment-790942436 Yeah, column deletions are not supported today. You can consider making all columns optional and continuing to write null for the fields that you want to delete. @bvaradar could you take a look at the MOR table log-file schema issue?
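In Avro terms, the suggested workaround amounts to declaring each to-be-deleted column as a nullable union with a `null` default, so writers can keep emitting `null` for it without a schema conflict. A minimal sketch (record and field names are invented for illustration):

```json
{
  "type": "record",
  "name": "ExampleRecord",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "column_to_retire", "type": ["null", "string"], "default": null}
  ]
}
```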
[GitHub] [hudi] codecov-io edited a comment on pull request #2632: [HUDI-1661] Exclude clustering commits from TimelineUtils API
codecov-io edited a comment on pull request #2632: URL: https://github.com/apache/hudi/pull/2632#issuecomment-790858581 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2632?src=pr=h1) Report > Merging [#2632](https://codecov.io/gh/apache/hudi/pull/2632?src=pr=desc) (52706ea) into [master](https://codecov.io/gh/apache/hudi/commit/899ae70fdb70c1511c099a64230fd91b2fe8d4ee?el=desc) (899ae70) will **increase** coverage by `0.03%`. > The diff coverage is `86.66%`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2632/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2632?src=pr=tree)

```diff
@@            Coverage Diff             @@
##           master    #2632      +/-  ##
==========================================
+ Coverage   51.58%   51.62%   +0.03%
- Complexity   3285     3293       +8
==========================================
  Files         446      446
  Lines       20409    20423      +14
  Branches     2116     2116
==========================================
+ Hits        10528    10543      +15
  Misses       9003     9003
+ Partials      878      877       -1
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `36.87% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `51.46% <86.66%> (+0.04%)` | `0.00 <9.00> (ø)` | |
| hudiflink | `51.29% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudihadoopmr | `33.16% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudisparkdatasource | `69.67% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudisync | `49.62% <ø> (ø)` | `0.00 <ø> (ø)` | |
| huditimelineservice | `64.36% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiutilities | `69.59% <ø> (+0.15%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2632?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...ache/hudi/common/table/timeline/TimelineUtils.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL1RpbWVsaW5lVXRpbHMuamF2YQ==) | `62.71% <86.66%> (+7.15%)` | `21.00 <9.00> (+7.00)` | |
| [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `71.07% <0.00%> (+0.35%)` | `53.00% <0.00%> (+1.00%)` | |
| [...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=) | `65.69% <0.00%> (+1.16%)` | `33.00% <0.00%> (ø%)` | |
[GitHub] [hudi] satishkotha commented on issue #2614: Caused by: com.fasterxml.jackson.core.JsonParseException: Unrecognized token
satishkotha commented on issue #2614: URL: https://github.com/apache/hudi/issues/2614#issuecomment-790940752 @root18039532923 Did you get a chance to try this patch? We can merge the patch once your testing looks good.
[GitHub] [hudi] satishkotha commented on a change in pull request #2627: [HUDI-1653] Add support for composite keys in NonpartitionedKeyGenerator
satishkotha commented on a change in pull request #2627: URL: https://github.com/apache/hudi/pull/2627#discussion_r587802720 ## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java ## @@ -20,20 +20,32 @@ import org.apache.avro.generic.GenericRecord; import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.keygen.constant.KeyGeneratorOptions; import org.apache.spark.sql.Row; +import java.util.Arrays; +import java.util.Collections; import java.util.List; +import java.util.stream.Collectors; /** * Simple Key generator for unpartitioned Hive Tables. */ -public class NonpartitionedKeyGenerator extends SimpleKeyGenerator { +public class NonpartitionedKeyGenerator extends BuiltinKeyGenerator { private final NonpartitionedAvroKeyGenerator nonpartitionedAvroKeyGenerator; - public NonpartitionedKeyGenerator(TypedProperties config) { -super(config); -nonpartitionedAvroKeyGenerator = new NonpartitionedAvroKeyGenerator(config); + public NonpartitionedKeyGenerator(TypedProperties props) { +super(props); +this.recordKeyFields = Arrays.stream(props.getString(KeyGeneratorOptions.RECORDKEY_FIELD_OPT_KEY) Review comment: Do you need these initialization here? Can we override getPartitionPathFields and getRecordKeyFields and call nonPartitionedAvroKeyGenerator instead to avoid code repetition? ## File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/keygen/TestNonpartitionedKeyGenerator.java ## @@ -0,0 +1,137 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.keygen; + +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.HoodieKey; +import org.apache.hudi.common.testutils.HoodieTestDataGenerator; +import org.apache.hudi.exception.HoodieKeyException; + +import org.apache.avro.generic.GenericRecord; +import org.apache.hudi.keygen.constant.KeyGeneratorOptions; +import org.apache.hudi.testutils.KeyGeneratorTestUtilities; +import org.apache.spark.sql.Row; +import org.junit.jupiter.api.Assertions; +import org.junit.jupiter.api.Test; + +import static junit.framework.TestCase.assertEquals; + +public class TestNonpartitionedKeyGenerator extends KeyGeneratorTestUtilities { + + private TypedProperties getCommonProps(boolean getComplexRecordKey) { +TypedProperties properties = new TypedProperties(); +if (getComplexRecordKey) { + properties.put(KeyGeneratorOptions.RECORDKEY_FIELD_OPT_KEY, "_row_key, pii_col"); +} else { + properties.put(KeyGeneratorOptions.RECORDKEY_FIELD_OPT_KEY, "_row_key"); +} +properties.put(KeyGeneratorOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true"); +return properties; + } + + private TypedProperties getPropertiesWithoutPartitionPathProp() { +return getCommonProps(false); + } + + private TypedProperties getPropertiesWithPartitionPathProp() { +TypedProperties properties = getCommonProps(true); +properties.put(KeyGeneratorOptions.PARTITIONPATH_FIELD_OPT_KEY, "timestamp,ts_ms"); +return properties; + } + + private TypedProperties getPropertiesWithoutRecordKeyProp() { +TypedProperties properties = new TypedProperties(); 
+properties.put(KeyGeneratorOptions.PARTITIONPATH_FIELD_OPT_KEY, "timestamp"); +return properties; + } + + private TypedProperties getWrongRecordKeyFieldProps() { +TypedProperties properties = new TypedProperties(); +properties.put(KeyGeneratorOptions.PARTITIONPATH_FIELD_OPT_KEY, "timestamp"); +properties.put(KeyGeneratorOptions.RECORDKEY_FIELD_OPT_KEY, "_wrong_key"); +return properties; + } + + @Test + public void testNullRecordKeyFields() { +Assertions.assertThrows(IllegalArgumentException.class, () -> new NonpartitionedKeyGenerator(getPropertiesWithoutRecordKeyProp())); + } + + @Test + public void testNullPartitionPathFields() { +TypedProperties properties = getPropertiesWithoutPartitionPathProp(); +NonpartitionedKeyGenerator keyGenerator = new NonpartitionedKeyGenerator(properties); +GenericRecord record = getRecord(); +Row row = KeyGeneratorTestUtilities.getRow(record);
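The de-duplication the reviewer suggests above — overriding the field getters to delegate to the Avro key generator instead of re-parsing configuration — can be sketched roughly as follows. The classes here are simplified stand-ins with invented shapes, not the real Hudi types:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for NonpartitionedAvroKeyGenerator: owns the parsed
// record-key fields; a non-partitioned table has no partition-path fields.
class NonpartitionedAvroKeyGeneratorSketch {
    private final List<String> recordKeyFields;

    NonpartitionedAvroKeyGeneratorSketch(List<String> recordKeyFields) {
        this.recordKeyFields = recordKeyFields;
    }

    List<String> getRecordKeyFields() {
        return recordKeyFields;
    }

    List<String> getPartitionPathFields() {
        return Collections.emptyList(); // non-partitioned: no partition fields
    }
}

// Hypothetical stand-in for the Spark-side NonpartitionedKeyGenerator:
// forwards field lookups instead of duplicating the parsing logic.
class NonpartitionedKeyGeneratorSketch {
    private final NonpartitionedAvroKeyGeneratorSketch avroKeyGenerator;

    NonpartitionedKeyGeneratorSketch(NonpartitionedAvroKeyGeneratorSketch avroKeyGenerator) {
        this.avroKeyGenerator = avroKeyGenerator;
    }

    List<String> getRecordKeyFields() {
        return avroKeyGenerator.getRecordKeyFields();
    }

    List<String> getPartitionPathFields() {
        return avroKeyGenerator.getPartitionPathFields();
    }
}
```

With this shape, the record-key parsing lives in exactly one place, which is the point of the review comment.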
[GitHub] [hudi] vinothchandar edited a comment on issue #2633: Empty File Slice causing application to fail in small files optimization code
vinothchandar edited a comment on issue #2633: URL: https://github.com/apache/hudi/issues/2633#issuecomment-790926531 @n3nash so this seems to corroborate udit's finding then. cc @bvaradar as well, who can comment on the suggested fix.
[GitHub] [hudi] vinothchandar commented on issue #2633: Empty File Slice causing application to fail in small files optimization code
vinothchandar commented on issue #2633: URL: https://github.com/apache/hudi/issues/2633#issuecomment-790926531 cc @bvaradar as well, who can confirm.
[GitHub] [hudi] n3nash commented on issue #2633: Empty File Slice causing application to fail in small files optimization code
n3nash commented on issue #2633: URL: https://github.com/apache/hudi/issues/2633#issuecomment-790917128 @umehrot2 From what you are describing, it looks like a bug. When we configure HbaseIndex, we automatically assume

```
public boolean canIndexLogFiles() {
  return true;
}
```

and inserts go into log files. But this part of the code -> https://github.com/apache/hudi/blob/release-0.6.0/hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java#L190 assumes that either a) inserts first go into parquet files, in which case there will be a base file, or b) a delta file was already written earlier and needs to be re-sized. Ideally, this code should create a new log file with the new base instant time and send inserts there.
[GitHub] [hudi] modi95 commented on pull request #2628: [HUDI-1635] Improvements to Hudi Test Suite
modi95 commented on pull request #2628: URL: https://github.com/apache/hudi/pull/2628#issuecomment-790916909 @nbalajee these changes look good to me. Can you try doing the following:
- Run the schema evolution test suite in the docker setup (a step-by-step guide is available in the test-suite readme)
- Add instructions to the test-suite readme on how to use schema evolution
[GitHub] [hudi] umehrot2 commented on issue #2633: Empty File Slice causing application to fail in small files optimization code
umehrot2 commented on issue #2633: URL: https://github.com/apache/hudi/issues/2633#issuecomment-790891245 @bvaradar can you confirm that the finding is correct, since it seems you worked on that file system view implementation? Also cc @n3nash @vinothchandar
[GitHub] [hudi] umehrot2 opened a new issue #2633: Empty File Slice causing application to fail in small files optimization code
umehrot2 opened a new issue #2633: URL: https://github.com/apache/hudi/issues/2633 **Describe the problem you faced** IHAC who is using Hudi's `Spark structured streaming sink` with `asynchronous compaction` and `Hbase Index` on EMR. The Hudi version being used is 0.6.0. After a while their job fails with the following error:

```
java.util.NoSuchElementException: No value present
  at java.util.Optional.get(Optional.java:135)
  at org.apache.hudi.table.action.deltacommit.UpsertDeltaCommitPartitioner.getSmallFiles(UpsertDeltaCommitPartitioner.java:103)
  at org.apache.hudi.table.action.commit.UpsertPartitioner.lambda$getSmallFilesForPartitions$d2bd4b49$1(UpsertPartitioner.java:216)
  at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1043)
  at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1043)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
  at scala.collection.AbstractIterator.to(Iterator.scala:1334)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
  at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$15.apply(RDD.scala:990)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$15.apply(RDD.scala:990)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:123)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
```

Here is a timeline of events leading up to the failure of the final delta commit `20210228152758`: ![image](https://user-images.githubusercontent.com/8647012/110020866-ec103080-7cde-11eb-859a-f539d4217f6a.png) As an observation, you can see that there are two asynchronous compactions running at the same time before the delta commit failure. After discussions with Vinoth, I also verified that there are no common file groups between the two concurrent compactions, so nothing suspicious was found there. I followed the code path, and as we can see the error stems from here, where it is assumed that some log files will be present in the file slice: https://github.com/apache/hudi/blob/release-0.6.0/hudi-client/src/main/java/org/apache/hudi/table/action/deltacommit/UpsertDeltaCommitPartitioner.java#L103 Upon following this piece of code, I found this suspicious code: https://github.com/apache/hudi/blob/release-0.6.0/hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java#L190 It is generating an empty file slice if there is a pending compaction, without any base file or log file. This is bound to cause failure in the small files logic. Can you please confirm if my findings make sense to you?
Also from the comment: ``` // If there is no delta-commit after compaction request, this step would ensure a new file-slice appears // so that any new ingestion uses the correct base-instant ``` I only see us checking for pending compaction, but not for the presence of a delta commit after the compaction request. So is that code actually doing what it's intended to? As for our customer, I am suggesting for now to turn off the small files optimization feature by setting `HoodieStorageConfig.PARQUET_FILE_MAX_BYTES` to `0`. Another action item from this is that the small files optimization code needs a check to ignore a file slice if it is completely empty, instead of causing a failure, since an empty file slice can be expected. **Environment
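The failure mode and the proposed guard (ignore a completely empty file slice rather than fail) can be sketched as follows. `FileSliceSketch` and its fields are invented stand-ins for illustration, not Hudi's actual `FileSlice` API:

```java
import java.util.Optional;

// Sketch of the bug described in the issue: the small-files logic calls
// Optional.get() assuming a log file is always present, which throws
// NoSuchElementException on a slice with neither base nor log files.
class FileSliceSketch {
    final Optional<String> baseFile;
    final Optional<String> latestLogFile;

    FileSliceSketch(Optional<String> baseFile, Optional<String> latestLogFile) {
        this.baseFile = baseFile;
        this.latestLogFile = latestLogFile;
    }

    boolean isEmpty() {
        return !baseFile.isPresent() && !latestLogFile.isPresent();
    }

    // Failing pattern: assumes a log file is always present.
    String smallFileCandidateUnsafe() {
        return latestLogFile.get(); // throws NoSuchElementException when empty
    }

    // Suggested guard: skip completely empty slices instead of failing.
    Optional<String> smallFileCandidateSafe() {
        return isEmpty() ? Optional.empty() : latestLogFile;
    }
}
```

The defensive variant mirrors the issue's action item: an empty file slice is an expected state while a compaction is pending, so it should be skipped, not treated as an error.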
[GitHub] [hudi] vinothchandar commented on a change in pull request #2374: [HUDI-845] Added locking capability to allow multiple writers
vinothchandar commented on a change in pull request #2374: URL: https://github.com/apache/hudi/pull/2374#discussion_r587049910 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java ## @@ -30,16 +31,19 @@ import org.apache.hudi.callback.util.HoodieCommitCallbackFactory; import org.apache.hudi.client.embedded.EmbeddedTimelineService; import org.apache.hudi.client.heartbeat.HeartbeatUtils; +import org.apache.hudi.client.lock.TransactionManager; Review comment: `lock` as a package name feels off to me. Can we have `org.apache.hudi.client.transaction.TransactionManager`? Then `.lock` can be a sub package under i.e `.transaction.lock.` ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java ## @@ -30,16 +31,19 @@ import org.apache.hudi.callback.util.HoodieCommitCallbackFactory; import org.apache.hudi.client.embedded.EmbeddedTimelineService; import org.apache.hudi.client.heartbeat.HeartbeatUtils; +import org.apache.hudi.client.lock.TransactionManager; import org.apache.hudi.common.engine.HoodieEngineContext; import org.apache.hudi.common.model.HoodieCommitMetadata; import org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy; import org.apache.hudi.common.model.HoodieKey; import org.apache.hudi.common.model.HoodieRecordPayload; import org.apache.hudi.common.model.HoodieWriteStat; +import org.apache.hudi.common.model.TableService; Review comment: So, `model` package should just contain pojos i.e data structure objects. 
Lets move `TableService` elsewhere ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java ## @@ -210,6 +231,11 @@ void emitCommitMetrics(String instantTime, HoodieCommitMetadata metadata, String } } + protected void preCommit(String instantTime, HoodieCommitMetadata metadata) { +// no-op +// TODO : Conflict resolution is not support for Flink,Java engines Review comment: typo: not supported ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java ## @@ -797,7 +853,9 @@ public Boolean rollbackFailedWrites() { * Performs a compaction operation on a table, serially before or after an insert/upsert action. */ protected Option inlineCompact(Option> extraMetadata) { -Option compactionInstantTimeOpt = scheduleCompaction(extraMetadata); +String schedulingCompactionInstant = HoodieActiveTimeline.createNewInstantTime(); Review comment: rename: `compactionInstantTime` ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/lock/ConflictResolutionStrategy.java ## @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.client.lock; + +import org.apache.hudi.ApiMaturityLevel; +import org.apache.hudi.PublicAPIMethod; +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.metadata.HoodieBackedTableMetadataWriter; +import org.apache.hudi.table.HoodieTable; + +import java.util.stream.Stream; + +/** + * Strategy interface for conflict resolution with multiple writers. + * Users can provide pluggable implementations for different kinds of strategies to resolve conflicts when multiple + * writers are mutating the hoodie table. + */ +public interface ConflictResolutionStrategy { + + /** + * Stream of instants to check conflicts against. + * @return + */ + Stream getInstantsStream(HoodieActiveTimeline activeTimeline, HoodieInstant currentInstant, Option lastSuccessfulInstant); + + /** + * Implementations of this method will determine whether a conflict exists between 2 commits. + * @param thisOperation + * @param otherOperation + * @return + */ + @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING) + boolean hasConflict(HoodieCommitOperation thisOperation,
[GitHub] [hudi] codecov-io commented on pull request #2632: [HUDI-1661] Exclude clustering commits from TimelineUtils API
codecov-io commented on pull request #2632: URL: https://github.com/apache/hudi/pull/2632#issuecomment-790858581 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2632?src=pr=h1) Report > Merging [#2632](https://codecov.io/gh/apache/hudi/pull/2632?src=pr=desc) (fd53720) into [master](https://codecov.io/gh/apache/hudi/commit/899ae70fdb70c1511c099a64230fd91b2fe8d4ee?el=desc) (899ae70) will **increase** coverage by `17.85%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2632/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2632?src=pr=tree)

```diff
@@             Coverage Diff              @@
##           master    #2632       +/-   ##
===========================================
+ Coverage   51.58%   69.44%   +17.85%
+ Complexity   3285      363     -2922
===========================================
  Files         446       53      -393
  Lines       20409     1944    -18465
  Branches     2116      235     -1881
===========================================
- Hits        10528     1350     -9178
+ Misses       9003      460     -8543
+ Partials      878      134      -744
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `69.44% <ø> (ø)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2632?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...org/apache/hudi/cli/commands/RollbacksCommand.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL1JvbGxiYWNrc0NvbW1hbmQuamF2YQ==) | | | | | [.../org/apache/hudi/common/model/HoodieFileGroup.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUZpbGVHcm91cC5qYXZh) | | | | | [...metadata/HoodieMetadataMergedLogRecordScanner.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvSG9vZGllTWV0YWRhdGFNZXJnZWRMb2dSZWNvcmRTY2FubmVyLmphdmE=) | | | | | [...di-cli/src/main/java/org/apache/hudi/cli/Main.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL01haW4uamF2YQ==) | | | | | [...mmon/table/log/AbstractHoodieLogRecordScanner.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9BYnN0cmFjdEhvb2RpZUxvZ1JlY29yZFNjYW5uZXIuamF2YQ==) | | | | | [.../versioning/clean/CleanPlanV2MigrationHandler.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL3ZlcnNpb25pbmcvY2xlYW4vQ2xlYW5QbGFuVjJNaWdyYXRpb25IYW5kbGVyLmphdmE=) | | | | | [...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZVNwYXJrU3FsV3JpdGVyLnNjYWxh) | | | | | 
[.../main/scala/org/apache/hudi/cli/SparkHelpers.scala](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL2NsaS9TcGFya0hlbHBlcnMuc2NhbGE=) | | | | | [...src/main/java/org/apache/hudi/QuickstartUtils.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvUXVpY2tzdGFydFV0aWxzLmphdmE=) | | | | | [...apache/hudi/cli/commands/HoodieLogFileCommand.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL0hvb2RpZUxvZ0ZpbGVDb21tYW5kLmphdmE=) | | | | | ... and [382 more](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree-more) | |
[jira] [Updated] (HUDI-1661) Change utility methods that help get extra metadata to ignore internal rewrite commits
[ https://issues.apache.org/jira/browse/HUDI-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1661: - Labels: pull-request-available (was: ) > Change utility methods that help get extra metadata to ignore internal > rewrite commits > -- > > Key: HUDI-1661 > URL: https://issues.apache.org/jira/browse/HUDI-1661 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: satish >Assignee: satish >Priority: Major > Labels: pull-request-available > > Exclude clustering commits from getExtraMetadataFromLatest API used for > getting user stored information such as checkpoint. > Provide a new API incase there are use cases that store same metadata for > internal (clustering) and external (commit/deltacommit). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] satishkotha opened a new pull request #2632: [HUDI-1661] Exclude clustering commits from TimelineUtils API
satishkotha opened a new pull request #2632: URL: https://github.com/apache/hudi/pull/2632

## What is the purpose of the pull request
Exclude internal rewrite commits, such as clustering commits, from the getExtraMetadataFromLatest API.

## Brief change log
The getExtraMetadataFromLatest API is used for getting user-stored information such as checkpoints. Clustering commits don't have this information, so exclude clustering from this API. Provide a new API for use cases that require reading extra metadata from all commits, including clustering.

## Verify this pull request
This change added tests.

## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
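The intended behavior can be sketched as a timeline filter: skip clustering (replace-commit) instants when resolving the latest extra metadata, since they carry no user checkpoint. The types below are simplified stand-ins, not Hudi's actual timeline classes, and the `"replacecommit"` action name is used here only as an illustrative label:

```java
import java.util.*;

// Sketch of excluding internal rewrite commits from a "latest extra
// metadata" lookup. Each instant is modeled as an (action, extraMetadata)
// pair in commit order.
class TimelineSketch {
    static Optional<String> extraMetadataFromLatest(List<Map.Entry<String, String>> instants) {
        return instants.stream()
            .filter(i -> !"replacecommit".equals(i.getKey())) // drop clustering commits
            .reduce((a, b) -> b)                              // latest remaining instant
            .map(Map.Entry::getValue);
    }
}
```

With this filter, a clustering commit landing after a delta commit no longer hides the delta commit's checkpoint, which is the behavior the PR description asks for.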
[jira] [Created] (HUDI-1661) Change utility methods that help get extra metadata to ignore internal rewrite commits
satish created HUDI-1661: Summary: Change utility methods that help get extra metadata to ignore internal rewrite commits Key: HUDI-1661 URL: https://issues.apache.org/jira/browse/HUDI-1661 Project: Apache Hudi Issue Type: Sub-task Reporter: satish Assignee: satish Exclude clustering commits from the getExtraMetadataFromLatest API, which is used for getting user-stored information such as checkpoints. Provide a new API in case there are use cases that store the same metadata for internal (clustering) and external (commit/deltacommit) commits.
[GitHub] [hudi] n3nash commented on pull request #2628: [HUDI-1635] Improvements to Hudi Test Suite
n3nash commented on pull request #2628: URL: https://github.com/apache/hudi/pull/2628#issuecomment-790821135 @modi95 can you review this?
[GitHub] [hudi] codecov-io edited a comment on pull request #2611: [HUDI-1646] Provide mechanism to read uncommitted data through InputFormat
codecov-io edited a comment on pull request #2611: URL: https://github.com/apache/hudi/pull/2611#issuecomment-787641189 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2611?src=pr=h1) Report > Merging [#2611](https://codecov.io/gh/apache/hudi/pull/2611?src=pr=desc) (427523d) into [master](https://codecov.io/gh/apache/hudi/commit/be257b58c689510a21529019a766b7a2bfc7ebe6?el=desc) (be257b5) will **decrease** coverage by `41.65%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2611/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2611?src=pr=tree)

```diff
@@             Coverage Diff              @@
##             master   #2611       +/-   ##
============================================
- Coverage     51.27%   9.61%    -41.66%
+ Complexity     3241      48      -3193
============================================
  Files           438       53       -385
  Lines         20126     1944     -18182
  Branches       2079      235      -1844
============================================
- Hits          10320      187     -10133
+ Misses         8954     1744      -7210
+ Partials        852       13       -839
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.61% <ø> (-59.83%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2611?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%>
[GitHub] [hudi] n3nash opened a new pull request #2631: [HUDI 1660] Excluding compaction and clustering instants from inflight rollback
n3nash opened a new pull request #2631: URL: https://github.com/apache/hudi/pull/2631 ## What is the purpose of the pull request *This PR fixes a bug to ensure that pending compaction & clustering operations are always excluded when performing inflight rollbacks* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
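The exclusion this PR describes can be sketched as a filter over the inflight instants. This is a minimal sketch under assumed action names (`compaction`, and `replacecommit` for clustering), not Hudi's actual rollback code.

```java
import java.util.List;
import java.util.Set;

public class RollbackCandidateSketch {

    // Hypothetical stand-in for an inflight instant on the timeline.
    record Instant(String timestamp, String action) {}

    // Pending compaction and clustering are intentional, scheduled operations,
    // so they must never be treated as failed writes and rolled back.
    static final Set<String> EXCLUDED_ACTIONS = Set.of("compaction", "replacecommit");

    static List<Instant> inflightRollbackCandidates(List<Instant> inflightInstants) {
        return inflightInstants.stream()
                .filter(i -> !EXCLUDED_ACTIONS.contains(i.action()))
                .toList();
    }

    public static void main(String[] args) {
        List<Instant> inflight = List.of(
                new Instant("001", "deltacommit"),    // a genuinely failed write: roll back
                new Instant("002", "compaction"),     // pending compaction: keep
                new Instant("003", "replacecommit")); // pending clustering: keep
        System.out.println(inflightRollbackCandidates(inflight));
    }
}
```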
[jira] [Created] (HUDI-1660) Exclude pending compaction & clustering from rollback
Nishith Agarwal created HUDI-1660: - Summary: Exclude pending compaction & clustering from rollback Key: HUDI-1660 URL: https://issues.apache.org/jira/browse/HUDI-1660 Project: Apache Hudi Issue Type: Bug Components: Writer Core Reporter: Nishith Agarwal Assignee: Nishith Agarwal
[jira] [Updated] (HUDI-1656) Loading history data to new hudi table taking longer time
[ https://issues.apache.org/jira/browse/HUDI-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fredrick jose antony cruz updated HUDI-1656: Description: spark-submit --jars /u/users/svcordrdats/order_hudi_poc/hudi-support-jars/org.apache.avro_avro-1.8.2.jar,/u/users/svcordrdats/order_hudi_poc/hudi-support-jars/spark-avro_2.11-2.4.4.jar,/u/users/svcordrdats/order_hudi_poc/hudi-support-jars/hudi-spark-bundle_2.11-0.7.0.jar --master yarn --deploy-mode cluster --num-executors 50 --executor-cores 4 --executor-memory 32g --driver-memory=24g --queue=default --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.driver.extraClassPath=org.apache.avro_avro-1.8.2.jar:spark-avro_2.11-2.4.4.jar --conf spark.executor.extraClassPath=org.apache.avro_avro-1.8.2.jar:spark-avro_2.11-2.4.4.jar:hudi-spark-bundle_2.11-0.7.0.jar --conf spark.memory.fraction=0.2 --driver-java-options "-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof" --files /usr/hdp/current/spark2-client/conf/hive-site.xml --class com.walmart.gis.order.workflows.WorkflowController lib/orders-poc-1.0.41-SNAPSHOT-shaded.jar workflow="stgStsWorkflow" runmode="global" we are running on GCS cluster with 3 TB, 29 node cluster 870 v. core. 
pom 1.8 2.11.12 2.3.0 1.8.2 2.4.4 0.7.0 1.4.0 UTF-8 1.8 1.8

stsDailyDf.write.format("org.apache.hudi")
  .option("hoodie.cleaner.commits.retained", 2)
  .option("hoodie.copyonwrite.record.size.estimate", 70)
  .option("hoodie.parquet.small.file.limit", 1)
  .option("hoodie.parquet.max.file.size", 12800)
  .option("hoodie.index.bloom.num_entries", 180)
  .option("hoodie.bloom.index.filter.type", "DYNAMIC_V0")
  .option("hoodie.bloom.index.filter.dynamic.max.entries", 250)
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.storage.type", "COPY_ON_WRITE")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "sales_order_sts_line_key")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "REL_STS_DT")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "src_upd_ts")
  .option(HoodieWriteConfig.TABLE_NAME, tableName.toString)
  .option("hoodie.bloom.index.bucketized.checking", "false")
  .mode(SaveMode.Append)
  .save(tablePath.toString)

> Loading history data to new hudi table taking longer time
> -
>
>                 Key: HUDI-1656
>                 URL: https://issues.apache.org/jira/browse/HUDI-1656
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: newbie
>            Reporter: Fredrick jose antony cruz
>            Priority: Major
>             Fix For: 0.7.0
[jira] [Updated] (HUDI-1659) Support DDL And Insert For Hudi In Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pengzhiwei updated HUDI-1659: - Summary: Support DDL And Insert For Hudi In Spark Sql (was: DDL Support For Hudi In Spark Sql)
> Support DDL And Insert For Hudi In Spark Sql
> -
>
>                 Key: HUDI-1659
>                 URL: https://issues.apache.org/jira/browse/HUDI-1659
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Spark Integration
>            Reporter: pengzhiwei
>            Assignee: pengzhiwei
>            Priority: Major
[GitHub] [hudi] rswagatika commented on issue #2564: Hoodie clean is not deleting old files
rswagatika commented on issue #2564: URL: https://github.com/apache/hudi/issues/2564#issuecomment-790707124 @bvaradar Hi, I was able to get the files; I am still working on the driver log and will provide it once I have it. Let me know if this is what you meant by the recursive S3 dataset listing:
2021-02-10 18:44:03  0 Bytes  my_test/.hoodie/.aux/.bootstrap/.fileids_$folder$
2021-02-10 18:44:02  0 Bytes  my_test/.hoodie/.aux/.bootstrap/.partitions_$folder$
2021-02-10 18:44:02  0 Bytes  my_test/.hoodie/.aux/.bootstrap_$folder$
2021-02-12 12:19:22  3.9 KiB  my_test/.hoodie/.aux/20210212171919.compaction.requested
2021-02-12 16:12:23  3.9 KiB  my_test/.hoodie/.aux/20210212211220.compaction.requested
2021-02-10 18:44:02  0 Bytes  my_test/.hoodie/.aux_$folder$
2021-02-12 12:19:24  0 Bytes  my_test/.hoodie/.temp/20210212171919/range_partition=default/af837aeb-3881-43f0-8aa4-434502059b68-0_0-33-41_20210212171919.parquet.marker.MERGE
2021-02-12 12:19:24  0 Bytes  my_test/.hoodie/.temp/20210212171919/range_partition=default_$folder$
2021-02-12 12:19:24  0 Bytes  my_test/.hoodie/.temp/20210212171919_$folder$
2021-02-12 16:12:25  0 Bytes  my_test/.hoodie/.temp/20210212211220/range_partition=default/af837aeb-3881-43f0-8aa4-434502059b68-0_0-41-43_20210212211220.parquet.marker.MERGE
2021-02-12 16:12:25  0 Bytes  my_test/.hoodie/.temp/20210212211220/range_partition=default_$folder$
2021-02-12 16:12:25  0 Bytes  my_test/.hoodie/.temp/20210212211220_$folder$
2021-02-10 18:44:02  0 Bytes  my_test/.hoodie/.temp_$folder$
2021-02-10 19:23:35  969 Bytes  my_test/.hoodie/20210211002333.rollback
2021-02-10 19:23:35  0 Bytes  my_test/.hoodie/20210211002333.rollback.inflight
2021-02-10 19:24:22  969 Bytes  my_test/.hoodie/20210211002421.rollback
2021-02-10 19:24:22  0 Bytes  my_test/.hoodie/20210211002421.rollback.inflight
2021-02-10 19:34:30  969 Bytes  my_test/.hoodie/20210211003429.rollback
2021-02-10 19:34:30  0 Bytes  my_test/.hoodie/20210211003429.rollback.inflight
2021-02-10 19:35:18  969 Bytes  my_test/.hoodie/20210211003516.rollback
2021-02-10 19:35:18  0 Bytes  my_test/.hoodie/20210211003516.rollback.inflight
2021-02-10 19:35:51  969 Bytes  my_test/.hoodie/20210211003549.rollback
2021-02-10 19:35:50  0 Bytes  my_test/.hoodie/20210211003549.rollback.inflight
2021-02-10 19:36:39  969 Bytes  my_test/.hoodie/20210211003637.rollback
2021-02-10 19:36:38  0 Bytes  my_test/.hoodie/20210211003637.rollback.inflight
2021-02-10 19:37:13  969 Bytes  my_test/.hoodie/20210211003711.rollback
2021-02-10 19:37:12  0 Bytes  my_test/.hoodie/20210211003711.rollback.inflight
2021-02-10 19:38:02  969 Bytes  my_test/.hoodie/20210211003800.rollback
2021-02-10 19:38:01  0 Bytes  my_test/.hoodie/20210211003800.rollback.inflight
2021-02-11 01:05:37  969 Bytes  my_test/.hoodie/20210211060535.rollback
2021-02-11 01:05:36  0 Bytes  my_test/.hoodie/20210211060535.rollback.inflight
2021-02-11 03:36:47  969 Bytes  my_test/.hoodie/20210211083645.rollback
2021-02-11 03:36:47  0 Bytes  my_test/.hoodie/20210211083645.rollback.inflight
2021-02-11 06:25:37  969 Bytes  my_test/.hoodie/2021022535.rollback
2021-02-11 06:25:37  0 Bytes  my_test/.hoodie/2021022535.rollback.inflight
2021-02-11 21:21:47  969 Bytes  my_test/.hoodie/20210212022145.rollback
2021-02-11 21:21:46  0 Bytes  my_test/.hoodie/20210212022145.rollback.inflight
2021-02-12 10:24:29  1021 Bytes  my_test/.hoodie/20210212152423.rollback
2021-02-12 10:24:29  0 Bytes  my_test/.hoodie/20210212152423.rollback.inflight
2021-02-12 12:19:20  1.5 KiB  my_test/.hoodie/20210212171816.deltacommit
2021-02-12 12:19:17  1.6 KiB  my_test/.hoodie/20210212171816.deltacommit.inflight
2021-02-12 12:18:22  0 Bytes  my_test/.hoodie/20210212171816.deltacommit.requested
2021-02-12 12:21:27  1.5 KiB  my_test/.hoodie/20210212171919.commit
2021-02-12 12:19:22  0 Bytes  my_test/.hoodie/20210212171919.compaction.inflight
2021-02-12 12:19:22  3.9 KiB  my_test/.hoodie/20210212171919.compaction.requested
2021-02-12 14:55:26  1.5 KiB  my_test/.hoodie/20210212195421.deltacommit
2021-02-12 14:55:23  1.6 KiB  my_test/.hoodie/20210212195421.deltacommit.inflight
2021-02-12 14:54:28  0 Bytes  my_test/.hoodie/20210212195421.deltacommit.requested
2021-02-12 16:12:21  1.5 KiB  my_test/.hoodie/2021021227.deltacommit
2021-02-12 16:12:18  1.6 KiB  my_test/.hoodie/2021021227.deltacommit.inflight
2021-02-12 16:11:24  0 Bytes  my_test/.hoodie/2021021227.deltacommit.requested
2021-02-12 16:14:28  1.5 KiB  my_test/.hoodie/20210212211220.commit
2021-02-12 16:12:23  0 Bytes  my_test/.hoodie/20210212211220.compaction.inflight
2021-02-12 16:12:23  3.9 KiB  my_test/.hoodie/20210212211220.compaction.requested
2021-02-11 09:45:40  26.4 KiB  my_test/.hoodie/archived/.commits_.archive.10_1-0-1
2021-02-11 16:24:45  26.5
[jira] [Created] (HUDI-1659) DDL Support For Hudi In Spark Sql
pengzhiwei created HUDI-1659: Summary: DDL Support For Hudi In Spark Sql Key: HUDI-1659 URL: https://issues.apache.org/jira/browse/HUDI-1659 Project: Apache Hudi Issue Type: Sub-task Components: Spark Integration Reporter: pengzhiwei Assignee: pengzhiwei
[jira] [Created] (HUDI-1658) Spark Sql Support For Hudi
pengzhiwei created HUDI-1658: Summary: Spark Sql Support For Hudi Key: HUDI-1658 URL: https://issues.apache.org/jira/browse/HUDI-1658 Project: Apache Hudi Issue Type: New Feature Components: Spark Integration Reporter: pengzhiwei Assignee: pengzhiwei Fix For: 0.8.0 This is the main task for supporting Spark SQL for Hudi, covering DDL, DML, and the Hoodie CLI commands.
[GitHub] [hudi] guanziyue opened a new issue #2630: [SUPPORT]Confuse about the strategy to evaluate average record size
guanziyue opened a new issue #2630: URL: https://github.com/apache/hudi/issues/2630 **Describe the problem you faced** In the UpsertPartitioner class, the method averageBytesPerRecord is used to estimate the average record size from previous commits. A fileSizeThreshold is used to filter which commits are eligible for the estimate; it is calculated as "hoodieWriteConfig.getRecordSizeEstimationThreshold() * hoodieWriteConfig.getParquetSmallFileLimit();". Could you explain why it works this way? It seems a commit must be large enough before this strategy is activated. **Expected behavior** Could a smaller threshold also work? What about using hoodie.parquet.max.file.size? **Environment Description** * Hudi version: 0.6.0
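The thresholding the question describes can be sketched as follows. This is a hypothetical simplification of `UpsertPartitioner.averageBytesPerRecord`: commits are flattened to `{bytesWritten, recordsWritten}` pairs, and the constants shown (`1.0` for the estimation threshold, 100 MB for `hoodie.parquet.small.file.limit`) are assumed defaults, not values read from a real config.

```java
public class AverageRecordSizeSketch {

    // Assumed defaults; in Hudi these come from HoodieWriteConfig.
    static final double RECORD_SIZE_ESTIMATION_THRESHOLD = 1.0;
    static final long PARQUET_SMALL_FILE_LIMIT = 100L * 1024 * 1024; // 100 MB

    // Walk commits newest-first; only a commit that wrote more bytes than
    // threshold = estimationThreshold * smallFileLimit is trusted for the
    // estimate, which is why a "huge enough" commit is needed to activate it.
    static long averageBytesPerRecord(long[][] commitsNewestFirst, long fallback) {
        long threshold = (long) (RECORD_SIZE_ESTIMATION_THRESHOLD * PARQUET_SMALL_FILE_LIMIT);
        for (long[] commit : commitsNewestFirst) {
            long totalBytes = commit[0], totalRecords = commit[1];
            if (totalBytes > threshold && totalRecords > 0) {
                return totalBytes / totalRecords;
            }
        }
        // No commit was big enough: fall back to the configured estimate
        // (e.g. hoodie.copyonwrite.record.size.estimate).
        return fallback;
    }

    public static void main(String[] args) {
        long[][] commits = {
                {50L * 1024 * 1024, 500_000},   // 50 MB: below threshold, skipped
                {200L * 1024 * 1024, 2_000_000} // 200 MB: used for the estimate
        };
        System.out.println(averageBytesPerRecord(commits, 1024)); // 104
    }
}
```

Under these assumed defaults, a commit smaller than roughly 100 MB never updates the estimate; writers that only ever produce small commits keep using the configured fallback.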
[GitHub] [hudi] danny0405 commented on a change in pull request #2613: [HUDI-1647] Supports snapshot read for Flink
danny0405 commented on a change in pull request #2613: URL: https://github.com/apache/hudi/pull/2613#discussion_r587409966 ## File path: hudi-flink/src/main/resources/META-INF/services/org.apache.flink.table.factories.TableFactory ## @@ -0,0 +1,17 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +org.apache.hudi.factory.HoodieTableFactory Review comment: Required for the java SPI service. ## File path: hudi-flink/src/test/resources/META-INF/services/org.apache.flink.table.factories.TableFactory ## @@ -0,0 +1,18 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +org.apache.hudi.factory.HoodieTableFactory +org.apache.hudi.utils.factory.ContinuousFileSourceFactory Review comment: Required for the java SPI service.
[GitHub] [hudi] yanghua commented on a change in pull request #2621: [HUDI-1655] Allow custom date format in DatePartitionPathSelector
yanghua commented on a change in pull request #2621: URL: https://github.com/apache/hudi/pull/2621#discussion_r587389540 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DatePartitionPathSelector.java ## @@ -130,20 +140,19 @@ public DatePartitionPathSelector(TypedProperties props, Configuration hadoopConf FileSystem fs = new Path(path).getFileSystem(serializedConf.get()); return listEligibleFiles(fs, new Path(path), lastCheckpointTime).stream(); }, partitionsListParallelism); -// sort them by modification time. - eligibleFiles.sort(Comparator.comparingLong(FileStatus::getModificationTime)); +// sort them by modification time ascending. +List sortedEligibleFiles = eligibleFiles.stream() + .sorted(Comparator.comparingLong(FileStatus::getModificationTime)).collect(Collectors.toList()); Review comment: Do you mean the old logic has this bug?
[GitHub] [hudi] garyli1019 commented on a change in pull request #2613: [HUDI-1647] Supports snapshot read for Flink
garyli1019 commented on a change in pull request #2613: URL: https://github.com/apache/hudi/pull/2613#discussion_r587262267 ## File path: hudi-flink/src/main/resources/META-INF/services/org.apache.flink.table.factories.TableFactory ## @@ -0,0 +1,17 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +org.apache.hudi.factory.HoodieTableFactory Review comment: is this file necessary? ## File path: hudi-flink/src/test/java/org/apache/hudi/source/HoodieDataSourceITCase.java ## @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.source; + +import org.apache.hudi.operator.FlinkOptions; +import org.apache.hudi.operator.utils.TestConfigurations; + +import org.apache.flink.table.api.EnvironmentSettings; +import org.apache.flink.table.api.TableEnvironment; +import org.apache.flink.table.api.TableResult; +import org.apache.flink.table.api.config.ExecutionConfigOptions; +import org.apache.flink.table.api.internal.TableEnvironmentImpl; +import org.apache.flink.test.util.AbstractTestBase; +import org.apache.flink.types.Row; +import org.apache.flink.util.CollectionUtil; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.io.TempDir; + +import java.io.File; +import java.util.Comparator; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Objects; +import java.util.concurrent.ExecutionException; +import java.util.stream.Collectors; + +import static org.hamcrest.CoreMatchers.is; +import static org.hamcrest.MatcherAssert.assertThat; + +/** + * IT cases for Hoodie table source and sink. + * + * Note: should add more SQL cases when batch write is supported. 
+ */ +public class HoodieDataSourceITCase extends AbstractTestBase { + private TableEnvironment streamTableEnv; + private TableEnvironment batchTableEnv; + + @BeforeEach + void beforeEach() { +EnvironmentSettings settings = EnvironmentSettings.newInstance().build(); +streamTableEnv = TableEnvironmentImpl.create(settings); +streamTableEnv.getConfig().getConfiguration() + .setInteger(ExecutionConfigOptions.TABLE_EXEC_RESOURCE_DEFAULT_PARALLELISM, 1); +streamTableEnv.getConfig().getConfiguration() +.setString("execution.checkpointing.interval", "2s"); + +settings = EnvironmentSettings.newInstance().inBatchMode().build(); +batchTableEnv = TableEnvironmentImpl.create(settings); +batchTableEnv.getConfig().getConfiguration() + .setInteger(ExecutionConfigOptions.TABLE_EXEC_RESOURCE_DEFAULT_PARALLELISM, 1); + } + + @TempDir + File tempFile; + + @Test + void testStreamWriteBatchRead() { Review comment: is that possible to add more test cases to cover COW/MOR(without comapction)/MOR(with compaction) and query? ## File path: hudi-flink/src/test/resources/META-INF/services/org.apache.flink.table.factories.TableFactory ## @@ -0,0 +1,18 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless
[GitHub] [hudi] chaplinthink opened a new pull request #2629: [DOCS] Fix docs hive_sync path
chaplinthink opened a new pull request #2629: URL: https://github.com/apache/hudi/pull/2629 Fix docs hive_sync path ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.