[GitHub] [hudi] codecov-io commented on pull request #2634: [HUDI-1662] Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
codecov-io commented on pull request #2634: URL: https://github.com/apache/hudi/pull/2634#issuecomment-791203944

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2634?src=pr=h1) Report
> Merging [#2634](https://codecov.io/gh/apache/hudi/pull/2634?src=pr=desc) (ccf9a8f) into [master](https://codecov.io/gh/apache/hudi/commit/899ae70fdb70c1511c099a64230fd91b2fe8d4ee?el=desc) (899ae70) will **increase** coverage by `10.05%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2634/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2634?src=pr=tree)

```diff
@@              Coverage Diff              @@
##             master    #2634       +/-   ##
=============================================
+ Coverage     51.58%   61.64%   +10.05%
+ Complexity     3285      325     -2960
=============================================
  Files           446       53      -393
  Lines         20409     1963    -18446
  Branches       2116      235     -1881
=============================================
- Hits          10528     1210     -9318
+ Misses         9003      630     -8373
+ Partials        878      123      -755
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `61.64% <ø> (-7.81%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2634?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...ies/exception/HoodieSnapshotExporterException.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2V4Y2VwdGlvbi9Ib29kaWVTbmFwc2hvdEV4cG9ydGVyRXhjZXB0aW9uLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==) | `5.17% <0.00%> (-83.63%)` | `0.00% <0.00%> (-28.00%)` | | | [...hudi/utilities/schema/JdbcbasedSchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9KZGJjYmFzZWRTY2hlbWFQcm92aWRlci5qYXZh) | `0.00% <0.00%> (-72.23%)` | `0.00% <0.00%> (-2.00%)` | | | [...he/hudi/utilities/transform/AWSDmsTransformer.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3RyYW5zZm9ybS9BV1NEbXNUcmFuc2Zvcm1lci5qYXZh) | `0.00% <0.00%> (-66.67%)` | `0.00% <0.00%> (-2.00%)` | | | [...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=) | `41.86% <0.00%> (-22.68%)` | `27.00% <0.00%> (-6.00%)` | | | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `70.34% <0.00%> (-0.37%)` | `53.00% <0.00%> (+1.00%)` | :arrow_down: | | 
[...nal/HoodieBulkInsertDataInternalWriterFactory.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3BhcmsyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2ludGVybmFsL0hvb2RpZUJ1bGtJbnNlcnREYXRhSW50ZXJuYWxXcml0ZXJGYWN0b3J5LmphdmE=) | | | | | [.../apache/hudi/common/fs/TimedFSDataInputStream.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL1RpbWVkRlNEYXRhSW5wdXRTdHJlYW0uamF2YQ==) | | | | | [...g/apache/hudi/exception/HoodieRemoteException.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZVJlbW90ZUV4Y2VwdGlvbi5qYXZh) | | | | | [...a/org/apache/hudi/common/util/ValidationUtils.java](https://codecov.io/gh/apache/hudi/pull/2634/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvVmFsaWRhdGlvblV0aWxzLmphdmE=) | | | | | ... and [390
[GitHub] [hudi] garyli1019 commented on pull request #2634: [HUDI-1662] Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
garyli1019 commented on pull request #2634: URL: https://github.com/apache/hudi/pull/2634#issuecomment-791202125 @xiarixiaoyao your force push already triggered the CI. Do you mean JIRA contributor access? If so, would you send an email to the dev mailing list with your JIRA ID? That's how we usually grant access to new contributors. Thanks This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xiarixiaoyao commented on pull request #2634: [HUDI-1662] Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
xiarixiaoyao commented on pull request #2634: URL: https://github.com/apache/hudi/pull/2634#issuecomment-791188663

Thank you @garyli1019, but I don't have permission to trigger the CI; could you help? BTW, could you grant me contributor permission? Thank you.
[GitHub] [hudi] satishkotha commented on a change in pull request #2635: [WIP] [DO NOT MERGE] sample code for measuring hfile performance with column ranges
satishkotha commented on a change in pull request #2635: URL: https://github.com/apache/hudi/pull/2635#discussion_r588059204

## File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/io/storage/TestHoodieFileWriterFactory.java

## @@ -64,4 +80,246 @@ public void testGetFileWriter() throws IOException {
     }, "should fail since log storage writer is not supported yet.");
     assertTrue(thrown.getMessage().contains("format not supported yet."));
   }
+
+  // key: full file path (/tmp/.../partition/file-000.parquet, value: column range
+  @Test
+  public void testPerformanceRangeKeyPartitionFile() throws IOException {
+    final String instantTime = "100";
+    final HoodieWriteConfig cfg = getConfig();
+    final Path hfilePath = new Path(basePath + "/hfile_partition/f1_1-0-1_000.hfile");
+    HoodieTable table = HoodieSparkTable.create(cfg, context, metaClient);
+    HoodieHFileWriter hfileWriter = (HoodieHFileWriter) HoodieFileWriterFactory.getFileWriter(instantTime,
+        hfilePath, table, cfg, HoodieMetadataRecord.SCHEMA$, supplier);
+    Random random = new Random();
+
+    int numPartitions = 1000;
+    int avgFilesPerPartition = 10;
+
+    long startTime = System.currentTimeMillis();
+    List partitions = new ArrayList<>();
+    for (int i = 0; i < numPartitions; i++) {
+      String partitionPath = "partition-" + String.format("%010d", i);
+      partitions.add(partitionPath);
+      for (int j = 0; j < avgFilesPerPartition; j++) {
+        String filePath = "file-" + String.format("%010d", j) + "_1-0-1_000.parquet";
+        int max = random.nextInt();
+        if (max < 0) {
+          max = -max;
+        }
+        int min = random.nextInt(max);
+
+        HoodieKey key = new HoodieKey(partitionPath + filePath, partitionPath);
+        GenericRecord rec = new GenericData.Record(HoodieMetadataRecord.SCHEMA$);
+        rec.put("key", key.getRecordKey());
+        rec.put("type", 2);
+        rec.put("rangeIndexMetadata", HoodieRangeIndexInfo.newBuilder().setMax("" + max).setMin("" + min).setIsDeleted(false).build());
+        hfileWriter.writeAvro(key.getRecordKey(), rec);
+      }
+    }
+
+    hfileWriter.close();
+    long durationInMs = System.currentTimeMillis() - startTime;
+    System.out.println("Time taken to generate & write: " + durationInMs + " ms. file path: " + hfilePath
+        + " File size: " + FSUtils.getFileSize(metaClient.getFs(), hfilePath));
+
+    CacheConfig cacheConfig = new CacheConfig(hadoopConf);
+    cacheConfig.setCacheDataInL1(false);
+    HoodieHFileReader reader = new HoodieHFileReader(hadoopConf, hfilePath, cacheConfig);
+    long duration = 0;
+    int numRuns = 1000;
+    long numRecordsInRange = 0;
+    for (int i = 0; i < numRuns; i++) {
+      int partitionPicked = Math.max(0, partitions.size() - 30);
+      long start = System.currentTimeMillis();
+      Map records = reader.getRecordsInRange(partitions.get(partitionPicked), partitions.get(partitions.size() - 1));
+      duration += (System.currentTimeMillis() - start);
+      numRecordsInRange += records.size();
+    }
+    double avgDuration = duration / (double) numRuns;
+    double avgRecordsFetched = numRecordsInRange / (double) numRuns;
+    System.out.println("Average time taken to lookup a range: " + avgDuration + "ms. Avg number records: " + avgRecordsFetched);
+  }
+
+  // key: partition (partition), value: map (filePath -> column range)
+  @Test
+  public void testPerformanceRangeKeyPartition() throws IOException {
+    final String instantTime = "100";
+    final HoodieWriteConfig cfg = getConfig();
+    final Path hfilePath = new Path(basePath + "/hfile_partition/f1_1-0-1_000.hfile");
+    HoodieTable table = HoodieSparkTable.create(cfg, context, metaClient);
+    HoodieHFileWriter hfileWriter = (HoodieHFileWriter) HoodieFileWriterFactory.getFileWriter(instantTime,
+        hfilePath, table, cfg, HoodieMetadataRecord.SCHEMA$, supplier);
+    Random random = new Random();
+
+    int numPartitions = 1;
+    int avgFilesPerPartition = 10;
+
+    long startTime = System.currentTimeMillis();
+    List partitions = new ArrayList<>();
+    for (int i = 0; i < numPartitions; i++) {
+      String partitionPath = "partition-" + String.format("%010d", i);
+      partitions.add(partitionPath);
+      Map fileToRangeInfo = new HashMap<>();
+      for (int j = 0; j < avgFilesPerPartition; j++) {
+        String filePath = "file-" + String.format("%010d", j) + "_1-0-1_000.parquet";
+        int max = random.nextInt();
+        if (max < 0) {
+          max = -max;
+        }
+        int min = random.nextInt(max);
+        fileToRangeInfo.put(filePath, HoodieRangeIndexInfo.newBuilder().setMax("" + max).setMin("" + min).setIsDeleted(false).build());
+      }
+
+      HoodieKey key = new HoodieKey(partitionPath, partitionPath);
+      GenericRecord rec = new GenericData.Record(HoodieMetadataRecord.SCHEMA$);
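A side note on the random range generation in the benchmark above: `int max = random.nextInt(); if (max < 0) max = -max;` can still produce a non-positive bound, because negating `Integer.MIN_VALUE` overflows back to `Integer.MIN_VALUE` and zero is also possible, and `Random.nextInt(bound)` throws `IllegalArgumentException` for any non-positive bound. A hedged sketch of a safer variant (illustrative only, not part of the PR; the class and method names are made up for the example):

```java
import java.util.Random;

public class RangePair {
    // Draw a (min, max) pair with 0 <= min < max.
    // random.nextInt(Integer.MAX_VALUE - 1) yields a value in [0, MAX_VALUE - 2];
    // adding 1 gives a strictly positive bound, so nextInt(max) never throws.
    static int[] randomRange(Random random) {
        int max = random.nextInt(Integer.MAX_VALUE - 1) + 1; // in [1, MAX_VALUE - 1]
        int min = random.nextInt(max);                       // in [0, max)
        return new int[] {min, max};
    }
}
```

This keeps the benchmark's intent (a random column range per file) while avoiding the rare crash on an unlucky draw.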
[GitHub] [hudi] satishkotha opened a new pull request #2635: [WIP] [DO NOT MERGE] sample code for measuring hfile performance with column ranges
satishkotha opened a new pull request #2635: URL: https://github.com/apache/hudi/pull/2635

## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contributing.html before opening a pull request.*

## What is the purpose of the pull request

*(For example: This pull request adds quick-start document.)*

## Brief change log

*(for example:)*
- *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

## Verify this pull request

*(Please pick either of the following options)*

This pull request is a trivial rework / code cleanup without any test coverage.

*(or)*

This pull request is already covered by existing tests, such as *(please describe tests)*.

(or)

This change added tests and can be verified as follows:

*(example:)*
- *Added integration tests for end-to-end.*
- *Added HoodieClientWriteTest to verify the change.*
- *Manually verified the change by running a job locally.*

## Committer checklist

 - [ ] Has a corresponding JIRA in PR title & commit
 - [ ] Commit message is descriptive of the change
 - [ ] CI is green
 - [ ] Necessary doc changes done or have another open PR
 - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[jira] [Updated] (HUDI-1636) Support Builder Pattern To Build Table Properties For HoodieTableConfig
[ https://issues.apache.org/jira/browse/HUDI-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

vinoyang updated HUDI-1636:
---------------------------
    Fix Version/s: 0.8.0

> Support Builder Pattern To Build Table Properties For HoodieTableConfig
> -----------------------------------------------------------------------
>
>                 Key: HUDI-1636
>                 URL: https://issues.apache.org/jira/browse/HUDI-1636
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Common Core
>            Reporter: pengzhiwei
>            Assignee: pengzhiwei
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
> Currently we use HoodieTableMetaClient#initTableType to init the table properties. As the number of table properties increases, it is hard to maintain the code of the initTableType method, especially when we add a new table property. Supporting the builder pattern to build the table config properties may solve this problem.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
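As a rough illustration of what the issue asks for, here is a minimal, self-contained builder sketch. This is not the actual HoodieTableConfig API; the class name and property keys are placeholders chosen for the example.

```java
import java.util.Properties;

// Illustrative only: a fluent builder that accumulates table properties and
// materializes them in one step. Adding a new table property then means adding
// one setter, instead of threading another positional argument through
// initTableType and every one of its call sites.
public class TablePropertyBuilder {
    private final Properties props = new Properties();

    public TablePropertyBuilder setTableType(String tableType) {
        props.setProperty("hoodie.table.type", tableType);
        return this;
    }

    public TablePropertyBuilder setTableName(String tableName) {
        props.setProperty("hoodie.table.name", tableName);
        return this;
    }

    public TablePropertyBuilder setPayloadClassName(String payloadClass) {
        props.setProperty("hoodie.payload.class", payloadClass);
        return this;
    }

    public Properties build() {
        return props;
    }
}
```

A call site then reads as a self-describing chain, e.g. `new TablePropertyBuilder().setTableType("MERGE_ON_READ").setTableName("trips").build()`, which stays readable as the number of properties grows.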
[hudi] branch master updated: [HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig (#2596)
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new bc883db  [HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig (#2596)

bc883db is described below

commit bc883db5de5832fa429bbb04a35d3606fdacdb2a
Author: pengzhiwei
AuthorDate: Fri Mar 5 14:10:27 2021 +0800

    [HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig (#2596)
---
 .../org/apache/hudi/cli/commands/TableCommand.java |  12 +-
 .../metadata/HoodieBackedTableMetadataWriter.java  |  11 +-
 .../TestHoodieClientOnCopyOnWriteStorage.java      |  23 +-
 .../java/org/apache/hudi/client/TestMultiFS.java   |  14 +-
 .../hudi/client/TestTableSchemaEvolution.java      |  15 +-
 .../hudi/testutils/FunctionalTestHarness.java      |  12 +-
 .../hudi/common/table/HoodieTableMetaClient.java   | 276 +
 .../table/timeline/TestHoodieActiveTimeline.java   |   8 +-
 .../hudi/common/testutils/HoodieTestUtils.java     |   9 +-
 .../java/HoodieJavaWriteClientExample.java         |   7 +-
 .../examples/spark/HoodieWriteClientExample.java   |   7 +-
 .../java/org/apache/hudi/util/StreamerUtil.java    |  16 +-
 .../hudi/integ/testsuite/HoodieTestSuiteJob.java   |   8 +-
 .../org/apache/hudi/HoodieSparkSqlWriter.scala     |  26 +-
 .../hudi/functional/TestStreamingSource.scala      |  14 +-
 .../apache/hudi/hive/testutils/HiveTestUtil.java   |  22 +-
 .../apache/hudi/utilities/HDFSParquetImporter.java |   8 +-
 .../utilities/deltastreamer/BootstrapExecutor.java |  14 +-
 .../hudi/utilities/deltastreamer/DeltaSync.java    |  20 +-
 .../functional/TestHoodieSnapshotExporter.java     |   9 +-
 .../utilities/testutils/UtilitiesTestBase.java     |   7 +-
 21 files changed, 341 insertions(+), 197 deletions(-)

diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/TableCommand.java b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/TableCommand.java
index 168de26..d25e0c8 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/TableCommand.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/TableCommand.java
@@ -22,7 +22,6 @@
 import org.apache.hudi.cli.HoodieCLI;
 import org.apache.hudi.cli.HoodiePrintHelper;
 import org.apache.hudi.cli.TableHeader;
 import org.apache.hudi.common.fs.ConsistencyGuardConfig;
-import org.apache.hudi.common.model.HoodieTableType;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.exception.TableNotFoundException;
@@ -106,10 +105,13 @@ public class TableCommand implements CommandMarker {
       throw new IllegalStateException("Table already existing in path : " + path);
     }
-    final HoodieTableType tableType = HoodieTableType.valueOf(tableTypeStr);
-    HoodieTableMetaClient.initTableType(HoodieCLI.conf, path, tableType, name, archiveFolder,
-        payloadClass, layoutVersion);
-
+    HoodieTableMetaClient.withPropertyBuilder()
+      .setTableType(tableTypeStr)
+      .setTableName(name)
+      .setArchiveLogFolder(archiveFolder)
+      .setPayloadClassName(payloadClass)
+      .setTimelineLayoutVersion(layoutVersion)
+      .initTable(HoodieCLI.conf, path);
     // Now connect to ensure loading works
     return connect(path, layoutVersion, false, 0, 0, 0);
   }
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index 6629410..dbd678f 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -288,9 +288,14 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableMeta
     String createInstantTime = latestInstant.map(HoodieInstant::getTimestamp).orElse(SOLO_COMMIT_TIMESTAMP);
     LOG.info("Creating a new metadata table in " + metadataWriteConfig.getBasePath() + " at instant " + createInstantTime);
-    HoodieTableMetaClient.initTableType(hadoopConf.get(), metadataWriteConfig.getBasePath(),
-    HoodieTableType.MERGE_ON_READ, tableName, "archived", HoodieMetadataPayload.class.getName(),
-    HoodieFileFormat.HFILE.toString());
+    HoodieTableMetaClient.withPropertyBuilder()
+      .setTableType(HoodieTableType.MERGE_ON_READ)
+      .setTableName(tableName)
+      .setArchiveLogFolder("archived")
+      .setPayloadClassName(HoodieMetadataPayload.class.getName())
+      .setBaseFileFormat(HoodieFileFormat.HFILE.toString())
+      .initTable(hadoopConf.get(), metadataWriteConfig.getBasePath());
+
     initTableMetadata();

     // List all partitions in the basePath of the containing dataset
diff --git
[jira] [Closed] (HUDI-1636) Support Builder Pattern To Build Table Properties For HoodieTableConfig
[ https://issues.apache.org/jira/browse/HUDI-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

vinoyang closed HUDI-1636.
--------------------------
    Resolution: Done

> Support Builder Pattern To Build Table Properties For HoodieTableConfig
> -----------------------------------------------------------------------
>
>                 Key: HUDI-1636
>                 URL: https://issues.apache.org/jira/browse/HUDI-1636
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Common Core
>            Reporter: pengzhiwei
>            Assignee: pengzhiwei
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.8.0
[GitHub] [hudi] yanghua merged pull request #2596: [HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig
yanghua merged pull request #2596: URL: https://github.com/apache/hudi/pull/2596
[GitHub] [hudi] yanghua commented on a change in pull request #2596: [HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig
yanghua commented on a change in pull request #2596: URL: https://github.com/apache/hudi/pull/2596#discussion_r588054717

## File path: hudi-examples/src/main/java/org/apache/hudi/examples/java/HoodieJavaWriteClientExample.java

## @@ -72,8 +72,11 @@ public static void main(String[] args) throws Exception {
     Path path = new Path(tablePath);
     FileSystem fs = FSUtils.getFs(tablePath, hadoopConf);
     if (!fs.exists(path)) {
-      HoodieTableMetaClient.initTableType(hadoopConf, tablePath, HoodieTableType.valueOf(tableType),
-          tableName, HoodieAvroPayload.class.getName());
+      HoodieTableConfig.propertyBuilder()
+        .setTableType(tableType)
+        .setTableName(tableName)
+        .setPayloadClassName(HoodieAvroPayload.class.getName())
+        .initTable(hadoopConf, tablePath);

Review comment: Agree to move forward and do some subsequent refactoring.
[hudi] branch master updated: [HUDI-1655] Support custom date format and fix unsupported exception in DatePartitionPathSelector (#2621)
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new f53bca4  [HUDI-1655] Support custom date format and fix unsupported exception in DatePartitionPathSelector (#2621)

f53bca4 is described below

commit f53bca404f1482e0e99ad683dd29bfaff8bfb8ab
Author: Raymond Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Thu Mar 4 21:01:51 2021 -0800

    [HUDI-1655] Support custom date format and fix unsupported exception in DatePartitionPathSelector (#2621)

    - Add a config to allow parsing a custom date format in `DatePartitionPathSelector`. Currently it assumes the date partition string is in the format `yyyy-MM-dd`.
    - Fix a bug where `UnsupportedOperationException` was thrown when sorting `eligibleFiles` in place. Changed to sort it and store the result in a new list.
---
 .../sources/helpers/DatePartitionPathSelector.java | 30 +++----
 .../helpers/TestDatePartitionPathSelector.java     | 38 ++----
 2 files changed, 41 insertions(+), 27 deletions(-)

diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DatePartitionPathSelector.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DatePartitionPathSelector.java
index 2cedb6c..c22657f 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DatePartitionPathSelector.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DatePartitionPathSelector.java
@@ -35,13 +35,16 @@
 import org.apache.log4j.Logger;
 import org.apache.spark.api.java.JavaSparkContext;

 import java.time.LocalDate;
+import java.time.format.DateTimeFormatter;
 import java.util.ArrayList;
 import java.util.Comparator;
 import java.util.List;
 import java.util.stream.Collectors;

 import static org.apache.hudi.utilities.sources.helpers.DFSPathSelector.Config.ROOT_INPUT_PATH_PROP;
+import static org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.Config.DATE_FORMAT;
 import static org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.Config.DATE_PARTITION_DEPTH;
+import static org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.Config.DEFAULT_DATE_FORMAT;
 import static org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.Config.DEFAULT_DATE_PARTITION_DEPTH;
 import static org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.Config.DEFAULT_LOOKBACK_DAYS;
 import static org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.Config.DEFAULT_PARTITIONS_LIST_PARALLELISM;
@@ -59,12 +62,16 @@
  * The date based partition is expected to be of the format '=yyyy-mm-dd' or
  * 'yyyy-mm-dd'. The date partition can be at any level. For ex. the partition path can be of the
  * form `` or
- * `/`
+ * `/`.
+ *
+ * The date based partition format can be configured via this property
+ * hoodie.deltastreamer.source.dfs.datepartitioned.date.format
  */
 public class DatePartitionPathSelector extends DFSPathSelector {

   private static volatile Logger LOG = LogManager.getLogger(DatePartitionPathSelector.class);

+  private final String dateFormat;
   private final int datePartitionDepth;
   private final int numPrevDaysToList;
   private final LocalDate fromDate;
@@ -73,6 +80,9 @@
   /** Configs supported. */
   public static class Config {
+    public static final String DATE_FORMAT = "hoodie.deltastreamer.source.dfs.datepartitioned.date.format";
+    public static final String DEFAULT_DATE_FORMAT = "yyyy-MM-dd";
+
     public static final String DATE_PARTITION_DEPTH = "hoodie.deltastreamer.source.dfs.datepartitioned.selector.depth";
     public static final int DEFAULT_DATE_PARTITION_DEPTH = 0; // Implies no (date) partition
@@ -84,7 +94,6 @@
     public static final String CURRENT_DATE = "hoodie.deltastreamer.source.dfs.datepartitioned.selector.currentdate";
-
     public static final String PARTITIONS_LIST_PARALLELISM = "hoodie.deltastreamer.source.dfs.datepartitioned.selector.parallelism";
     public static final int DEFAULT_PARTITIONS_LIST_PARALLELISM = 20;
@@ -96,6 +105,7 @@
      * datePartitionDepth = 0 is same as basepath and there is no partition. In which case
      * this path selector would be a no-op and lists all paths under the table basepath.
      */
+    dateFormat = props.getString(DATE_FORMAT, DEFAULT_DATE_FORMAT);
     datePartitionDepth = props.getInteger(DATE_PARTITION_DEPTH, DEFAULT_DATE_PARTITION_DEPTH);
     // If not specified the
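The two fixes in this commit can be sketched independently of the Hudi classes (the names below are illustrative, not the actual DatePartitionPathSelector API): parsing a partition string with a caller-supplied pattern instead of a hard-coded one, and sorting into a new list because the incoming list may be unmodifiable, in which case an in-place sort throws UnsupportedOperationException.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class DatePartitionSketch {
    // Parse a date-based partition with a configurable pattern, mirroring the
    // intent of the new hoodie.deltastreamer.source.dfs.datepartitioned.date.format config.
    static LocalDate parsePartition(String partition, String pattern) {
        return LocalDate.parse(partition, DateTimeFormatter.ofPattern(pattern));
    }

    // Sort into a new list rather than in place: an unmodifiable list
    // (e.g. List.of(...)) throws UnsupportedOperationException on sort().
    static List<String> sortedCopy(List<String> eligibleFiles) {
        return eligibleFiles.stream()
            .sorted(Comparator.naturalOrder())
            .collect(Collectors.toList());
    }
}
```

With this shape, a partition laid out as `2021/03/04` can be handled simply by passing the pattern `yyyy/MM/dd`, and the caller never mutates the list it was given.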
[GitHub] [hudi] xushiyan merged pull request #2621: [HUDI-1655] Support custom date format and fix unsupported exception in DatePartitionPathSelector
xushiyan merged pull request #2621: URL: https://github.com/apache/hudi/pull/2621
[GitHub] [hudi] codecov-io edited a comment on pull request #2625: [1568] Fixing spark3 bundles
codecov-io edited a comment on pull request #2625: URL: https://github.com/apache/hudi/pull/2625#issuecomment-789780238

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=h1) Report
> Merging [#2625](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=desc) (369d44f) into [master](https://codecov.io/gh/apache/hudi/commit/527175ab0be59d15f035a09a0466bfeca9abfb23?el=desc) (527175a) will **decrease** coverage by `4.81%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2625/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=tree)

```diff
@@             Coverage Diff              @@
##             master    #2625      +/-   ##
============================================
- Coverage     50.94%   46.13%   -4.82%
- Complexity     3169     3171      +2
============================================
  Files           433      462     +29
  Lines         19814    21782   +1968
  Branches       2034     2315    +281
============================================
- Hits          10095    10049     -46
- Misses         8900    10883   +1983
- Partials        819      850     +31
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `36.87% <ø> (-0.04%)` | `0.00 <ø> (ø)` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `51.45% <ø> (+0.01%)` | `0.00 <ø> (ø)` | |
| hudiflink | `50.31% <ø> (+7.09%)` | `0.00 <ø> (ø)` | |
| hudihadoopmr | `33.48% <ø> (+0.32%)` | `0.00 <ø> (ø)` | |
| hudisparkdatasource | `69.67% <ø> (-0.07%)` | `0.00 <ø> (ø)` | |
| hudisync | `49.62% <ø> (+1.00%)` | `0.00 <ø> (ø)` | |
| huditimelineservice | `64.36% <ø> (-2.13%)` | `0.00 <ø> (ø)` | |
| hudiutilities | `9.61% <ø> (-59.83%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | | |
[GitHub] [hudi] codecov-io edited a comment on pull request #2625: [1568] Fixing spark3 bundles
codecov-io edited a comment on pull request #2625: URL: https://github.com/apache/hudi/pull/2625#issuecomment-789780238

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=h1) Report
> Merging [#2625](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=desc) (369d44f) into [master](https://codecov.io/gh/apache/hudi/commit/527175ab0be59d15f035a09a0466bfeca9abfb23?el=desc) (527175a) will **decrease** coverage by `6.62%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2625/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=tree)

```diff
@@             Coverage Diff              @@
##             master    #2625      +/-   ##
============================================
- Coverage     50.94%   44.32%   -6.63%
+ Complexity     3169     2976    -193
============================================
  Files           433      426      -7
  Lines         19814    20229    +415
  Branches       2034     2114     +80
============================================
- Hits          10095     8967   -1128
- Misses         8900    10525   +1625
+ Partials        819      737     -82
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `36.87% <ø> (-0.04%)` | `0.00 <ø> (ø)` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `51.45% <ø> (+0.01%)` | `0.00 <ø> (ø)` | |
| hudiflink | `50.31% <ø> (+7.09%)` | `0.00 <ø> (ø)` | |
| hudihadoopmr | `33.48% <ø> (+0.32%)` | `0.00 <ø> (ø)` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `49.62% <ø> (+1.00%)` | `0.00 <ø> (ø)` | |
| huditimelineservice | `64.36% <ø> (-2.13%)` | `0.00 <ø> (ø)` | |
| hudiutilities | `9.61% <ø> (-59.83%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | | |
[GitHub] [hudi] codecov-io edited a comment on pull request #2625: [1568] Fixing spark3 bundles
codecov-io edited a comment on pull request #2625: URL: https://github.com/apache/hudi/pull/2625#issuecomment-789780238 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=h1) Report > Merging [#2625](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=desc) (369d44f) into [master](https://codecov.io/gh/apache/hudi/commit/527175ab0be59d15f035a09a0466bfeca9abfb23?el=desc) (527175a) will **decrease** coverage by `7.40%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2625/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=tree) ```diff @@ Coverage Diff @@ ## master #2625 +/- ## - Coverage 50.94% 43.54% -7.41% + Complexity 3169 2786 -383 Files 433 405 -28 Lines 19814 18718 -1096 Branches 2034 1969 -65 - Hits 10095 8151 -1944 - Misses 8900 9905 +1005 + Partials 819 662 -157 ``` | Flag | Coverage Δ | Complexity Δ | | |---|---|---|---| | hudicli | `36.87% <ø> (-0.04%)` | `0.00 <ø> (ø)` | | | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudicommon | `51.45% <ø> (+0.01%)` | `0.00 <ø> (ø)` | | | hudiflink | `50.31% <ø> (+7.09%)` | `0.00 <ø> (ø)` | | | hudihadoopmr | `33.48% <ø> (+0.32%)` | `0.00 <ø> (ø)` | | | hudisparkdatasource | `?` | `?` | | | hudisync | `?` | `?` | | | huditimelineservice | `?` | `?` | | | hudiutilities | `9.61% <ø> (-59.83%)` | `0.00 <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2625?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2625/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | | |
[GitHub] [hudi] nsivabalan edited a comment on pull request #2625: [1568] Fixing spark3 bundles
nsivabalan edited a comment on pull request #2625: URL: https://github.com/apache/hudi/pull/2625#issuecomment-789746642 CC @vinothchandar @garyli1019 @bvaradar This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan removed a comment on pull request #2625: [1568] Adding hudi-spark3-bundle
nsivabalan removed a comment on pull request #2625: URL: https://github.com/apache/hudi/pull/2625#issuecomment-789746287 @umehrot2 @zhedoubushishi : Can you folks help me out here? I see we have added support for spark3 [here](https://github.com/apache/hudi/pull/2208). Did we test the bundle? I do see both spark2 and spark3 are included in the pom of hudi-spark-bundle, but how do I generate the jars for spark3? ``` org.apache.hudi:hudi-spark_${scala.binary.version} org.apache.hudi:hudi-spark2_${scala.binary.version} org.apache.hudi:hudi-spark3_2.12 ``` Because if I run the command below, all I see is the regular spark-bundle, which fails for MOR. So I guess it's not spark3. ``` mvn clean package -DskipTests -Dscala-2.12 -Dspark3 ``` With this patch, I have added a new bundle altogether and it works. But I wanted to see if I am missing something here.
[jira] [Created] (HUDI-1666) Refactor BaseCleanActionExecutor to return List
Nishith Agarwal created HUDI-1666: - Summary: Refactor BaseCleanActionExecutor to return List Key: HUDI-1666 URL: https://issues.apache.org/jira/browse/HUDI-1666 Project: Apache Hudi Issue Type: Improvement Components: Cleaner Reporter: Nishith Agarwal Assignee: Nishith Agarwal -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] xiarixiaoyao edited a comment on pull request #2634: [HUDI-1662] Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
xiarixiaoyao edited a comment on pull request #2634: URL: https://github.com/apache/hudi/pull/2634#issuecomment-791119070 cc @garyli1019 , could you take a look
[GitHub] [hudi] xiarixiaoyao edited a comment on pull request #2634: [HUDI-1662] Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
xiarixiaoyao edited a comment on pull request #2634: URL: https://github.com/apache/hudi/pull/2634#issuecomment-791119070
[GitHub] [hudi] xiarixiaoyao commented on pull request #2634: [HUDI-1662] Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
xiarixiaoyao commented on pull request #2634: URL: https://github.com/apache/hudi/pull/2634#issuecomment-791119070 cc @garyli1019
[jira] [Created] (HUDI-1665) Remove autoCommit option from BaseCommitActionExecutor
Nishith Agarwal created HUDI-1665: - Summary: Remove autoCommit option from BaseCommitActionExecutor Key: HUDI-1665 URL: https://issues.apache.org/jira/browse/HUDI-1665 Project: Apache Hudi Issue Type: Bug Components: Writer Core Reporter: Nishith Agarwal Assignee: Nishith Agarwal Currently, the option to autoCommit inside the action-executor framework creates 2 different code paths for committing data in Hudi: a) from AbstractHoodieWriteClient, b) from BaseCommitActionExecutor. We should refactor and remove the commit code path from the ActionExecutor.
[jira] [Assigned] (HUDI-1486) Remove pending rollback and move to cleaner
[ https://issues.apache.org/jira/browse/HUDI-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal reassigned HUDI-1486: - Assignee: Nishith Agarwal > Remove pending rollback and move to cleaner > --- > > Key: HUDI-1486 > URL: https://issues.apache.org/jira/browse/HUDI-1486 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Labels: pull-request-available > Fix For: 0.8.0
[jira] [Updated] (HUDI-1662) Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
[ https://issues.apache.org/jira/browse/HUDI-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1662: - Labels: pull-request-available (was: ) > Failed to query real-time view use hive/spark-sql when hudi mor table > contains dateType > > > Key: HUDI-1662 > URL: https://issues.apache.org/jira/browse/HUDI-1662 > Project: Apache Hudi > Issue Type: Bug > Components: Hive Integration >Affects Versions: 0.7.0 > Environment: hive 3.1.1 > spark 2.4.5 > hadoop 3.1.1 > suse os >Reporter: tao meng >Priority: Major > Labels: pull-request-available > Original Estimate: 24h > Remaining Estimate: 24h > > step1: prepare a raw DataFrame with DateType, and insert it into the Hudi MOR table > df_raw.withColumn("date", lit(Date.valueOf("2020-11-10"))) > merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g") > step2: prepare an update DataFrame with DateType, and upsert it into the Hudi MOR table > df_update = sql("select * from > huditest.bulkinsert_mor_10g_rt").withColumn("date", > lit(Date.valueOf("2020-11-11"))) > merge(df_update, "upsert", "huditest.bulkinsert_mor_10g") > > step3: query the mor_rt table with hive-beeline / spark-sql > use beeline/spark-sql to execute the statement select * from > huditest.bulkinsert_mor_10g_rt where primary_key = 1000; > then the following error occurs: > _java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be > cast to org.apache.hadoop.hive.serde2.io.DateWritableV2_ > > > Root cause analysis: > Hudi uses the Avro format to store log files, and Avro stores DateType as INT (the > type is INT but the logicalType is date). > When Hudi reads a log file and converts an Avro INT record to a > Writable, the logicalType is not respected, so the DateType is cast to > IntWritable. > See: > [https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java#L169] > > Modification plan: when casting an Avro INT type to a Writable, the logicalType must > be considered > case INT:
> if (schema.getLogicalType() != null && > schema.getLogicalType().getName().equals("date")) { > return new DateWritable((Integer) value); > } else { > return new IntWritable((Integer) value); > }
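The fix proposed in the ticket hinges on Avro's date logical type being an int that counts days since the Unix epoch. Below is a minimal, self-contained sketch of that dispatch; `intToWritable`, the plain `String` logical-type name, and `java.time.LocalDate` (standing in for Hadoop's `DateWritable`) are illustrative stand-ins so the snippet runs without Avro or Hadoop on the classpath — the real patch branches on `schema.getLogicalType()` inside `HoodieRealtimeRecordReaderUtils`.

```java
import java.time.LocalDate;

public class AvroDateDemo {
    // Hypothetical stand-in for the added branch: an Avro INT whose logicalType
    // is "date" counts days since 1970-01-01 and must become a date value,
    // not a plain int.
    static Object intToWritable(int value, String logicalTypeName) {
        if ("date".equals(logicalTypeName)) {
            // the real code would return new DateWritable(value)
            return LocalDate.ofEpochDay(value);
        }
        // the real code would return new IntWritable(value)
        return value;
    }

    public static void main(String[] args) {
        // 18576 days after the epoch is 2020-11-10, the date used in the report
        System.out.println(intToWritable(18576, "date"));
        System.out.println(intToWritable(18576, null));
    }
}
```

Without the logical-type check, both calls would come back as a bare int, which is exactly the `IntWritable` vs `DateWritableV2` mismatch the ClassCastException reports.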
[jira] [Updated] (HUDI-1486) Remove pending rollback and move to cleaner
[ https://issues.apache.org/jira/browse/HUDI-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1486: -- Status: Patch Available (was: In Progress) > Remove pending rollback and move to cleaner > --- > > Key: HUDI-1486 > URL: https://issues.apache.org/jira/browse/HUDI-1486 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Labels: pull-request-available > Fix For: 0.8.0
[jira] [Updated] (HUDI-1486) Remove pending rollback and move to cleaner
[ https://issues.apache.org/jira/browse/HUDI-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1486: -- Status: Open (was: New) > Remove pending rollback and move to cleaner > --- > > Key: HUDI-1486 > URL: https://issues.apache.org/jira/browse/HUDI-1486 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Labels: pull-request-available > Fix For: 0.8.0
[jira] [Updated] (HUDI-1486) Remove pending rollback and move to cleaner
[ https://issues.apache.org/jira/browse/HUDI-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1486: -- Status: In Progress (was: Open) > Remove pending rollback and move to cleaner > --- > > Key: HUDI-1486 > URL: https://issues.apache.org/jira/browse/HUDI-1486 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Labels: pull-request-available > Fix For: 0.8.0
[GitHub] [hudi] xiarixiaoyao opened a new pull request #2634: [HUDI-1662] Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
xiarixiaoyao opened a new pull request #2634: URL: https://github.com/apache/hudi/pull/2634 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull request Fixes the bug that querying the real-time view with hive/spark-sql fails when a hudi mor table contains dateType. test steps: step1: prepare a raw DataFrame with DateType, and insert it into the Hudi MOR table df_raw.withColumn("date", lit(Date.valueOf("2020-11-10"))) merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g") step2: prepare an update DataFrame with DateType, and upsert it into the Hudi MOR table df_update = sql("select * from huditest.bulkinsert_mor_10g_rt").withColumn("date", lit(Date.valueOf("2020-11-11"))) merge(df_update, "upsert", "huditest.bulkinsert_mor_10g") step3: query the mor_rt table with hive-beeline / spark-sql use beeline/spark-sql to execute the statement select * from huditest.bulkinsert_mor_10g_rt where primary_key = 1000; then the following error occurs: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.hive.serde2.io.DateWritableV2 ## Brief change log When hudi reads a log file and converts an Avro INT record to a Writable, the logicalType is not respected, so the DateType is cast to IntWritable. Therefore, when casting an Avro INT type to a Writable, the logicalType must be considered. ## Verify this pull request Existing UT tests ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[jira] [Updated] (HUDI-1655) Support custom date format and fix unsupported exception in DatePartitionPathSelector
[ https://issues.apache.org/jira/browse/HUDI-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1655: - Description: Add a config to allow parsing custom date format in {{DatePartitionPathSelector}}. Currently it assumes date partition string in the format of {{yyyy-MM-dd}}. Also eligibleFiles.sort() throws this exception {quote}java.lang.UnsupportedOperationException at java.util.AbstractList.set(AbstractList.java:132) at java.util.AbstractList$ListItr.set(AbstractList.java:426) at java.util.List.sort(List.java:482) at org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.getNextFilePathsAndMaxModificationTime(DatePartitionPathSelector.java:141) at org.apache.hudi.utilities.sources.ParquetDFSSource.fetchNextBatch(ParquetDFSSource.java:48) at org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43) at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:75) at org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:94) at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:338) at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:255) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:587) {quote} {{org.apache.hudi.client.common.HoodieSparkEngineContext#flatMap}} returns a list that can't be sorted in-place. was: Add a config to allow parsing custom date format in {{DatePartitionPathSelector}}. Currently it assumes date partition string in the format of {{yyyy-MM-dd}}.
Also eligibleFiles.sort() throws this exception {quote}java.lang.UnsupportedOperationException at java.util.AbstractList.set(AbstractList.java:132) at java.util.AbstractList$ListItr.set(AbstractList.java:426) at java.util.List.sort(List.java:482) at org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.getNextFilePathsAndMaxModificationTime(DatePartitionPathSelector.java:141) at org.apache.hudi.utilities.sources.ParquetDFSSource.fetchNextBatch(ParquetDFSSource.java:48) at org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43) at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:75) at org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:94) at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:338) at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:255) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:587) {quote} > Support custom date format and fix unsupported exception in > DatePartitionPathSelector > - > > Key: HUDI-1655 > URL: https://issues.apache.org/jira/browse/HUDI-1655 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Minor > Labels: pull-request-available > Fix For: 0.8.0 > > > Add a config to allow parsing custom date format in > {{DatePartitionPathSelector}}. Currently it assumes date partition string in > the format of {{yyyy-MM-dd}}.
> > Also eligibleFiles.sort() throws this exception > {quote}java.lang.UnsupportedOperationException at > java.util.AbstractList.set(AbstractList.java:132) at > java.util.AbstractList$ListItr.set(AbstractList.java:426) at > java.util.List.sort(List.java:482) at > org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.getNextFilePathsAndMaxModificationTime(DatePartitionPathSelector.java:141) > at > org.apache.hudi.utilities.sources.ParquetDFSSource.fetchNextBatch(ParquetDFSSource.java:48) > at > org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43) > at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:75) at > org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:94) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:338) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:255) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:587) > {quote} > > {{org.apache.hudi.client.common.HoodieSparkEngineContext#flatMap}} returns a > list that can't be sorted in-place.
[jira] [Updated] (HUDI-1655) Support custom date format and fix unsupported exception in DatePartitionPathSelector
[ https://issues.apache.org/jira/browse/HUDI-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1655: - Summary: Support custom date format and fix unsupported exception in DatePartitionPathSelector (was: Allow custom date format in DatePartitionPathSelector) > Support custom date format and fix unsupported exception in > DatePartitionPathSelector > - > > Key: HUDI-1655 > URL: https://issues.apache.org/jira/browse/HUDI-1655 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Minor > Labels: pull-request-available > Fix For: 0.8.0 > > > Add a config to allow parsing custom date format in > {{DatePartitionPathSelector}}. Currently it assumes date partition string in > the format of {{yyyy-MM-dd}}. > > Also eligibleFiles.sort() throws this exception > {quote}java.lang.UnsupportedOperationException at > java.util.AbstractList.set(AbstractList.java:132) at > java.util.AbstractList$ListItr.set(AbstractList.java:426) at > java.util.List.sort(List.java:482) at > org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.getNextFilePathsAndMaxModificationTime(DatePartitionPathSelector.java:141) > at > org.apache.hudi.utilities.sources.ParquetDFSSource.fetchNextBatch(ParquetDFSSource.java:48) > at > org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43) > at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:75) at > org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:94) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:338) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:255) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:587) > {quote}
[GitHub] [hudi] xushiyan commented on pull request #2621: [HUDI-1655] Support custom date format and fix unsupported exception in DatePartitionPathSelector
xushiyan commented on pull request #2621: URL: https://github.com/apache/hudi/pull/2621#issuecomment-791115825 > @xushiyan It would be also good to change the title of the PR to add more information about fixing the bug? @yanghua ok done. Pls check. Thanks.
[jira] [Updated] (HUDI-1655) Allow custom date format in DatePartitionPathSelector
[ https://issues.apache.org/jira/browse/HUDI-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-1655: - Description: Add a config to allow parsing custom date format in {{DatePartitionPathSelector}}. Currently it assumes date partition string in the format of {{yyyy-MM-dd}}. Also eligibleFiles.sort() throws this exception {quote}java.lang.UnsupportedOperationException at java.util.AbstractList.set(AbstractList.java:132) at java.util.AbstractList$ListItr.set(AbstractList.java:426) at java.util.List.sort(List.java:482) at org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.getNextFilePathsAndMaxModificationTime(DatePartitionPathSelector.java:141) at org.apache.hudi.utilities.sources.ParquetDFSSource.fetchNextBatch(ParquetDFSSource.java:48) at org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43) at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:75) at org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:94) at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:338) at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:255) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:587) {quote} was:Add a config to allow parsing custom date format in {{DatePartitionPathSelector}}. Currently it assumes date partition string in the format of {{yyyy-MM-dd}}. > Allow custom date format in DatePartitionPathSelector > - > > Key: HUDI-1655 > URL: https://issues.apache.org/jira/browse/HUDI-1655 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Minor > Labels: pull-request-available > Fix For: 0.8.0 > > > Add a config to allow parsing custom date format in > {{DatePartitionPathSelector}}.
Currently it assumes date partition string in > the format of {{yyyy-MM-dd}}. > > Also eligibleFiles.sort() throws this exception > {quote}java.lang.UnsupportedOperationException at > java.util.AbstractList.set(AbstractList.java:132) at > java.util.AbstractList$ListItr.set(AbstractList.java:426) at > java.util.List.sort(List.java:482) at > org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.getNextFilePathsAndMaxModificationTime(DatePartitionPathSelector.java:141) > at > org.apache.hudi.utilities.sources.ParquetDFSSource.fetchNextBatch(ParquetDFSSource.java:48) > at > org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43) > at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:75) at > org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:94) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:338) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:255) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:587) > {quote}
[GitHub] [hudi] yanghua commented on pull request #2621: [HUDI-1655] Allow custom date format in DatePartitionPathSelector
yanghua commented on pull request #2621: URL: https://github.com/apache/hudi/pull/2621#issuecomment-791113338 @xushiyan It would be also good to change the title of the PR to add more information about fixing the bug?
[GitHub] [hudi] yanghua commented on a change in pull request #2621: [HUDI-1655] Allow custom date format in DatePartitionPathSelector
yanghua commented on a change in pull request #2621: URL: https://github.com/apache/hudi/pull/2621#discussion_r587994448 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DatePartitionPathSelector.java ## @@ -130,20 +140,19 @@ public DatePartitionPathSelector(TypedProperties props, Configuration hadoopConf FileSystem fs = new Path(path).getFileSystem(serializedConf.get()); return listEligibleFiles(fs, new Path(path), lastCheckpointTime).stream(); }, partitionsListParallelism); -// sort them by modification time. - eligibleFiles.sort(Comparator.comparingLong(FileStatus::getModificationTime)); +// sort them by modification time ascending. +List<FileStatus> sortedEligibleFiles = eligibleFiles.stream() + .sorted(Comparator.comparingLong(FileStatus::getModificationTime)).collect(Collectors.toList()); Review comment: OK, would you please add it into the ticket's description for the archiving purpose?
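The diff under review swaps an in-place `List.sort` for a stream-based sort because, per the ticket, the list returned by the engine context does not support element mutation. A small self-contained sketch of both the failure mode and the workaround follows; the unmodifiable list of longs is a hypothetical stand-in for the `FileStatus` list that `HoodieSparkEngineContext#flatMap` returns.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class SortDemo {
    public static void main(String[] args) {
        // stand-in for the eligible files' modification times
        List<Long> times = Collections.unmodifiableList(Arrays.asList(3L, 1L, 2L));
        try {
            // in-place sort mutates the backing list, which rejects writes
            times.sort(Comparator.naturalOrder());
        } catch (UnsupportedOperationException e) {
            System.out.println("in-place sort failed: " + e.getClass().getSimpleName());
        }
        // the fix: sort into a fresh list via a stream, leaving the original untouched
        List<Long> sorted = times.stream().sorted().collect(Collectors.toList());
        System.out.println(sorted); // [1, 2, 3]
    }
}
```

Collecting into a new `ArrayList` via `Collectors.toList()` sidesteps the mutation entirely, which is why the patched code assigns the result to a new `sortedEligibleFiles` variable instead of reusing `eligibleFiles`.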
[GitHub] [hudi] root18039532923 commented on issue #2614: Caused by: com.fasterxml.jackson.core.JsonParseException: Unrecognized token
root18039532923 commented on issue #2614: URL: https://github.com/apache/hudi/issues/2614#issuecomment-791105284 I need the jar of the patch which you opened in order to test, but I am on an internal network. @satishkotha
[jira] [Updated] (HUDI-1662) Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
[ https://issues.apache.org/jira/browse/HUDI-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1662:

Description:

step 1: prepare a raw DataFrame with a DateType column and insert it into the Hudi MOR table

```
df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
```

step 2: prepare an update DataFrame with a DateType column and upsert it into the table

```
df_update = sql("select * from huditest.bulkinsert_mor_10g_rt").withColumn("date", lit(Date.valueOf("2020-11-11")))
merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
```

step 3: query the real-time (_rt) view with hive-beeline / spark-sql:

```
select * from huditest.bulkinsert_mor_10g_rt where primary_key = 1000;
```

The following error occurs:

_java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.hive.serde2.io.DateWritableV2_

Root cause analysis: Hudi stores log files in the Avro format, and Avro stores DateType as INT (the type is INT, but the logicalType is "date"). When Hudi reads a log file and converts an Avro INT record to a Writable, the logicalType is not respected, so the date value is cast to an IntWritable. See: [https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java#L169]

Modification plan: when casting an Avro INT to a Writable, the logicalType must be considered:

```java
case INT:
  if (schema.getLogicalType() != null && schema.getLogicalType().getName().equals("date")) {
    return new DateWritable((Integer) value);
  } else {
    return new IntWritable((Integer) value);
  }
```

was: (the previous revision of the description, which compared the logical type name with `==` instead of `.equals`)

> Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
> ----
> Key: HUDI-1662
> URL: https://issues.apache.org/jira/browse/HUDI-1662
> Project: Apache Hudi
> Issue Type: Bug
> Components: Hive Integration
> Affects Versions: 0.7.0
> Environment: hive 3.1.1, spark 2.4.5, hadoop 3.1.1, SUSE OS
> Reporter: tao meng
> Priority: Major
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> (description as above)
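As background for the root-cause analysis above: Avro's `date` logical type encodes a value as an `int` counting days since the Unix epoch (1970-01-01), which is also the days-since-epoch representation that Hive's `DateWritable` wraps. A minimal, dependency-free Java sketch of that decoding (the epoch-day value below is an illustrative assumption, chosen to match the `2020-11-10` date from the reproduction steps; no Hudi or Avro classes are used):

```java
import java.time.LocalDate;

public class AvroDateLogicalTypeDemo {
    public static void main(String[] args) {
        // Avro serializes a date-typed field as days since 1970-01-01.
        int rawAvroInt = 18576;

        // Treating the raw int as a plain integer (what an IntWritable exposes)
        // loses the date meaning; decoding it as an epoch day recovers it.
        LocalDate decoded = LocalDate.ofEpochDay(rawAvroInt);

        System.out.println(rawAvroInt + " -> " + decoded); // prints "18576 -> 2020-11-10"
    }
}
```

This is why the proposed fix must branch on the logical type: the same `int` is either a count or a calendar date depending on metadata that lives in the schema, not in the value.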
[jira] [Updated] (HUDI-1662) Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
[ https://issues.apache.org/jira/browse/HUDI-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1662:

Description:

step 1: prepare a raw DataFrame with a DateType column and insert it into the Hudi MOR table

```
df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
```

step 2: prepare an update DataFrame with a DateType column and upsert it into the table

```
df_update = sql("select * from huditest.bulkinsert_mor_10g_rt").withColumn("date", lit(Date.valueOf("2020-11-11")))
merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
```

step 3: query the real-time (_rt) view with hive-beeline / spark-sql:

```
select * from huditest.bulkinsert_mor_10g_rt where primary_key = 1000;
```

The following error occurs:

_java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.hive.serde2.io.DateWritableV2_

Root cause analysis: Hudi stores log files in the Avro format, and Avro stores DateType as INT (the type is INT, but the logicalType is "date"). When Hudi reads a log file and converts an Avro INT record to a Writable, the logicalType is not respected, so the date value is cast to an IntWritable. See: [https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java#L169]

Modification plan: when casting an Avro INT to a Writable, the logicalType must be considered:

```java
if (schema.getLogicalType() != null && schema.getLogicalType().getName() == "date") {
  return new DateWritable((Integer) value);
} else {
  return new IntWritable((Integer) value);
}
```

was: (the previous revision of the description contained only the reproduction steps and the attached screenshot !image-2021-03-05-10-06-11-949.png!)

> Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
> ----
> Key: HUDI-1662
> URL: https://issues.apache.org/jira/browse/HUDI-1662
> Project: Apache Hudi
> Issue Type: Bug
> Components: Hive Integration
> Affects Versions: 0.7.0
> Environment: hive 3.1.1, spark 2.4.5, hadoop 3.1.1, SUSE OS
> Reporter: tao meng
> Priority: Major
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> (description as above)

-- This message was sent by Atlassian Jira (v8.3.4#803005)
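One detail worth flagging in the modification plan above: this revision compares the logical type name with `==` (a later revision of the ticket switched to `.equals`). In Java, `==` on `String` compares object references, not contents, so it can return false for a runtime-constructed name such as the one returned by `getLogicalType().getName()`. A dependency-free illustration (the `StringBuilder` construction just stands in for any non-interned string):

```java
public class StringCompareDemo {
    public static void main(String[] args) {
        // A name built at runtime is a distinct object from the literal "date",
        // even though the character contents are equal.
        String logicalTypeName = new StringBuilder("da").append("te").toString();

        System.out.println(logicalTypeName == "date");      // prints "false": compares references
        System.out.println(logicalTypeName.equals("date")); // prints "true": compares contents
    }
}
```

With `==`, the date branch could silently never be taken and the original ClassCastException would persist, which is presumably why the description was later updated to use `.equals("date")`.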
[jira] [Closed] (HUDI-1664) Streaming read for Flink MOR table
[ https://issues.apache.org/jira/browse/HUDI-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-1664. Resolution: Duplicate

> Streaming read for Flink MOR table
> ----
> Key: HUDI-1664
> URL: https://issues.apache.org/jira/browse/HUDI-1664
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Flink Integration
> Reporter: Danny Chen
> Assignee: Danny Chen
> Priority: Major
> Fix For: 0.8.0
>
> Supports reading Flink MOR tables as a stream. The writer writes Avro logs, and the reader monitors the latest commits in the timeline and assigns split reading tasks for the incremental logs.
[jira] [Created] (HUDI-1663) Streaming read for Flink MOR table
Danny Chen created HUDI-1663:

Summary: Streaming read for Flink MOR table
Key: HUDI-1663
URL: https://issues.apache.org/jira/browse/HUDI-1663
Project: Apache Hudi
Issue Type: Sub-task
Components: Flink Integration
Reporter: Danny Chen
Assignee: Danny Chen
Fix For: 0.8.0

Supports reading Flink MOR tables as a stream. The writer writes Avro logs, and the reader monitors the latest commits in the timeline and assigns split reading tasks for the incremental logs.
[jira] [Created] (HUDI-1664) Streaming read for Flink MOR table
Danny Chen created HUDI-1664:

Summary: Streaming read for Flink MOR table
Key: HUDI-1664
URL: https://issues.apache.org/jira/browse/HUDI-1664
Project: Apache Hudi
Issue Type: Sub-task
Components: Flink Integration
Reporter: Danny Chen
Assignee: Danny Chen
Fix For: 0.8.0

Supports reading Flink MOR tables as a stream. The writer writes Avro logs, and the reader monitors the latest commits in the timeline and assigns split reading tasks for the incremental logs.
[jira] [Updated] (HUDI-1662) Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
[ https://issues.apache.org/jira/browse/HUDI-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1662:

Description:

step 1: prepare a raw DataFrame with a DateType column and insert it into the Hudi MOR table

```
df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
```

step 2: prepare an update DataFrame with a DateType column and upsert it into the table

```
df_update = sql("select * from huditest.bulkinsert_mor_10g_rt").withColumn("date", lit(Date.valueOf("2020-11-11")))
merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
```

step 3: query the mor _rt table with hive-beeline / spark-sql

!image-2021-03-05-10-06-11-949.png!

was: (the previous revision was identical except that the screenshot was a broken local `file:///` image link)

> Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
> ----
> Key: HUDI-1662
> URL: https://issues.apache.org/jira/browse/HUDI-1662
> Project: Apache Hudi
> Issue Type: Bug
> Components: Hive Integration
> Affects Versions: 0.7.0
> Environment: hive 3.1.1, spark 2.4.5, hadoop 3.1.1, SUSE OS
> Reporter: tao meng
> Priority: Major
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> (description as above)
[jira] [Created] (HUDI-1662) Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
tao meng created HUDI-1662:

Summary: Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
Key: HUDI-1662
URL: https://issues.apache.org/jira/browse/HUDI-1662
Project: Apache Hudi
Issue Type: Bug
Components: Hive Integration
Affects Versions: 0.7.0
Environment: hive 3.1.1, spark 2.4.5, hadoop 3.1.1, SUSE OS
Reporter: tao meng

step 1: prepare a raw DataFrame with a DateType column and insert it into the Hudi MOR table

```
df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
```

step 2: prepare an update DataFrame with a DateType column and upsert it into the table

```
df_update = sql("select * from huditest.bulkinsert_mor_10g_rt").withColumn("date", lit(Date.valueOf("2020-11-11")))
merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
```

step 3: query the mor _rt table with hive-beeline / spark-sql

(screenshot attachment: broken local `file:///` image link in the original message)
[hudi] branch master updated (89003bc -> 7cc75e0)
This is an automated email from the ASF dual-hosted git repository. nagarwal pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git.

from 89003bc [HUDI-1647] Supports snapshot read for Flink (#2613)
add  7cc75e0 [HUDI-1646] Provide mechanism to read uncommitted data through InputFormat (#2611)

No new revisions were added by this update.

Summary of changes:
```
 .../table/timeline/HoodieDefaultTimeline.java      |  7
 .../hudi/common/table/timeline/HoodieTimeline.java |  5 +++
 .../common/table/view/FileSystemViewManager.java   | 17 ++---
 .../apache/hudi/hadoop/utils/HoodieHiveUtils.java  | 22
 .../hudi/hadoop/utils/HoodieInputFormatUtils.java  |  3 +-
 .../hudi/hadoop/TestHoodieParquetInputFormat.java  | 41 ++
 .../hudi/hadoop/testutils/InputFormatTestUtil.java | 20 +++
 7 files changed, 109 insertions(+), 6 deletions(-)
```
[GitHub] [hudi] n3nash merged pull request #2611: [HUDI-1646] Provide mechanism to read uncommitted data through InputFormat
n3nash merged pull request #2611: URL: https://github.com/apache/hudi/pull/2611
[GitHub] [hudi] xushiyan commented on a change in pull request #2621: [HUDI-1655] Allow custom date format in DatePartitionPathSelector
xushiyan commented on a change in pull request #2621: URL: https://github.com/apache/hudi/pull/2621#discussion_r587949226

File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DatePartitionPathSelector.java

```diff
@@ -130,20 +140,19 @@ public DatePartitionPathSelector(TypedProperties props, Configuration hadoopConf
       FileSystem fs = new Path(path).getFileSystem(serializedConf.get());
       return listEligibleFiles(fs, new Path(path), lastCheckpointTime).stream();
     }, partitionsListParallelism);
-    // sort them by modification time.
-    eligibleFiles.sort(Comparator.comparingLong(FileStatus::getModificationTime));
+    // sort them by modification time ascending.
+    List<FileStatus> sortedEligibleFiles = eligibleFiles.stream()
+        .sorted(Comparator.comparingLong(FileStatus::getModificationTime)).collect(Collectors.toList());
```

Review comment: @yanghua yes, it resulted in this error:

```
java.lang.UnsupportedOperationException
	at java.util.AbstractList.set(AbstractList.java:132)
	at java.util.AbstractList$ListItr.set(AbstractList.java:426)
	at java.util.List.sort(List.java:482)
	at org.apache.hudi.utilities.sources.helpers.DatePartitionPathSelector.getNextFilePathsAndMaxModificationTime(DatePartitionPathSelector.java:141)
	at org.apache.hudi.utilities.sources.ParquetDFSSource.fetchNextBatch(ParquetDFSSource.java:48)
	at org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43)
	at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:75)
	at org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:94)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:338)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:255)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:587)
```

`org.apache.hudi.client.common.HoodieSparkEngineContext#flatMap` returns a list that can't be sorted in-place.
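The failure in that trace is easy to reproduce without Hudi: `List.sort` mutates the list in place through `ListIterator.set`, which a list that rejects element mutation answers with `UnsupportedOperationException` — exactly the `AbstractList.set` frame above. Sorting a freshly collected copy, as the patch does, sidesteps the mutation. A stdlib-only sketch (the `Long` list is a stand-in for the `FileStatus` modification times; `HoodieSparkEngineContext#flatMap` itself is not modeled):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class SortCopyDemo {
    public static void main(String[] args) {
        // Stand-in for the engine-context result: a list that rejects mutation.
        List<Long> modTimes = Collections.unmodifiableList(Arrays.asList(30L, 10L, 20L));

        try {
            // In-place sort needs element mutation and therefore fails here.
            modTimes.sort(Comparator.naturalOrder());
        } catch (UnsupportedOperationException e) {
            System.out.println("in-place sort failed: " + e.getClass().getSimpleName());
        }

        // The approach taken in PR #2621: sort a stream-collected copy instead.
        List<Long> sorted = modTimes.stream().sorted().collect(Collectors.toList());
        System.out.println(sorted); // prints "[10, 20, 30]"
    }
}
```

The copy costs one extra pass over the list, which is negligible next to the file-listing work that produced it, and it makes no assumptions about the mutability of whatever list the engine context hands back.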
[jira] [Closed] (HUDI-1647) Supports snapshot read for Flink
[ https://issues.apache.org/jira/browse/HUDI-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang closed HUDI-1647. Resolution: Implemented. Implemented via master branch: 89003bc7801b035b5be31c76bfbf691bfcf9081a

> Supports snapshot read for Flink
> ----
> Key: HUDI-1647
> URL: https://issues.apache.org/jira/browse/HUDI-1647
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Flink Integration
> Reporter: Danny Chen
> Assignee: Danny Chen
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.8.0
>
> Support snapshot read for Flink for both MOR and COW tables:
> - COW: the parquet files for the latest file group slices
> - MOR: the parquet base file + log files for the latest file group slices
> Also implements the SQL connectors for both sink and source.
[jira] [Updated] (HUDI-1647) Supports snapshot read for Flink
[ https://issues.apache.org/jira/browse/HUDI-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang updated HUDI-1647: Fix Version/s: 0.8.0

> Supports snapshot read for Flink
> ----
> Key: HUDI-1647
> URL: https://issues.apache.org/jira/browse/HUDI-1647
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Flink Integration
> Reporter: Danny Chen
> Assignee: Danny Chen
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.8.0
>
> Support snapshot read for Flink for both MOR and COW tables:
> - COW: the parquet files for the latest file group slices
> - MOR: the parquet base file + log files for the latest file group slices
> Also implements the SQL connectors for both sink and source.
[GitHub] [hudi] yanghua merged pull request #2613: [HUDI-1647] Supports snapshot read for Flink
yanghua merged pull request #2613: URL: https://github.com/apache/hudi/pull/2613
[hudi] branch master updated (899ae70 -> 89003bc)
This is an automated email from the ASF dual-hosted git repository. vinoyang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git.

from 899ae70 [HUDI-1587] Add latency and freshness support (#2541)
add  89003bc [HUDI-1647] Supports snapshot read for Flink (#2613)

No new revisions were added by this update.

Summary of changes:
```
 hudi-flink/pom.xml                                 |  41 +-
 .../apache/hudi/factory/HoodieTableFactory.java    |  84
 .../org/apache/hudi/operator/FlinkOptions.java     |  67 ++-
 .../apache/hudi/operator/StreamWriteFunction.java  |   8 +
 .../apache/hudi/operator/StreamWriteOperator.java  |   8 +-
 .../operator/StreamWriteOperatorCoordinator.java   |  34 +-
 .../hudi/operator/StreamWriteOperatorFactory.java  |  10 +-
 .../operator/partitioner/BucketAssignFunction.java |   1 -
 .../java/org/apache/hudi/sink/HoodieTableSink.java | 126 +
 .../org/apache/hudi/source/HoodieTableSource.java  | 411 +
 .../apache/hudi/source/format/FilePathUtils.java   | 320 +
 .../org/apache/hudi/source/format/FormatUtils.java |  98
 .../source/format/cow/AbstractColumnReader.java    | 324 +
 .../source/format/cow/CopyOnWriteInputFormat.java  | 134 ++
 .../format/cow/Int64TimestampColumnReader.java     |  99
 .../format/cow/ParquetColumnarRowSplitReader.java  | 370 +++
 .../source/format/cow/ParquetDecimalVector.java    |  69 +++
 .../source/format/cow/ParquetSplitReaderUtil.java  | 398
 .../hudi/source/format/cow/RunLengthDecoder.java   | 304
 .../source/format/mor/MergeOnReadInputFormat.java  | 513 +
 .../source/format/mor/MergeOnReadInputSplit.java   |  88
 .../source/format/mor/MergeOnReadTableState.java   |  79
 .../org/apache/hudi/util/AvroSchemaConverter.java  | 180 +++-
 .../apache/hudi/util/AvroToRowDataConverters.java  | 316 +
 .../java/org/apache/hudi/util/StreamerUtil.java    |  30 ++
 .../org.apache.flink.table.factories.TableFactory  |   5 +-
 .../apache/hudi/operator/StreamWriteITCase.java    |  43 +-
 .../StreamWriteOperatorCoordinatorTest.java        |   2 +-
 .../operator/utils/StreamWriteFunctionWrapper.java |   2 +-
 .../hudi/operator/utils/TestConfigurations.java    |  52 ++-
 .../org/apache/hudi/operator/utils/TestData.java   |  66 ++-
 .../apache/hudi/source/HoodieDataSourceITCase.java | 162 +++
 .../apache/hudi/source/HoodieTableSourceTest.java  | 122 +
 .../apache/hudi/source/format/InputFormatTest.java | 197
 .../utils/factory/ContinuousFileSourceFactory.java |  62 +++
 .../hudi/utils/source/ContinuousFileSource.java    | 173 +++
 .../org.apache.flink.table.factories.TableFactory  |   7 +-
 style/checkstyle-suppressions.xml                  |   1 +
 38 files changed, 4920 insertions(+), 86 deletions(-)
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/factory/HoodieTableFactory.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/sink/HoodieTableSink.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/HoodieTableSource.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/FilePathUtils.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/FormatUtils.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/cow/AbstractColumnReader.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/cow/CopyOnWriteInputFormat.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/cow/Int64TimestampColumnReader.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/cow/ParquetColumnarRowSplitReader.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/cow/ParquetDecimalVector.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/cow/ParquetSplitReaderUtil.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/cow/RunLengthDecoder.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/mor/MergeOnReadInputFormat.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/mor/MergeOnReadInputSplit.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/source/format/mor/MergeOnReadTableState.java
 create mode 100644 hudi-flink/src/main/java/org/apache/hudi/util/AvroToRowDataConverters.java
 copy hudi-integ-test/src/test/resources/hoodie-docker.properties => hudi-flink/src/main/resources/META-INF/services/org.apache.flink.table.factories.TableFactory (94%)
 create mode 100644 hudi-flink/src/test/java/org/apache/hudi/source/HoodieDataSourceITCase.java
 create mode 100644 hudi-flink/src/test/java/org/apache/hudi/source/HoodieTableSourceTest.java
 create mode 100644
```
[hudi] branch asf-site updated: Travis CI build asf-site
This is an automated email from the ASF dual-hosted git repository. vinoth pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push: new a2efa69 Travis CI build asf-site

a2efa69 is described below

commit a2efa690d6b6847a7375caa50e9f96729455e539
Author: CI
AuthorDate: Fri Mar 5 00:39:50 2021 +

    Travis CI build asf-site

---
 content/cn/docs/docker_demo.html | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

```diff
diff --git a/content/cn/docs/docker_demo.html b/content/cn/docs/docker_demo.html
index 05f3395..60668ea 100644
--- a/content/cn/docs/docker_demo.html
+++ b/content/cn/docs/docker_demo.html
@@ -574,13 +574,13 @@
(remainder of the generated-HTML diff was truncated in the original message)
```
[hudi] branch asf-site updated: [DOCS] Update 0_4_docker_demo.cn.md (#2629)
This is an automated email from the ASF dual-hosted git repository. vinoth pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push: new 2619964 [DOCS] Update 0_4_docker_demo.cn.md (#2629)

2619964 is described below

commit 2619964b07abc78f345247d634bc729eb3880f4d
Author: Wei
AuthorDate: Fri Mar 5 08:37:37 2021 +0800

    [DOCS] Update 0_4_docker_demo.cn.md (#2629)

    Fix docs hive_sync path

---
 docs/_docs/0_4_docker_demo.cn.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

```diff
diff --git a/docs/_docs/0_4_docker_demo.cn.md b/docs/_docs/0_4_docker_demo.cn.md
index d0005b2..63bf33e 100644
--- a/docs/_docs/0_4_docker_demo.cn.md
+++ b/docs/_docs/0_4_docker_demo.cn.md
@@ -192,13 +192,13 @@ inorder to run Hive queries against those datasets.
 docker exec -it adhoc-2 /bin/bash

 # THis command takes in HIveServer URL and COW Hudi Dataset location in HDFS and sync the HDFS state to Hive
-/var/hoodie/ws/hudi-hive-sync/run_sync_tool.sh --jdbc-url jdbc:hive2://hiveserver:1 --user hive --pass hive --partitioned-by dt --base-path /user/hive/warehouse/stock_ticks_cow --database default --table stock_ticks_cow
+/var/hoodie/ws/hudi-sync/hudi-hive-sync/run_sync_tool.sh --jdbc-url jdbc:hive2://hiveserver:1 --user hive --pass hive --partitioned-by dt --base-path /user/hive/warehouse/stock_ticks_cow --database default --table stock_ticks_cow
 .
 2018-09-24 22:22:45,568 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_cow
 .
 # Now run hive-sync for the second data-set in HDFS using Merge-On-Read (MOR storage)
-/var/hoodie/ws/hudi-hive-sync/run_sync_tool.sh --jdbc-url jdbc:hive2://hiveserver:1 --user hive --pass hive --partitioned-by dt --base-path /user/hive/warehouse/stock_ticks_mor --database default --table stock_ticks_mor
+/var/hoodie/ws/hudi-sync/hudi-hive-sync/run_sync_tool.sh --jdbc-url jdbc:hive2://hiveserver:1 --user hive --pass hive --partitioned-by dt --base-path /user/hive/warehouse/stock_ticks_mor --database default --table stock_ticks_mor
 ...
 2018-09-24 22:23:09,171 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(112)) - Sync complete for stock_ticks_mor
 ...
```
[GitHub] [hudi] vinothchandar merged pull request #2629: [DOCS] Fix docs hive_sync path
vinothchandar merged pull request #2629: URL: https://github.com/apache/hudi/pull/2629
[GitHub] [hudi] bvaradar commented on issue #2620: [SUPPORT] Performance Tuning: Slow stages (Building Workload Profile & Getting Small files from partitions) during Hudi Writes
bvaradar commented on issue #2620: URL: https://github.com/apache/hudi/issues/2620#issuecomment-790975178 @nsivabalan is looking into this.
[GitHub] [hudi] umehrot2 edited a comment on issue #2592: [SUPPORT] Does latest versions of Hudi (0.7.0, 0.6.0) work with Spark 2.3.0 when reading orc files?
umehrot2 edited a comment on issue #2592: URL: https://github.com/apache/hudi/issues/2592#issuecomment-790955939 @codejoyan I think the problem stems from using `org.apache.spark:spark-avro_2.11:2.4.4` in your packages with spark-submit. This is incompatible with `spark-sql 2.3.0`, which is why you get that `NoSuchMethod` exception. What happens if you use the `databricks-avro` package instead of `spark-avro`: https://mvnrepository.com/artifact/com.databricks/spark-avro ? Can you give that a shot? Also, @bvaradar is right. When you build Hudi to use the databricks-avro package, please build it with `-Pspark-shade-unbundle-avro` to avoid bundling/shading the spark-avro module in hudi-spark-bundle.
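For illustration, a rebuild along the lines described might look like the command below. Only the `-Pspark-shade-unbundle-avro` profile comes from the comment; the module selection and other flags are assumptions, not a verified recipe:

```sh
# Rebuild the Hudi Spark bundle without bundling/shading spark-avro,
# so the externally supplied avro package is used at runtime.
# (module path and -DskipTests are illustrative assumptions)
mvn clean package -DskipTests -Pspark-shade-unbundle-avro \
    -pl packaging/hudi-spark-bundle -am
```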
[GitHub] [hudi] umehrot2 commented on issue #2592: [SUPPORT] Does latest versions of Hudi (0.7.0, 0.6.0) work with Spark 2.3.0 when reading orc files?
umehrot2 commented on issue #2592: URL: https://github.com/apache/hudi/issues/2592#issuecomment-790955939 @codejoyan I think the problem stems from using `org.apache.spark:spark-avro_2.11:2.4.4` in your packages with spark-submit. This is incompatible with `spark-sql 2.3.0`, which is why you get that `NoSuchMethod` exception. Bundling/shading this package will not solve the problem either, because it is simply incompatible. What happens if you use the `databricks-avro` package instead of `spark-avro`: https://mvnrepository.com/artifact/com.databricks/spark-avro ? Can you give that a shot?
[GitHub] [hudi] satishkotha commented on issue #2589: [SUPPORT] Issue with adding column while running deltastreamer with kafka source.
satishkotha commented on issue #2589: URL: https://github.com/apache/hudi/issues/2589#issuecomment-790942436 Yeah, column deletions are not supported today. You can consider making all columns optional and continuing to write null for the fields that you want to delete. @bvaradar could you take a look at the MOR table log-file schema issue?
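In Avro terms, the suggested workaround amounts to declaring each to-be-deleted column as a nullable union with a `null` default, so writers can keep emitting `null` for it without a schema conflict. A minimal sketch (record and field names are invented for illustration):

```json
{
  "type": "record",
  "name": "ExampleRecord",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "column_to_retire", "type": ["null", "string"], "default": null}
  ]
}
```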
[GitHub] [hudi] codecov-io edited a comment on pull request #2632: [HUDI-1661] Exclude clustering commits from TimelineUtils API
codecov-io edited a comment on pull request #2632: URL: https://github.com/apache/hudi/pull/2632#issuecomment-790858581 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2632?src=pr=h1) Report > Merging [#2632](https://codecov.io/gh/apache/hudi/pull/2632?src=pr=desc) (52706ea) into [master](https://codecov.io/gh/apache/hudi/commit/899ae70fdb70c1511c099a64230fd91b2fe8d4ee?el=desc) (899ae70) will **increase** coverage by `0.03%`. > The diff coverage is `86.66%`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2632/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2632?src=pr=tree)

```diff
@@            Coverage Diff             @@
##           master    #2632      +/-  ##
==========================================
+ Coverage   51.58%   51.62%   +0.03%
- Complexity   3285     3293       +8
==========================================
  Files         446      446
  Lines       20409    20423      +14
  Branches     2116     2116
==========================================
+ Hits        10528    10543      +15
  Misses       9003     9003
+ Partials      878      877       -1
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `36.87% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `51.46% <86.66%> (+0.04%)` | `0.00 <9.00> (ø)` | |
| hudiflink | `51.29% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudihadoopmr | `33.16% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudisparkdatasource | `69.67% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudisync | `49.62% <ø> (ø)` | `0.00 <ø> (ø)` | |
| huditimelineservice | `64.36% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiutilities | `69.59% <ø> (+0.15%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2632?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...ache/hudi/common/table/timeline/TimelineUtils.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL1RpbWVsaW5lVXRpbHMuamF2YQ==) | `62.71% <86.66%> (+7.15%)` | `21.00 <9.00> (+7.00)` | |
| [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `71.07% <0.00%> (+0.35%)` | `53.00% <0.00%> (+1.00%)` | |
| [...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=) | `65.69% <0.00%> (+1.16%)` | `33.00% <0.00%> (ø%)` | |
[GitHub] [hudi] satishkotha commented on issue #2614: Caused by: com.fasterxml.jackson.core.JsonParseException: Unrecognized token
satishkotha commented on issue #2614: URL: https://github.com/apache/hudi/issues/2614#issuecomment-790940752 @root18039532923 Did you get a chance to try this patch? We can merge the patch once your testing looks good.
[GitHub] [hudi] satishkotha commented on a change in pull request #2627: [HUDI-1653] Add support for composite keys in NonpartitionedKeyGenerator
satishkotha commented on a change in pull request #2627: URL: https://github.com/apache/hudi/pull/2627#discussion_r587802720 ## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java ## @@ -20,20 +20,32 @@ import org.apache.avro.generic.GenericRecord; import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.keygen.constant.KeyGeneratorOptions; import org.apache.spark.sql.Row; +import java.util.Arrays; +import java.util.Collections; import java.util.List; +import java.util.stream.Collectors; /** * Simple Key generator for unpartitioned Hive Tables. */ -public class NonpartitionedKeyGenerator extends SimpleKeyGenerator { +public class NonpartitionedKeyGenerator extends BuiltinKeyGenerator { private final NonpartitionedAvroKeyGenerator nonpartitionedAvroKeyGenerator; - public NonpartitionedKeyGenerator(TypedProperties config) { -super(config); -nonpartitionedAvroKeyGenerator = new NonpartitionedAvroKeyGenerator(config); + public NonpartitionedKeyGenerator(TypedProperties props) { +super(props); +this.recordKeyFields = Arrays.stream(props.getString(KeyGeneratorOptions.RECORDKEY_FIELD_OPT_KEY) Review comment: Do you need these initialization here? Can we override getPartitionPathFields and getRecordKeyFields and call nonPartitionedAvroKeyGenerator instead to avoid code repetition? ## File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/keygen/TestNonpartitionedKeyGenerator.java ## @@ -0,0 +1,137 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.keygen; + +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.HoodieKey; +import org.apache.hudi.common.testutils.HoodieTestDataGenerator; +import org.apache.hudi.exception.HoodieKeyException; + +import org.apache.avro.generic.GenericRecord; +import org.apache.hudi.keygen.constant.KeyGeneratorOptions; +import org.apache.hudi.testutils.KeyGeneratorTestUtilities; +import org.apache.spark.sql.Row; +import org.junit.jupiter.api.Assertions; +import org.junit.jupiter.api.Test; + +import static junit.framework.TestCase.assertEquals; + +public class TestNonpartitionedKeyGenerator extends KeyGeneratorTestUtilities { + + private TypedProperties getCommonProps(boolean getComplexRecordKey) { +TypedProperties properties = new TypedProperties(); +if (getComplexRecordKey) { + properties.put(KeyGeneratorOptions.RECORDKEY_FIELD_OPT_KEY, "_row_key, pii_col"); +} else { + properties.put(KeyGeneratorOptions.RECORDKEY_FIELD_OPT_KEY, "_row_key"); +} +properties.put(KeyGeneratorOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true"); +return properties; + } + + private TypedProperties getPropertiesWithoutPartitionPathProp() { +return getCommonProps(false); + } + + private TypedProperties getPropertiesWithPartitionPathProp() { +TypedProperties properties = getCommonProps(true); +properties.put(KeyGeneratorOptions.PARTITIONPATH_FIELD_OPT_KEY, "timestamp,ts_ms"); +return properties; + } + + private TypedProperties getPropertiesWithoutRecordKeyProp() { +TypedProperties properties = new TypedProperties(); 
+properties.put(KeyGeneratorOptions.PARTITIONPATH_FIELD_OPT_KEY, "timestamp"); +return properties; + } + + private TypedProperties getWrongRecordKeyFieldProps() { +TypedProperties properties = new TypedProperties(); +properties.put(KeyGeneratorOptions.PARTITIONPATH_FIELD_OPT_KEY, "timestamp"); +properties.put(KeyGeneratorOptions.RECORDKEY_FIELD_OPT_KEY, "_wrong_key"); +return properties; + } + + @Test + public void testNullRecordKeyFields() { +Assertions.assertThrows(IllegalArgumentException.class, () -> new NonpartitionedKeyGenerator(getPropertiesWithoutRecordKeyProp())); + } + + @Test + public void testNullPartitionPathFields() { +TypedProperties properties = getPropertiesWithoutPartitionPathProp(); +NonpartitionedKeyGenerator keyGenerator = new NonpartitionedKeyGenerator(properties); +GenericRecord record = getRecord(); +Row row = KeyGeneratorTestUtilities.getRow(record);
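The de-duplication the reviewer suggests above — overriding the field getters to delegate to the Avro key generator instead of re-parsing configuration — can be sketched roughly as follows. The classes here are simplified stand-ins with invented shapes, not the real Hudi types:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for NonpartitionedAvroKeyGenerator: owns the parsed
// record-key fields; a non-partitioned table has no partition-path fields.
class NonpartitionedAvroKeyGeneratorSketch {
    private final List<String> recordKeyFields;

    NonpartitionedAvroKeyGeneratorSketch(List<String> recordKeyFields) {
        this.recordKeyFields = recordKeyFields;
    }

    List<String> getRecordKeyFields() {
        return recordKeyFields;
    }

    List<String> getPartitionPathFields() {
        return Collections.emptyList(); // non-partitioned: no partition fields
    }
}

// Hypothetical stand-in for the Spark-side NonpartitionedKeyGenerator:
// forwards field lookups instead of duplicating the parsing logic.
class NonpartitionedKeyGeneratorSketch {
    private final NonpartitionedAvroKeyGeneratorSketch avroKeyGenerator;

    NonpartitionedKeyGeneratorSketch(NonpartitionedAvroKeyGeneratorSketch avroKeyGenerator) {
        this.avroKeyGenerator = avroKeyGenerator;
    }

    List<String> getRecordKeyFields() {
        return avroKeyGenerator.getRecordKeyFields();
    }

    List<String> getPartitionPathFields() {
        return avroKeyGenerator.getPartitionPathFields();
    }
}
```

With this shape, the record-key parsing lives in exactly one place, which is the point of the review comment.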
[GitHub] [hudi] vinothchandar edited a comment on issue #2633: Empty File Slice causing application to fail in small files optimization code
vinothchandar edited a comment on issue #2633: URL: https://github.com/apache/hudi/issues/2633#issuecomment-790926531 @n3nash so this seems to corroborate udit's finding then. cc @bvaradar as well, who can comment on the suggested fix.
[GitHub] [hudi] vinothchandar commented on issue #2633: Empty File Slice causing application to fail in small files optimization code
vinothchandar commented on issue #2633: URL: https://github.com/apache/hudi/issues/2633#issuecomment-790926531 cc @bvaradar as well, who can confirm.
[GitHub] [hudi] n3nash commented on issue #2633: Empty File Slice causing application to fail in small files optimization code
n3nash commented on issue #2633: URL: https://github.com/apache/hudi/issues/2633#issuecomment-790917128 @umehrot2 From what you are describing, it looks like a bug. When we configure HbaseIndex, we automatically assume

```
public boolean canIndexLogFiles() {
  return true;
}
```

and inserts go into log files. But this part of the code -> https://github.com/apache/hudi/blob/release-0.6.0/hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java#L190 assumes that either a) inserts first go into parquet files, in which case there will be a base file, or b) a delta file was already written earlier and needs to be re-sized. Ideally, this code should create a new log file with the new base instant time and send inserts there.
[GitHub] [hudi] modi95 commented on pull request #2628: [HUDI-1635] Improvements to Hudi Test Suite
modi95 commented on pull request #2628: URL: https://github.com/apache/hudi/pull/2628#issuecomment-790916909 @nbalajee these changes look good to me. Can you try doing the following:
- Run the schema evolution test suite in the docker setup (a step-by-step guide is available in the test-suite readme)
- Add instructions to the test-suite readme on how to use schema evolution
[GitHub] [hudi] umehrot2 commented on issue #2633: Empty File Slice causing application to fail in small files optimization code
umehrot2 commented on issue #2633: URL: https://github.com/apache/hudi/issues/2633#issuecomment-790891245 @bvaradar can you confirm that the finding is correct, since it seems you worked on that file system view implementation? Also cc @n3nash @vinothchandar
[GitHub] [hudi] umehrot2 opened a new issue #2633: Empty File Slice causing application to fail in small files optimization code
umehrot2 opened a new issue #2633: URL: https://github.com/apache/hudi/issues/2633 **Describe the problem you faced** IHAC who is using Hudi's `Spark structured streaming sink` with `asynchronous compaction` and `Hbase Index` on EMR. The Hudi version being used is 0.6.0. After a while their job fails with the following error:

```
java.util.NoSuchElementException: No value present
  at java.util.Optional.get(Optional.java:135)
  at org.apache.hudi.table.action.deltacommit.UpsertDeltaCommitPartitioner.getSmallFiles(UpsertDeltaCommitPartitioner.java:103)
  at org.apache.hudi.table.action.commit.UpsertPartitioner.lambda$getSmallFilesForPartitions$d2bd4b49$1(UpsertPartitioner.java:216)
  at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1043)
  at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1043)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
  at scala.collection.AbstractIterator.to(Iterator.scala:1334)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
  at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$15.apply(RDD.scala:990)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$15.apply(RDD.scala:990)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:123)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
```

Here is a timeline of events leading up to the failure of the final delta commit `20210228152758`: ![image](https://user-images.githubusercontent.com/8647012/110020866-ec103080-7cde-11eb-859a-f539d4217f6a.png) As an observation, you can see that there are two asynchronous compactions running at the same time before the delta commit failure. After discussions with Vinoth, I also verified that there are no common file groups between the two concurrent compactions, so nothing suspicious was found there. I followed the code path, and as we can see the error stems from here, where it is assumed that some log files will be present in the file slice: https://github.com/apache/hudi/blob/release-0.6.0/hudi-client/src/main/java/org/apache/hudi/table/action/deltacommit/UpsertDeltaCommitPartitioner.java#L103 Upon following this piece of code, I found this suspicious code: https://github.com/apache/hudi/blob/release-0.6.0/hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java#L190 It is generating an empty file slice if there is a pending compaction, without any base file or log file. This is bound to cause failure in the small files logic. Can you please confirm if my findings make sense to you?
Also from the comment: ``` // If there is no delta-commit after compaction request, this step would ensure a new file-slice appears // so that any new ingestion uses the correct base-instant ``` I only see us checking for pending compaction, but not for the presence of a delta commit after the compaction request. So is that code actually doing what it's intended to? As for our customer, I am suggesting for now to turn off the small files optimization feature by setting `HoodieStorageConfig.PARQUET_FILE_MAX_BYTES` to `0`. Another action item from this is that the small files optimization code needs a check to ignore a file slice if it is completely empty, instead of causing a failure, since an empty file slice can be expected. **Environment
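The failure mode and the proposed guard (ignore a completely empty file slice rather than fail) can be sketched as follows. `FileSliceSketch` and its fields are invented stand-ins for illustration, not Hudi's actual `FileSlice` API:

```java
import java.util.Optional;

// Sketch of the bug described in the issue: the small-files logic calls
// Optional.get() assuming a log file is always present, which throws
// NoSuchElementException on a slice with neither base nor log files.
class FileSliceSketch {
    final Optional<String> baseFile;
    final Optional<String> latestLogFile;

    FileSliceSketch(Optional<String> baseFile, Optional<String> latestLogFile) {
        this.baseFile = baseFile;
        this.latestLogFile = latestLogFile;
    }

    boolean isEmpty() {
        return !baseFile.isPresent() && !latestLogFile.isPresent();
    }

    // Failing pattern: assumes a log file is always present.
    String smallFileCandidateUnsafe() {
        return latestLogFile.get(); // throws NoSuchElementException when empty
    }

    // Suggested guard: skip completely empty slices instead of failing.
    Optional<String> smallFileCandidateSafe() {
        return isEmpty() ? Optional.empty() : latestLogFile;
    }
}
```

The defensive variant mirrors the issue's action item: an empty file slice is an expected state while a compaction is pending, so it should be skipped, not treated as an error.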
[GitHub] [hudi] vinothchandar commented on a change in pull request #2374: [HUDI-845] Added locking capability to allow multiple writers
vinothchandar commented on a change in pull request #2374: URL: https://github.com/apache/hudi/pull/2374#discussion_r587049910 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java ## @@ -30,16 +31,19 @@ import org.apache.hudi.callback.util.HoodieCommitCallbackFactory; import org.apache.hudi.client.embedded.EmbeddedTimelineService; import org.apache.hudi.client.heartbeat.HeartbeatUtils; +import org.apache.hudi.client.lock.TransactionManager; Review comment: `lock` as a package name feels off to me. Can we have `org.apache.hudi.client.transaction.TransactionManager`? Then `.lock` can be a sub package under i.e `.transaction.lock.` ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java ## @@ -30,16 +31,19 @@ import org.apache.hudi.callback.util.HoodieCommitCallbackFactory; import org.apache.hudi.client.embedded.EmbeddedTimelineService; import org.apache.hudi.client.heartbeat.HeartbeatUtils; +import org.apache.hudi.client.lock.TransactionManager; import org.apache.hudi.common.engine.HoodieEngineContext; import org.apache.hudi.common.model.HoodieCommitMetadata; import org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy; import org.apache.hudi.common.model.HoodieKey; import org.apache.hudi.common.model.HoodieRecordPayload; import org.apache.hudi.common.model.HoodieWriteStat; +import org.apache.hudi.common.model.TableService; Review comment: So, `model` package should just contain pojos i.e data structure objects. 
Lets move `TableService` elsewhere ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java ## @@ -210,6 +231,11 @@ void emitCommitMetrics(String instantTime, HoodieCommitMetadata metadata, String } } + protected void preCommit(String instantTime, HoodieCommitMetadata metadata) { +// no-op +// TODO : Conflict resolution is not support for Flink,Java engines Review comment: typo: not supported ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java ## @@ -797,7 +853,9 @@ public Boolean rollbackFailedWrites() { * Performs a compaction operation on a table, serially before or after an insert/upsert action. */ protected Option inlineCompact(Option> extraMetadata) { -Option compactionInstantTimeOpt = scheduleCompaction(extraMetadata); +String schedulingCompactionInstant = HoodieActiveTimeline.createNewInstantTime(); Review comment: rename: `compactionInstantTime` ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/lock/ConflictResolutionStrategy.java ## @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.client.lock; + +import org.apache.hudi.ApiMaturityLevel; +import org.apache.hudi.PublicAPIMethod; +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.metadata.HoodieBackedTableMetadataWriter; +import org.apache.hudi.table.HoodieTable; + +import java.util.stream.Stream; + +/** + * Strategy interface for conflict resolution with multiple writers. + * Users can provide pluggable implementations for different kinds of strategies to resolve conflicts when multiple + * writers are mutating the hoodie table. + */ +public interface ConflictResolutionStrategy { + + /** + * Stream of instants to check conflicts against. + * @return + */ + Stream getInstantsStream(HoodieActiveTimeline activeTimeline, HoodieInstant currentInstant, Option lastSuccessfulInstant); + + /** + * Implementations of this method will determine whether a conflict exists between 2 commits. + * @param thisOperation + * @param otherOperation + * @return + */ + @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING) + boolean hasConflict(HoodieCommitOperation thisOperation,
[GitHub] [hudi] codecov-io commented on pull request #2632: [HUDI-1661] Exclude clustering commits from TimelineUtils API
codecov-io commented on pull request #2632: URL: https://github.com/apache/hudi/pull/2632#issuecomment-790858581 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2632?src=pr=h1) Report > Merging [#2632](https://codecov.io/gh/apache/hudi/pull/2632?src=pr=desc) (fd53720) into [master](https://codecov.io/gh/apache/hudi/commit/899ae70fdb70c1511c099a64230fd91b2fe8d4ee?el=desc) (899ae70) will **increase** coverage by `17.85%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2632/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2632?src=pr=tree)

```diff
@@             Coverage Diff              @@
##           master    #2632       +/-   ##
===========================================
+ Coverage   51.58%   69.44%   +17.85%
+ Complexity   3285      363     -2922
===========================================
  Files         446       53      -393
  Lines       20409     1944    -18465
  Branches     2116      235     -1881
===========================================
- Hits        10528     1350     -9178
+ Misses       9003      460     -8543
+ Partials      878      134      -744
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `69.44% <ø> (ø)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2632?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...org/apache/hudi/cli/commands/RollbacksCommand.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL1JvbGxiYWNrc0NvbW1hbmQuamF2YQ==) | | | | | [.../org/apache/hudi/common/model/HoodieFileGroup.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUZpbGVHcm91cC5qYXZh) | | | | | [...metadata/HoodieMetadataMergedLogRecordScanner.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvSG9vZGllTWV0YWRhdGFNZXJnZWRMb2dSZWNvcmRTY2FubmVyLmphdmE=) | | | | | [...di-cli/src/main/java/org/apache/hudi/cli/Main.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL01haW4uamF2YQ==) | | | | | [...mmon/table/log/AbstractHoodieLogRecordScanner.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9BYnN0cmFjdEhvb2RpZUxvZ1JlY29yZFNjYW5uZXIuamF2YQ==) | | | | | [.../versioning/clean/CleanPlanV2MigrationHandler.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL3ZlcnNpb25pbmcvY2xlYW4vQ2xlYW5QbGFuVjJNaWdyYXRpb25IYW5kbGVyLmphdmE=) | | | | | [...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZVNwYXJrU3FsV3JpdGVyLnNjYWxh) | | | | | 
[.../main/scala/org/apache/hudi/cli/SparkHelpers.scala](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL2NsaS9TcGFya0hlbHBlcnMuc2NhbGE=) | | | | | [...src/main/java/org/apache/hudi/QuickstartUtils.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvUXVpY2tzdGFydFV0aWxzLmphdmE=) | | | | | [...apache/hudi/cli/commands/HoodieLogFileCommand.java](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL0hvb2RpZUxvZ0ZpbGVDb21tYW5kLmphdmE=) | | | | | ... and [382 more](https://codecov.io/gh/apache/hudi/pull/2632/diff?src=pr=tree-more) | |
[jira] [Updated] (HUDI-1661) Change utility methods that help get extra metadata to ignore internal rewrite commits
[ https://issues.apache.org/jira/browse/HUDI-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1661: - Labels: pull-request-available (was: ) > Change utility methods that help get extra metadata to ignore internal > rewrite commits > -- > > Key: HUDI-1661 > URL: https://issues.apache.org/jira/browse/HUDI-1661 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: satish >Assignee: satish >Priority: Major > Labels: pull-request-available > > Exclude clustering commits from getExtraMetadataFromLatest API used for > getting user stored information such as checkpoint. > Provide a new API incase there are use cases that store same metadata for > internal (clustering) and external (commit/deltacommit). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] satishkotha opened a new pull request #2632: [HUDI-1661] Exclude clustering commits from TimelineUtils API
satishkotha opened a new pull request #2632: URL: https://github.com/apache/hudi/pull/2632

## What is the purpose of the pull request
Exclude internal rewrite commits, such as clustering commits, from the getExtraMetadataFromLatest API.

## Brief change log
The getExtraMetadataFromLatest API is used for getting user-stored information such as checkpoints. Clustering commits don't have this information, so exclude clustering from this API. Provide a new API for use cases that require reading extra metadata from all commits, including clustering.

## Verify this pull request
This change added tests.

## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
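The intended behavior can be sketched as a timeline filter: skip clustering (replace-commit) instants when resolving the latest extra metadata, since they carry no user checkpoint. The types below are simplified stand-ins, not Hudi's actual timeline classes, and the `"replacecommit"` action name is used here only as an illustrative label:

```java
import java.util.*;

// Sketch of excluding internal rewrite commits from a "latest extra
// metadata" lookup. Each instant is modeled as an (action, extraMetadata)
// pair in commit order.
class TimelineSketch {
    static Optional<String> extraMetadataFromLatest(List<Map.Entry<String, String>> instants) {
        return instants.stream()
            .filter(i -> !"replacecommit".equals(i.getKey())) // drop clustering commits
            .reduce((a, b) -> b)                              // latest remaining instant
            .map(Map.Entry::getValue);
    }
}
```

With this filter, a clustering commit landing after a delta commit no longer hides the delta commit's checkpoint, which is the behavior the PR description asks for.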
[jira] [Created] (HUDI-1661) Change utility methods that help get extra metadata to ignore internal rewrite commits
satish created HUDI-1661: Summary: Change utility methods that help get extra metadata to ignore internal rewrite commits Key: HUDI-1661 URL: https://issues.apache.org/jira/browse/HUDI-1661 Project: Apache Hudi Issue Type: Sub-task Reporter: satish Assignee: satish Exclude clustering commits from the getExtraMetadataFromLatest API, which is used for getting user-stored information such as checkpoints. Provide a new API in case there are use cases that store the same metadata for internal (clustering) and external (commit/deltacommit) commits.
[GitHub] [hudi] n3nash commented on pull request #2628: [HUDI-1635] Improvements to Hudi Test Suite
n3nash commented on pull request #2628: URL: https://github.com/apache/hudi/pull/2628#issuecomment-790821135 @modi95 can you review this?
[GitHub] [hudi] codecov-io edited a comment on pull request #2611: [HUDI-1646] Provide mechanism to read uncommitted data through InputFormat
codecov-io edited a comment on pull request #2611: URL: https://github.com/apache/hudi/pull/2611#issuecomment-787641189 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2611?src=pr=h1) Report > Merging [#2611](https://codecov.io/gh/apache/hudi/pull/2611?src=pr=desc) (427523d) into [master](https://codecov.io/gh/apache/hudi/commit/be257b58c689510a21529019a766b7a2bfc7ebe6?el=desc) (be257b5) will **decrease** coverage by `41.65%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2611/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2611?src=pr=tree)

```diff
@@             Coverage Diff              @@
##             master   #2611       +/-   ##
============================================
- Coverage     51.27%   9.61%    -41.66%
+ Complexity     3241      48      -3193
============================================
  Files           438       53       -385
  Lines         20126     1944     -18182
  Branches       2079      235      -1844
============================================
- Hits          10320      187     -10133
+ Misses         8954     1744      -7210
+ Partials        852       13       -839
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.61% <ø> (-59.83%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2611?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2611/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%>
[GitHub] [hudi] n3nash opened a new pull request #2631: [HUDI 1660] Excluding compaction and clustering instants from inflight rollback
n3nash opened a new pull request #2631: URL: https://github.com/apache/hudi/pull/2631 ## What is the purpose of the pull request *This PR fixes a bug to ensure that pending compaction & clustering operations are always excluded when performing inflight rollbacks* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
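The exclusion this PR describes can be sketched as a filter over the inflight instants. This is a minimal sketch under assumed action names (`compaction`, and `replacecommit` for clustering), not Hudi's actual rollback code.

```java
import java.util.List;
import java.util.Set;

public class RollbackCandidateSketch {

    // Hypothetical stand-in for an inflight instant on the timeline.
    record Instant(String timestamp, String action) {}

    // Pending compaction and clustering are intentional, scheduled operations,
    // so they must never be treated as failed writes and rolled back.
    static final Set<String> EXCLUDED_ACTIONS = Set.of("compaction", "replacecommit");

    static List<Instant> inflightRollbackCandidates(List<Instant> inflightInstants) {
        return inflightInstants.stream()
                .filter(i -> !EXCLUDED_ACTIONS.contains(i.action()))
                .toList();
    }

    public static void main(String[] args) {
        List<Instant> inflight = List.of(
                new Instant("001", "deltacommit"),    // a genuinely failed write: roll back
                new Instant("002", "compaction"),     // pending compaction: keep
                new Instant("003", "replacecommit")); // pending clustering: keep
        System.out.println(inflightRollbackCandidates(inflight));
    }
}
```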
[jira] [Created] (HUDI-1660) Exclude pending compaction & clustering from rollback
Nishith Agarwal created HUDI-1660: - Summary: Exclude pending compaction & clustering from rollback Key: HUDI-1660 URL: https://issues.apache.org/jira/browse/HUDI-1660 Project: Apache Hudi Issue Type: Bug Components: Writer Core Reporter: Nishith Agarwal Assignee: Nishith Agarwal
[jira] [Updated] (HUDI-1656) Loading history data to new hudi table taking longer time
[ https://issues.apache.org/jira/browse/HUDI-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fredrick jose antony cruz updated HUDI-1656: Description: spark-submit --jars /u/users/svcordrdats/order_hudi_poc/hudi-support-jars/org.apache.avro_avro-1.8.2.jar,/u/users/svcordrdats/order_hudi_poc/hudi-support-jars/spark-avro_2.11-2.4.4.jar,/u/users/svcordrdats/order_hudi_poc/hudi-support-jars/hudi-spark-bundle_2.11-0.7.0.jar --master yarn --deploy-mode cluster --num-executors 50 --executor-cores 4 --executor-memory 32g --driver-memory=24g --queue=default --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.driver.extraClassPath=org.apache.avro_avro-1.8.2.jar:spark-avro_2.11-2.4.4.jar --conf spark.executor.extraClassPath=org.apache.avro_avro-1.8.2.jar:spark-avro_2.11-2.4.4.jar:hudi-spark-bundle_2.11-0.7.0.jar --conf spark.memory.fraction=0.2 --driver-java-options "-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof" --files /usr/hdp/current/spark2-client/conf/hive-site.xml --class com.walmart.gis.order.workflows.WorkflowController lib/orders-poc-1.0.41-SNAPSHOT-shaded.jar workflow="stgStsWorkflow" runmode="global" we are running on GCS cluster with 3 TB, 29 node cluster 870 v. core. 
pom 1.8 2.11.12 2.3.0 1.8.2 2.4.4 0.7.0 1.4.0 UTF-8 1.8 1.8

stsDailyDf.write.format("org.apache.hudi")
  .option("hoodie.cleaner.commits.retained", 2)
  .option("hoodie.copyonwrite.record.size.estimate", 70)
  .option("hoodie.parquet.small.file.limit", 1)
  .option("hoodie.parquet.max.file.size", 12800)
  .option("hoodie.index.bloom.num_entries", 180)
  .option("hoodie.bloom.index.filter.type", "DYNAMIC_V0")
  .option("hoodie.bloom.index.filter.dynamic.max.entries", 250)
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.storage.type", "COPY_ON_WRITE")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "sales_order_sts_line_key")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "REL_STS_DT")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "src_upd_ts")
  .option(HoodieWriteConfig.TABLE_NAME, tableName.toString)
  .option("hoodie.bloom.index.bucketized.checking", "false")
  .mode(SaveMode.Append)
  .save(tablePath.toString)

> Loading history data to new hudi table taking longer time
> -
>
>                 Key: HUDI-1656
>                 URL: https://issues.apache.org/jira/browse/HUDI-1656
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: newbie
>            Reporter: Fredrick jose antony cruz
>            Priority: Major
>             Fix For: 0.7.0
[jira] [Updated] (HUDI-1659) Support DDL And Insert For Hudi In Spark Sql
[ https://issues.apache.org/jira/browse/HUDI-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pengzhiwei updated HUDI-1659: - Summary: Support DDL And Insert For Hudi In Spark Sql (was: DDL Support For Hudi In Spark Sql)
> Support DDL And Insert For Hudi In Spark Sql
> -
>
>                 Key: HUDI-1659
>                 URL: https://issues.apache.org/jira/browse/HUDI-1659
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Spark Integration
>            Reporter: pengzhiwei
>            Assignee: pengzhiwei
>            Priority: Major
[GitHub] [hudi] rswagatika commented on issue #2564: Hoodie clean is not deleting old files
rswagatika commented on issue #2564: URL: https://github.com/apache/hudi/issues/2564#issuecomment-790707124 @bvaradar Hi, I was able to get the files; I am still working on the driver log and will provide it once I have it. Let me know if this is what you meant by the recursive S3 dataset listing:
2021-02-10 18:44:03  0 Bytes  my_test/.hoodie/.aux/.bootstrap/.fileids_$folder$
2021-02-10 18:44:02  0 Bytes  my_test/.hoodie/.aux/.bootstrap/.partitions_$folder$
2021-02-10 18:44:02  0 Bytes  my_test/.hoodie/.aux/.bootstrap_$folder$
2021-02-12 12:19:22  3.9 KiB  my_test/.hoodie/.aux/20210212171919.compaction.requested
2021-02-12 16:12:23  3.9 KiB  my_test/.hoodie/.aux/20210212211220.compaction.requested
2021-02-10 18:44:02  0 Bytes  my_test/.hoodie/.aux_$folder$
2021-02-12 12:19:24  0 Bytes  my_test/.hoodie/.temp/20210212171919/range_partition=default/af837aeb-3881-43f0-8aa4-434502059b68-0_0-33-41_20210212171919.parquet.marker.MERGE
2021-02-12 12:19:24  0 Bytes  my_test/.hoodie/.temp/20210212171919/range_partition=default_$folder$
2021-02-12 12:19:24  0 Bytes  my_test/.hoodie/.temp/20210212171919_$folder$
2021-02-12 16:12:25  0 Bytes  my_test/.hoodie/.temp/20210212211220/range_partition=default/af837aeb-3881-43f0-8aa4-434502059b68-0_0-41-43_20210212211220.parquet.marker.MERGE
2021-02-12 16:12:25  0 Bytes  my_test/.hoodie/.temp/20210212211220/range_partition=default_$folder$
2021-02-12 16:12:25  0 Bytes  my_test/.hoodie/.temp/20210212211220_$folder$
2021-02-10 18:44:02  0 Bytes  my_test/.hoodie/.temp_$folder$
2021-02-10 19:23:35  969 Bytes  my_test/.hoodie/20210211002333.rollback
2021-02-10 19:23:35  0 Bytes  my_test/.hoodie/20210211002333.rollback.inflight
2021-02-10 19:24:22  969 Bytes  my_test/.hoodie/20210211002421.rollback
2021-02-10 19:24:22  0 Bytes  my_test/.hoodie/20210211002421.rollback.inflight
2021-02-10 19:34:30  969 Bytes  my_test/.hoodie/20210211003429.rollback
2021-02-10 19:34:30  0 Bytes  my_test/.hoodie/20210211003429.rollback.inflight
2021-02-10 19:35:18  969 Bytes  my_test/.hoodie/20210211003516.rollback
2021-02-10 19:35:18  0 Bytes  my_test/.hoodie/20210211003516.rollback.inflight
2021-02-10 19:35:51  969 Bytes  my_test/.hoodie/20210211003549.rollback
2021-02-10 19:35:50  0 Bytes  my_test/.hoodie/20210211003549.rollback.inflight
2021-02-10 19:36:39  969 Bytes  my_test/.hoodie/20210211003637.rollback
2021-02-10 19:36:38  0 Bytes  my_test/.hoodie/20210211003637.rollback.inflight
2021-02-10 19:37:13  969 Bytes  my_test/.hoodie/20210211003711.rollback
2021-02-10 19:37:12  0 Bytes  my_test/.hoodie/20210211003711.rollback.inflight
2021-02-10 19:38:02  969 Bytes  my_test/.hoodie/20210211003800.rollback
2021-02-10 19:38:01  0 Bytes  my_test/.hoodie/20210211003800.rollback.inflight
2021-02-11 01:05:37  969 Bytes  my_test/.hoodie/20210211060535.rollback
2021-02-11 01:05:36  0 Bytes  my_test/.hoodie/20210211060535.rollback.inflight
2021-02-11 03:36:47  969 Bytes  my_test/.hoodie/20210211083645.rollback
2021-02-11 03:36:47  0 Bytes  my_test/.hoodie/20210211083645.rollback.inflight
2021-02-11 06:25:37  969 Bytes  my_test/.hoodie/2021022535.rollback
2021-02-11 06:25:37  0 Bytes  my_test/.hoodie/2021022535.rollback.inflight
2021-02-11 21:21:47  969 Bytes  my_test/.hoodie/20210212022145.rollback
2021-02-11 21:21:46  0 Bytes  my_test/.hoodie/20210212022145.rollback.inflight
2021-02-12 10:24:29  1021 Bytes  my_test/.hoodie/20210212152423.rollback
2021-02-12 10:24:29  0 Bytes  my_test/.hoodie/20210212152423.rollback.inflight
2021-02-12 12:19:20  1.5 KiB  my_test/.hoodie/20210212171816.deltacommit
2021-02-12 12:19:17  1.6 KiB  my_test/.hoodie/20210212171816.deltacommit.inflight
2021-02-12 12:18:22  0 Bytes  my_test/.hoodie/20210212171816.deltacommit.requested
2021-02-12 12:21:27  1.5 KiB  my_test/.hoodie/20210212171919.commit
2021-02-12 12:19:22  0 Bytes  my_test/.hoodie/20210212171919.compaction.inflight
2021-02-12 12:19:22  3.9 KiB  my_test/.hoodie/20210212171919.compaction.requested
2021-02-12 14:55:26  1.5 KiB  my_test/.hoodie/20210212195421.deltacommit
2021-02-12 14:55:23  1.6 KiB  my_test/.hoodie/20210212195421.deltacommit.inflight
2021-02-12 14:54:28  0 Bytes  my_test/.hoodie/20210212195421.deltacommit.requested
2021-02-12 16:12:21  1.5 KiB  my_test/.hoodie/2021021227.deltacommit
2021-02-12 16:12:18  1.6 KiB  my_test/.hoodie/2021021227.deltacommit.inflight
2021-02-12 16:11:24  0 Bytes  my_test/.hoodie/2021021227.deltacommit.requested
2021-02-12 16:14:28  1.5 KiB  my_test/.hoodie/20210212211220.commit
2021-02-12 16:12:23  0 Bytes  my_test/.hoodie/20210212211220.compaction.inflight
2021-02-12 16:12:23  3.9 KiB  my_test/.hoodie/20210212211220.compaction.requested
2021-02-11 09:45:40  26.4 KiB  my_test/.hoodie/archived/.commits_.archive.10_1-0-1
2021-02-11 16:24:45  26.5
[jira] [Created] (HUDI-1659) DDL Support For Hudi In Spark Sql
pengzhiwei created HUDI-1659: Summary: DDL Support For Hudi In Spark Sql Key: HUDI-1659 URL: https://issues.apache.org/jira/browse/HUDI-1659 Project: Apache Hudi Issue Type: Sub-task Components: Spark Integration Reporter: pengzhiwei Assignee: pengzhiwei
[jira] [Created] (HUDI-1658) Spark Sql Support For Hudi
pengzhiwei created HUDI-1658: Summary: Spark Sql Support For Hudi Key: HUDI-1658 URL: https://issues.apache.org/jira/browse/HUDI-1658 Project: Apache Hudi Issue Type: New Feature Components: Spark Integration Reporter: pengzhiwei Assignee: pengzhiwei Fix For: 0.8.0 This is the main task for supporting Spark SQL for Hudi, covering DDL, DML, and the Hoodie CLI commands.
[GitHub] [hudi] guanziyue opened a new issue #2630: [SUPPORT]Confuse about the strategy to evaluate average record size
guanziyue opened a new issue #2630: URL: https://github.com/apache/hudi/issues/2630 **Describe the problem you faced** In the UpsertPartitioner class, the method averageBytesPerRecord is used to estimate the average record size from previous commits. A fileSizeThreshold is used to filter which commits are eligible for the estimate; it is calculated as "hoodieWriteConfig.getRecordSizeEstimationThreshold() * hoodieWriteConfig.getParquetSmallFileLimit();". Could you explain why it works this way? It seems a commit must be large enough before this strategy is activated. **Expected behavior** Could a smaller threshold also work? What about using hoodie.parquet.max.file.size? **Environment Description** * Hudi version: 0.6.0
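The thresholding the question describes can be sketched as follows. This is a hypothetical simplification of `UpsertPartitioner.averageBytesPerRecord`: commits are flattened to `{bytesWritten, recordsWritten}` pairs, and the constants shown (`1.0` for the estimation threshold, 100 MB for `hoodie.parquet.small.file.limit`) are assumed defaults, not values read from a real config.

```java
public class AverageRecordSizeSketch {

    // Assumed defaults; in Hudi these come from HoodieWriteConfig.
    static final double RECORD_SIZE_ESTIMATION_THRESHOLD = 1.0;
    static final long PARQUET_SMALL_FILE_LIMIT = 100L * 1024 * 1024; // 100 MB

    // Walk commits newest-first; only a commit that wrote more bytes than
    // threshold = estimationThreshold * smallFileLimit is trusted for the
    // estimate, which is why a "huge enough" commit is needed to activate it.
    static long averageBytesPerRecord(long[][] commitsNewestFirst, long fallback) {
        long threshold = (long) (RECORD_SIZE_ESTIMATION_THRESHOLD * PARQUET_SMALL_FILE_LIMIT);
        for (long[] commit : commitsNewestFirst) {
            long totalBytes = commit[0], totalRecords = commit[1];
            if (totalBytes > threshold && totalRecords > 0) {
                return totalBytes / totalRecords;
            }
        }
        // No commit was big enough: fall back to the configured estimate
        // (e.g. hoodie.copyonwrite.record.size.estimate).
        return fallback;
    }

    public static void main(String[] args) {
        long[][] commits = {
                {50L * 1024 * 1024, 500_000},   // 50 MB: below threshold, skipped
                {200L * 1024 * 1024, 2_000_000} // 200 MB: used for the estimate
        };
        System.out.println(averageBytesPerRecord(commits, 1024)); // 104
    }
}
```

Under these assumed defaults, a commit smaller than roughly 100 MB never updates the estimate; writers that only ever produce small commits keep using the configured fallback.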
[GitHub] [hudi] danny0405 commented on a change in pull request #2613: [HUDI-1647] Supports snapshot read for Flink
danny0405 commented on a change in pull request #2613: URL: https://github.com/apache/hudi/pull/2613#discussion_r587409966 ## File path: hudi-flink/src/main/resources/META-INF/services/org.apache.flink.table.factories.TableFactory ## @@ -0,0 +1,17 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +org.apache.hudi.factory.HoodieTableFactory Review comment: Required for the java SPI service. ## File path: hudi-flink/src/test/resources/META-INF/services/org.apache.flink.table.factories.TableFactory ## @@ -0,0 +1,18 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +org.apache.hudi.factory.HoodieTableFactory +org.apache.hudi.utils.factory.ContinuousFileSourceFactory Review comment: Required for the java SPI service.
[GitHub] [hudi] yanghua commented on a change in pull request #2621: [HUDI-1655] Allow custom date format in DatePartitionPathSelector
yanghua commented on a change in pull request #2621: URL: https://github.com/apache/hudi/pull/2621#discussion_r587389540 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DatePartitionPathSelector.java ## @@ -130,20 +140,19 @@ public DatePartitionPathSelector(TypedProperties props, Configuration hadoopConf FileSystem fs = new Path(path).getFileSystem(serializedConf.get()); return listEligibleFiles(fs, new Path(path), lastCheckpointTime).stream(); }, partitionsListParallelism); -// sort them by modification time. - eligibleFiles.sort(Comparator.comparingLong(FileStatus::getModificationTime)); +// sort them by modification time ascending. +List sortedEligibleFiles = eligibleFiles.stream() + .sorted(Comparator.comparingLong(FileStatus::getModificationTime)).collect(Collectors.toList()); Review comment: Do you mean the old logic has this bug?
[GitHub] [hudi] garyli1019 commented on a change in pull request #2613: [HUDI-1647] Supports snapshot read for Flink
garyli1019 commented on a change in pull request #2613: URL: https://github.com/apache/hudi/pull/2613#discussion_r587262267 ## File path: hudi-flink/src/main/resources/META-INF/services/org.apache.flink.table.factories.TableFactory ## @@ -0,0 +1,17 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +org.apache.hudi.factory.HoodieTableFactory Review comment: is this file necessary? ## File path: hudi-flink/src/test/java/org/apache/hudi/source/HoodieDataSourceITCase.java ## @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.source; + +import org.apache.hudi.operator.FlinkOptions; +import org.apache.hudi.operator.utils.TestConfigurations; + +import org.apache.flink.table.api.EnvironmentSettings; +import org.apache.flink.table.api.TableEnvironment; +import org.apache.flink.table.api.TableResult; +import org.apache.flink.table.api.config.ExecutionConfigOptions; +import org.apache.flink.table.api.internal.TableEnvironmentImpl; +import org.apache.flink.test.util.AbstractTestBase; +import org.apache.flink.types.Row; +import org.apache.flink.util.CollectionUtil; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.io.TempDir; + +import java.io.File; +import java.util.Comparator; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Objects; +import java.util.concurrent.ExecutionException; +import java.util.stream.Collectors; + +import static org.hamcrest.CoreMatchers.is; +import static org.hamcrest.MatcherAssert.assertThat; + +/** + * IT cases for Hoodie table source and sink. + * + * Note: should add more SQL cases when batch write is supported. 
+ */ +public class HoodieDataSourceITCase extends AbstractTestBase { + private TableEnvironment streamTableEnv; + private TableEnvironment batchTableEnv; + + @BeforeEach + void beforeEach() { +EnvironmentSettings settings = EnvironmentSettings.newInstance().build(); +streamTableEnv = TableEnvironmentImpl.create(settings); +streamTableEnv.getConfig().getConfiguration() + .setInteger(ExecutionConfigOptions.TABLE_EXEC_RESOURCE_DEFAULT_PARALLELISM, 1); +streamTableEnv.getConfig().getConfiguration() +.setString("execution.checkpointing.interval", "2s"); + +settings = EnvironmentSettings.newInstance().inBatchMode().build(); +batchTableEnv = TableEnvironmentImpl.create(settings); +batchTableEnv.getConfig().getConfiguration() + .setInteger(ExecutionConfigOptions.TABLE_EXEC_RESOURCE_DEFAULT_PARALLELISM, 1); + } + + @TempDir + File tempFile; + + @Test + void testStreamWriteBatchRead() { Review comment: is that possible to add more test cases to cover COW/MOR(without comapction)/MOR(with compaction) and query? ## File path: hudi-flink/src/test/resources/META-INF/services/org.apache.flink.table.factories.TableFactory ## @@ -0,0 +1,18 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless
[GitHub] [hudi] chaplinthink opened a new pull request #2629: [DOCS] Fix docs hive_sync path
chaplinthink opened a new pull request #2629: URL: https://github.com/apache/hudi/pull/2629 Fix docs hive_sync path ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.