[jira] [Comment Edited] (HUDI-1214) Need ability to set deltastreamer checkpoints when doing Spark datasource writes
[ https://issues.apache.org/jira/browse/HUDI-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182960#comment-17182960 ] Trevorzhang edited comment on HUDI-1214 at 8/24/20, 5:52 AM: - Hi [~vbalaji], I want to claim this JIRA, if no one does it. was (Author: trevorzhang): Hi Balaji Varadarajan, I want to claim this JIRA, if no one does it. > Need ability to set deltastreamer checkpoints when doing Spark datasource writes > Key: HUDI-1214 > URL: https://issues.apache.org/jira/browse/HUDI-1214 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration > Reporter: Balaji Varadarajan > Priority: Major > Fix For: 0.6.1 > Such support is needed for bootstrapping cases where users use a Spark write to do the initial bootstrap and then subsequently use DeltaStreamer. DeltaStreamer manages checkpoints inside hoodie commit files and expects checkpoints in previously committed metadata. Users are expected to pass a checkpoint or an initial checkpoint provider when performing the bootstrap through DeltaStreamer. Such support is not present when doing the bootstrap using the Spark datasource.
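For illustration, a minimal Scala sketch of what the requested capability might look like. The option name `hoodie.datasource.write.checkpoint` is hypothetical (it is roughly what this ticket asks for, not an existing option); `deltastreamer.checkpoint.key` is the key DeltaStreamer looks for in commit metadata.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Initial bootstrap through the Spark datasource. Today this commit carries no
// DeltaStreamer checkpoint, so a later DeltaStreamer run cannot resume from it.
def bootstrapWithCheckpoint(inputDF: DataFrame, basePath: String): Unit = {
  inputDF.write.format("hudi").
    option("hoodie.table.name", "my_table").
    option("hoodie.datasource.write.recordkey.field", "key").
    option("hoodie.datasource.write.precombine.field", "ts").
    // Hypothetical option this ticket asks for: seed the commit metadata with
    // the value DeltaStreamer expects under "deltastreamer.checkpoint.key".
    option("hoodie.datasource.write.checkpoint", "my_topic,0:1234").
    mode(SaveMode.Overwrite).
    save(basePath)
}
```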
[jira] [Commented] (HUDI-1201) HoodieDeltaStreamer: Allow user overrides to read from earliest kafka offset when commit files do not have checkpoint
[ https://issues.apache.org/jira/browse/HUDI-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182959#comment-17182959 ] Trevorzhang commented on HUDI-1201: --- Hi [~vbalaji], I want to claim this JIRA, if no one does it. > HoodieDeltaStreamer: Allow user overrides to read from earliest kafka offset when commit files do not have checkpoint > Key: HUDI-1201 > URL: https://issues.apache.org/jira/browse/HUDI-1201 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer > Reporter: Balaji Varadarajan > Priority: Major > Fix For: 0.6.1 > [https://github.com/apache/hudi/issues/1985] > It would be easier for the user to simply tell DeltaStreamer to read from the earliest offset instead of implementing --initial-checkpoint-provider or passing raw Kafka checkpoints when the table was initially bootstrapped through spark.write().
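As a rough Scala sketch of the desired setup: `auto.offset.reset` is the standard Kafka consumer setting that the Kafka source passes through, and the topic property name is DeltaStreamer's; the `TypedProperties` package path may differ across Hudi versions.

```scala
import org.apache.hudi.common.config.TypedProperties

val props = new TypedProperties()
props.setProperty("hoodie.deltastreamer.source.kafka.topic", "impressions")
// Standard Kafka consumer knob forwarded to the source. With the override this
// issue proposes, it would take effect whenever the previous commit carries no
// checkpoint, instead of requiring an initial checkpoint provider.
props.setProperty("auto.offset.reset", "earliest")
```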
[jira] [Commented] (HUDI-1214) Need ability to set deltastreamer checkpoints when doing Spark datasource writes
[ https://issues.apache.org/jira/browse/HUDI-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182960#comment-17182960 ] Trevorzhang commented on HUDI-1214: --- Hi Balaji Varadarajan, I want to claim this JIRA, if no one does it. > Need ability to set deltastreamer checkpoints when doing Spark datasource writes > Key: HUDI-1214 > URL: https://issues.apache.org/jira/browse/HUDI-1214 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration > Reporter: Balaji Varadarajan > Priority: Major > Fix For: 0.6.1 > Such support is needed for bootstrapping cases where users use a Spark write to do the initial bootstrap and then subsequently use DeltaStreamer. DeltaStreamer manages checkpoints inside hoodie commit files and expects checkpoints in previously committed metadata. Users are expected to pass a checkpoint or an initial checkpoint provider when performing the bootstrap through DeltaStreamer. Such support is not present when doing the bootstrap using the Spark datasource.
[GitHub] [hudi] bhasudha opened a new pull request #2016: [WIP] Add release page doc for 0.6.0
bhasudha opened a new pull request #2016: URL: https://github.com/apache/hudi/pull/2016
[GitHub] [hudi] garyli1019 commented on issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Queries
garyli1019 commented on issue #2013: URL: https://github.com/apache/hudi/issues/2013#issuecomment-678881198 @rubenssoto Hello, incremental pulling for MOR tables is currently under review and will be available in the 0.6.1 release, which will follow shortly after the 0.6.0 release.
[GitHub] [hudi] sreeram26 commented on pull request #2014: [HUDI-1153] Spark DataSource and Streaming Write must fail when operation type is misconfigured
sreeram26 commented on pull request #2014: URL: https://github.com/apache/hudi/pull/2014#issuecomment-678878193 @bvaradar
[GitHub] [hudi] Trevor-zhang commented on pull request #2015: [HUDI-1103] Fix Delete data demo in Quick-Start Guide
Trevor-zhang commented on pull request #2015: URL: https://github.com/apache/hudi/pull/2015#issuecomment-678877259 @nsivabalan can you take a look when you're free?
[jira] [Updated] (HUDI-1103) Improve the code format of Delete data demo in Quick-Start Guide
[ https://issues.apache.org/jira/browse/HUDI-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1103: - Labels: pull-request-available (was: ) > Improve the code format of Delete data demo in Quick-Start Guide > Key: HUDI-1103 > URL: https://issues.apache.org/jira/browse/HUDI-1103 > Project: Apache Hudi > Issue Type: Sub-task > Components: Docs > Reporter: wangxianghu > Assignee: Trevorzhang > Priority: Minor > Labels: pull-request-available > Fix For: 0.6.1 > Currently, the delete data demo code is not runnable in spark-shell > {code:java} > scala> val df = spark > df: org.apache.spark.sql.SparkSession = > org.apache.spark.sql.SparkSession@74e7d97bscala> .read > :1: error: illegal start of definition > .read > ^scala> .json(spark.sparkContext.parallelize(deletes, 2)) > :1: error: illegal start of definition > .json(spark.sparkContext.parallelize(deletes, 2)) > ^ > {code} > This dot symbol should be at the end of the line or put a "\" at the end
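For reference, the trailing-dot style the fix adopts keeps every line a continuation, so the snippet can be pasted into spark-shell as-is (a sketch assuming the quickstart's spark-shell session, where `deletes` is the dataset built earlier in the guide):

```scala
// Paste-safe in the Scala REPL: a line ending in "." tells the parser the
// expression continues, whereas a leading "." on the next line does not.
val df = spark.
  read.
  json(spark.sparkContext.parallelize(deletes, 2))
```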
[GitHub] [hudi] Trevor-zhang opened a new pull request #2015: [HUDI-1103] Fix Delete data demo in Quick-Start Guide
Trevor-zhang opened a new pull request #2015: URL: https://github.com/apache/hudi/pull/2015 Fix Delete data demo in Quick-Start Guide
[GitHub] [hudi] sreeram26 commented on a change in pull request #2014: [HUDI-1153] Spark DataSource and Streaming Write must fail when operation type is misconfigured
sreeram26 commented on a change in pull request #2014: URL: https://github.com/apache/hudi/pull/2014#discussion_r475306551
## File path: hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java
## @@ -248,15 +249,18 @@ public static HoodieWriteClient createHoodieClient(JavaSparkContext jssc, String }
   public static JavaRDD<WriteStatus> doWriteOperation(HoodieWriteClient client, JavaRDD<HoodieRecord> hoodieRecords,
-      String instantTime, String operation) throws HoodieException {
-    if (operation.equals(DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL())) {
+      String instantTime, WriteOperationType operation) throws HoodieException {
+    if (operation == WriteOperationType.BULK_INSERT) {
       Option<BulkInsertPartitioner> userDefinedBulkInsertPartitioner = createUserDefinedBulkInsertPartitioner(client.getConfig());
       return client.bulkInsert(hoodieRecords, instantTime, userDefinedBulkInsertPartitioner);
-    } else if (operation.equals(DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL())) {
+    } else if (operation == WriteOperationType.INSERT) {
       return client.insert(hoodieRecords, instantTime);
     } else {
       // default is upsert
+      if (operation != WriteOperationType.UPSERT) {
Review comment: I'm not throwing an explicit error here; based on the enum, the only other value it can potentially have is BOOTSTRAP, and the issue which exposed this would have thrown an exception in WriteOperationType.fromValue itself. I can change it to throw a HoodieException if the reviewer feels that is necessary.
[GitHub] [hudi] sreeram26 commented on a change in pull request #2014: [HUDI-1153] Spark DataSource and Streaming Write must fail when operation type is misconfigured
sreeram26 commented on a change in pull request #2014: URL: https://github.com/apache/hudi/pull/2014#discussion_r475306551
## File path: hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java
## @@ -248,15 +249,18 @@ public static HoodieWriteClient createHoodieClient(JavaSparkContext jssc, String }
   public static JavaRDD<WriteStatus> doWriteOperation(HoodieWriteClient client, JavaRDD<HoodieRecord> hoodieRecords,
-      String instantTime, String operation) throws HoodieException {
-    if (operation.equals(DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL())) {
+      String instantTime, WriteOperationType operation) throws HoodieException {
+    if (operation == WriteOperationType.BULK_INSERT) {
       Option<BulkInsertPartitioner> userDefinedBulkInsertPartitioner = createUserDefinedBulkInsertPartitioner(client.getConfig());
       return client.bulkInsert(hoodieRecords, instantTime, userDefinedBulkInsertPartitioner);
-    } else if (operation.equals(DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL())) {
+    } else if (operation == WriteOperationType.INSERT) {
       return client.insert(hoodieRecords, instantTime);
     } else {
       // default is upsert
+      if (operation != WriteOperationType.UPSERT) {
Review comment: I'm not throwing an explicit error here; based on the enum, the only other value it can potentially have is BOOTSTRAP, and the issue which exposed this would have thrown an exception in WriteOperationType.fromValue itself.
[jira] [Updated] (HUDI-1153) Spark DataSource and Streaming Write must fail when operation type is misconfigured
[ https://issues.apache.org/jira/browse/HUDI-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1153: - Labels: pull-request-available (was: ) > Spark DataSource and Streaming Write must fail when operation type is misconfigured > Key: HUDI-1153 > URL: https://issues.apache.org/jira/browse/HUDI-1153 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration > Reporter: Balaji Varadarajan > Assignee: Sreeram Ramji > Priority: Major > Labels: pull-request-available > Fix For: 0.6.1 > Context: [https://github.com/apache/hudi/issues/1902#issuecomment-669698259] > If you look at DataSourceUtils.java, [https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L257] we use string comparison to determine the operation type, which is a bad idea: a typo could result in "upsert" being used silently. > Just like [https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L187] used for DeltaStreamer, we need similar enums defined in DataSourceOptions.scala for OPERATION_OPT_KEY, but care must be taken to ensure we do not cause a backwards-compatibility issue by changing the property values. In other words, we need to retain the lower-case values ("bulk_insert", "insert" and "upsert") but make it an enum.
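A minimal Scala sketch of the shape such an enum could take (illustrative only; the actual fix lands in Java as `WriteOperationType`): keep the lower-case wire values for backwards compatibility, and fail fast instead of silently defaulting to upsert.

```scala
object WriteOperation extends Enumeration {
  // Retain the existing lower-case property values so user configs keep working.
  val BulkInsert = Value("bulk_insert")
  val Insert = Value("insert")
  val Upsert = Value("upsert")

  // A typo now fails loudly instead of silently falling through to upsert.
  def fromValue(value: String): Value =
    values.find(_.toString == value).getOrElse(
      throw new IllegalArgumentException(s"Invalid write operation: $value"))
}
```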
[GitHub] [hudi] sreeram26 opened a new pull request #2014: [HUDI-1153] Spark DataSource and Streaming Write must fail when operation type is misconfigured
sreeram26 opened a new pull request #2014: URL: https://github.com/apache/hudi/pull/2014 ## What is the purpose of the pull request Currently, for Spark streaming writes, the operation type is string-compared manually at most usage sites in the code, and illegal operation types are silently swallowed by defaulting to upsert. This PR addresses these issues. ## Brief change log - [HUDI-1153] Spark DataSource and Streaming Write must fail when operation type is misconfigured ## Verify this pull request This pull request is already covered by existing tests in TestDataSourceUtils: * testDoWriteOperationWithoutUserDefinedBulkInsertPartitioner * testDoWriteOperationWithNonExistUserDefinedBulkInsertPartitioner * testDoWriteOperationWithUserDefinedBulkInsertPartitioner If all existing tests pass, this should be good to review. - [ ] Existing tests pass ## Committer checklist - [x] Has a corresponding JIRA in PR title & commit - [x] Commit message is descriptive of the change - [ ] CI is green - [x] Necessary doc changes done or have another open PR - None - [x] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA - Not a large task
[GitHub] [hudi] yanghua commented on a change in pull request #1901: [HUDI-532]Add java doc for hudi test suite test classes
yanghua commented on a change in pull request #1901: URL: https://github.com/apache/hudi/pull/1901#discussion_r475291250 ## File path: hudi-integ-test/src/test/java/org/apache/hudi/integ/ITTestBase.java ## @@ -50,6 +50,9 @@ import static org.junit.jupiter.api.Assertions.assertEquals; import static org.junit.jupiter.api.Assertions.assertNotEquals; +/** + * Base test class for IT Test. helps to run cmd and generate data. Review comment: `Base test class for IT Test helps to run command and generate data.`? ## File path: hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestDFSHoodieTestSuiteWriterAdapter.java ## @@ -52,6 +52,9 @@ import org.junit.jupiter.api.Test; import org.mockito.Mockito; +/** + * {@link HoodieTestSuiteWriter}. Helps to test writing a DFS file. Review comment: `Helps`? This usage may not be correct? ## File path: hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/utils/TestUtils.java ## @@ -45,6 +48,15 @@ return dataGenerator.generateGenericRecords(numRecords); } + /** + * Method help to create avro files and save it to file. Review comment: `Methods`? ## File path: hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/configuration/TestWorkflowBuilder.java ## @@ -30,41 +30,58 @@ import org.apache.hudi.integ.testsuite.dag.WorkflowDag; import org.junit.jupiter.api.Test; +/** + * Unit test for the build process of {@link DagNode} and {@link WorkflowDag}. + */ public class TestWorkflowBuilder { @Test public void testWorkloadOperationSequenceBuilder() { Review comment: please remove all the comments of this method ## File path: hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/job/TestHoodieTestSuiteJob.java ## @@ -49,6 +49,9 @@ import org.junit.jupiter.params.provider.Arguments; import org.junit.jupiter.params.provider.MethodSource; +/** + * Unit tests against {@link HoodieTestSuiteJob}. Review comment: `Unit test`? ## File path: hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/converter/TestUpdateConverter.java ## @@ -49,11 +53,16 @@ public void teardown() { jsc.stop(); } + /** + * Test {@link UpdateConverter} by generates random updates from existing records. + */ @Test public void testGenerateUpdateRecordsFromInputRecords() throws Exception { +// 1. prepare input record Review comment: `record` -> `records` ## File path: hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/utils/TestUtils.java ## @@ -45,6 +48,15 @@ return dataGenerator.generateGenericRecords(numRecords); } + /** + * Method help to create avro files and save it to file. + * + * @param jsc {@link JavaSparkContext}. Review comment: We should not only use `{@link }` in the comment, add more description. ## File path: hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/dag/ComplexDagGenerator.java ## @@ -46,6 +51,7 @@ public WorkflowDag build() { .withNumInsertPartitions(1) .withRecordSize(1).build()); +// function to build ValidateNode with Review comment: with what?
[jira] [Updated] (HUDI-1103) Improve the code format of Delete data demo in Quick-Start Guide
[ https://issues.apache.org/jira/browse/HUDI-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Trevorzhang updated HUDI-1103: -- Description: {color}Currently, the delete data demo code is not runnable in spark-shell {code:java} scala> val df = spark df: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@74e7d97bscala> .read :1: error: illegal start of definition .read ^scala> .json(spark.sparkContext.parallelize(deletes, 2)) :1: error: illegal start of definition .json(spark.sparkContext.parallelize(deletes, 2)) ^ {code} This dot symbol should be at the end of the line or put a "\" at the end was: {color:red}着色文本{color}Currently, the delete data demo code is not runnable in spark-shell {code:java} scala> val df = spark df: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@74e7d97bscala> .read :1: error: illegal start of definition .read ^scala> .json(spark.sparkContext.parallelize(deletes, 2)) :1: error: illegal start of definition .json(spark.sparkContext.parallelize(deletes, 2)) ^ {code} This dot symbol should be at the end of the line or put a "\" at the end > Improve the code format of Delete data demo in Quick-Start Guide > Key: HUDI-1103 > URL: https://issues.apache.org/jira/browse/HUDI-1103 > Project: Apache Hudi > Issue Type: Sub-task > Components: Docs > Reporter: wangxianghu > Assignee: Trevorzhang > Priority: Minor > Fix For: 0.6.1 > {color}Currently, the delete data demo code is not runnable in spark-shell > {code:java} > scala> val df = spark > df: org.apache.spark.sql.SparkSession = > org.apache.spark.sql.SparkSession@74e7d97bscala> .read > :1: error: illegal start of definition > .read > ^scala> .json(spark.sparkContext.parallelize(deletes, 2)) > :1: error: illegal start of definition > .json(spark.sparkContext.parallelize(deletes, 2)) > ^ > {code} > This dot symbol should be at the end of the line or put a "\" at the end
[jira] [Updated] (HUDI-1103) Improve the code format of Delete data demo in Quick-Start Guide
[ https://issues.apache.org/jira/browse/HUDI-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Trevorzhang updated HUDI-1103: -- Description: {color:red}着色文本{color}Currently, the delete data demo code is not runnable in spark-shell {code:java} scala> val df = spark df: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@74e7d97bscala> .read :1: error: illegal start of definition .read ^scala> .json(spark.sparkContext.parallelize(deletes, 2)) :1: error: illegal start of definition .json(spark.sparkContext.parallelize(deletes, 2)) ^ {code} This dot symbol should be at the end of the line or put a "\" at the end was: Currently, the delete data demo code is not runnable in spark-shell {code:java} scala> val df = spark df: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@74e7d97bscala> .read :1: error: illegal start of definition .read ^scala> .json(spark.sparkContext.parallelize(deletes, 2)) :1: error: illegal start of definition .json(spark.sparkContext.parallelize(deletes, 2)) ^ {code} This dot symbol should be at the end of the line or put a "\" at the end > Improve the code format of Delete data demo in Quick-Start Guide > Key: HUDI-1103 > URL: https://issues.apache.org/jira/browse/HUDI-1103 > Project: Apache Hudi > Issue Type: Sub-task > Components: Docs > Reporter: wangxianghu > Assignee: Trevorzhang > Priority: Minor > Fix For: 0.6.1 > {color:red}着色文本{color}Currently, the delete data demo code is not runnable in spark-shell > {code:java} > scala> val df = spark > df: org.apache.spark.sql.SparkSession = > org.apache.spark.sql.SparkSession@74e7d97bscala> .read > :1: error: illegal start of definition > .read > ^scala> .json(spark.sparkContext.parallelize(deletes, 2)) > :1: error: illegal start of definition > .json(spark.sparkContext.parallelize(deletes, 2)) > ^ > {code} > This dot symbol should be at the end of the line or put a "\" at the end
[GitHub] [hudi] cdmikechen edited a comment on issue #2005: [SUPPORT] hudi hive-sync in master branch (0.6.1) can not run by spark
cdmikechen edited a comment on issue #2005: URL: https://github.com/apache/hudi/issues/2005#issuecomment-678860501 > @cdmikechen : Also, if you look at integration tests ITTestHoodieDemo, we cover the tests with hive syncing and this test has been passing for us. Can you take a look at the tests to see what the difference is ? @bvaradar I checked the `hudi-integ-test` package and found the reason: the `hudi-integ-test` pom.xml, which contains `ITTestHoodieDemo`, pulls in `hive-exec-2.3.1` in its pom dependencies. So if we instantiate a `MapredParquetInputFormat`, Hudi will use the class from `hive-exec-2.3.1`.
```java
package org.apache.hadoop.hive.ql.io.parquet;

import java.io.IOException;
import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedInputFormatInterface;
import org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport;
import org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.apache.parquet.hadoop.ParquetInputFormat;

public class MapredParquetInputFormat extends FileInputFormat<NullWritable, ArrayWritable>
    implements VectorizedInputFormatInterface {
```
But if we just use a standalone Spark environment without the hive-2.3.1 dependencies (like starting a new project that only depends on the Spark libs), Hudi will use `hive-exec-1.2.1-spark2`.
```java
package org.apache.hadoop.hive.ql.io.parquet;

import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedInputFormatInterface;
import org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport;
import org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.RecordReader;
import parquet.hadoop.ParquetInputFormat;

public class MapredParquetInputFormat extends FileInputFormat<NullWritable, ArrayWritable> {
```
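To see which jar actually supplied the class at runtime, a quick check that can be run from spark-shell (plain JVM reflection, no Hudi APIs involved):

```scala
// Prints the location of the jar the running JVM resolved the class from,
// which shows whether hive-exec 2.3.1 or the Spark-bundled 1.2.1 fork won.
val cls = Class.forName("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat")
println(cls.getProtectionDomain.getCodeSource.getLocation)
```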
[GitHub] [hudi] cdmikechen edited a comment on issue #2005: [SUPPORT] hudi hive-sync in master branch (0.6.1) can not run by spark
cdmikechen edited a comment on issue #2005: URL: https://github.com/apache/hudi/issues/2005#issuecomment-678860501 > @cdmikechen : Also, if you look at integration tests ITTestHoodieDemo, we cover the tests with hive syncing and this test has been passing for us. Can you take a look at the tests to see what the difference is ? @bvaradar I checked the `hudi-integ-test` package and found the reason: the `hudi-integ-test` pom.xml, which contains `ITTestHoodieDemo`, pulls in `hive-exec-2.3.1` in its pom dependencies. So if we instantiate a `MapredParquetInputFormat`, Hudi will use the class from `hive-exec-2.3.1`.
```java
package org.apache.hadoop.hive.ql.io.parquet;

import java.io.IOException;
import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedInputFormatInterface;
import org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport;
import org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.apache.parquet.hadoop.ParquetInputFormat;

public class MapredParquetInputFormat extends FileInputFormat<NullWritable, ArrayWritable>
    implements VectorizedInputFormatInterface {
```
But if we just use a standalone Spark environment without the hive-2.3.1 dependencies (like starting a new project that only depends on the Spark libs), Hudi will use `hive-exec-1.2.1-spark`.
```java
package org.apache.hadoop.hive.ql.io.parquet;

import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedInputFormatInterface;
import org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport;
import org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.RecordReader;
import parquet.hadoop.ParquetInputFormat;

public class MapredParquetInputFormat extends FileInputFormat<NullWritable, ArrayWritable> {
```
[GitHub] [hudi] cdmikechen edited a comment on issue #2005: [SUPPORT] hudi hive-sync in master branch (0.6.1) can not run by spark
cdmikechen edited a comment on issue #2005: URL: https://github.com/apache/hudi/issues/2005#issuecomment-678860501 > @cdmikechen : Also, if you look at integration tests ITTestHoodieDemo, we cover the tests with hive syncing and this test has been passing for us. Can you take a look at the tests to see what the difference is ? @bvaradar I checked the `hudi-integ-test` package and found the reason: the `hudi-integ-test` pom.xml, which contains `ITTestHoodieDemo`, pulls in `hive-exec-2.3.1` in its pom dependencies. So if we instantiate a `MapredParquetInputFormat`, Hudi will use the class from `hive-exec-2.3.1`.
```java
package org.apache.hadoop.hive.ql.io.parquet;

import java.io.IOException;
import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedInputFormatInterface;
import org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport;
import org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.apache.parquet.hadoop.ParquetInputFormat;

public class MapredParquetInputFormat extends FileInputFormat<NullWritable, ArrayWritable>
    implements VectorizedInputFormatInterface {
```
But if we just use a standalone Spark environment without the hive-2.3.1 dependencies (like starting a new project that only depends on the Spark libs), Hudi will use `hive-exec-1.2.1-spark`.
```java
package org.apache.hadoop.hive.ql.io.parquet;

import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedInputFormatInterface;
import org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport;
import org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.RecordReader;
import parquet.hadoop.ParquetInputFormat;

public class MapredParquetInputFormat extends FileInputFormat<NullWritable, ArrayWritable> {
```
[GitHub] [hudi] cdmikechen commented on issue #2005: [SUPPORT] hudi hive-sync in master branch (0.6.1) can not run by spark
cdmikechen commented on issue #2005: URL: https://github.com/apache/hudi/issues/2005#issuecomment-678860501 > @cdmikechen : Also, if you look at integration tests ITTestHoodieDemo, we cover the tests with hive syncing and this test has been passing for us. Can you take a look at the tests to see what the difference is ? @bvaradar I checked the `hudi-integ-test` package and found the reason: the `hudi-integ-test` pom.xml, which contains `ITTestHoodieDemo`, pulls in `hive-exec-2.3.1` in this dependency. So if we instantiate a `MapredParquetInputFormat`, Hudi will use the class from `hive-exec-2.3.1`.
```java
package org.apache.hadoop.hive.ql.io.parquet;

import java.io.IOException;
import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedInputFormatInterface;
import org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport;
import org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.apache.parquet.hadoop.ParquetInputFormat;

public class MapredParquetInputFormat extends FileInputFormat<NullWritable, ArrayWritable>
    implements VectorizedInputFormatInterface {
```
But if we just use a standalone Spark environment without the hive-2.3.1 dependencies (like starting a new project that only depends on the Spark libs), Hudi will use `hive-exec-1.2.1-spark`.
```java
package org.apache.hadoop.hive.ql.io.parquet;

import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedInputFormatInterface;
import org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport;
import org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.RecordReader;
import parquet.hadoop.ParquetInputFormat;

public class MapredParquetInputFormat extends FileInputFormat<NullWritable, ArrayWritable> {
```
[GitHub] [hudi] bvaradar commented on a change in pull request #1996: [BLOG] Async Compaction and Efficient Migration of large Parquet tables
bvaradar commented on a change in pull request #1996: URL: https://github.com/apache/hudi/pull/1996#discussion_r475285098 ## File path: docs/_posts/2020-08-21-async-compaction-deployment-model.md ## @@ -0,0 +1,99 @@ +--- +title: "Async Compaction Deployment Models" +excerpt: "Mechanisms for executing compaction jobs in Hudi asynchronously" +author: vbalaji +category: blog +--- + +We will look at different deployment models for executing compactions asynchronously. + +# Compaction + +For Merge-On-Read table, data is stored using a combination of columnar (e.g parquet) + row based (e.g avro) file formats. +Updates are logged to delta files & later compacted to produce new versions of columnar files synchronously or +asynchronously. One of the main motivations behind Merge-On-Read is to reduce data latency when ingesting records. +Hence, it makes sense to run compaction asynchronously without blocking ingestion. + + +# Async Compaction + +Async Compaction is performed in 2 steps: + +1. ***Compaction Scheduling***: This is done by the ingestion job. In this step, Hudi scans the partitions and selects **file +slices** to be compacted. A compaction plan is finally written to Hudi timeline. +1. ***Compaction Execution***: A separate process reads the compaction plan and performs compaction of file slices. + + +# Deployment Models + +There are a few ways by which we can execute compactions asynchronously. + +## Spark Structured Streaming + +With 0.6.0, we now have support for running async compactions in Spark +Structured Streaming jobs. Compactions are scheduled and executed asynchronously inside the +streaming job. Async Compactions are enabled by default for structured streaming jobs +on Merge-On-Read table. + +Here is an example snippet in java + +```properties +import org.apache.hudi.DataSourceWriteOptions; +import org.apache.hudi.HoodieDataSourceHelpers; +import org.apache.hudi.config.HoodieCompactionConfig; +import org.apache.hudi.config.HoodieWriteConfig; + +import org.apache.spark.sql.streaming.OutputMode; +import org.apache.spark.sql.streaming.ProcessingTime; + + + DataStreamWriter<Row> writer = streamingInput.writeStream().format("org.apache.hudi") +.option(DataSourceWriteOptions.OPERATION_OPT_KEY(), operationType) +.option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY(), tableType) +.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key") +.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition") +.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp") +.option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, "10") +.option(DataSourceWriteOptions.ASYNC_COMPACT_ENABLE_OPT_KEY(), "true") +.option(HoodieWriteConfig.TABLE_NAME, tableName).option("checkpointLocation", checkpointLocation) +.outputMode(OutputMode.Append()); + writer.trigger(new ProcessingTime(3)).start(tablePath); +``` + +## DeltaStreaminer Continuous Mode Review comment: Fixed. Thanks.
[GitHub] [hudi] bvaradar commented on issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Queries
bvaradar commented on issue #2013: URL: https://github.com/apache/hudi/issues/2013#issuecomment-678842778 @garyli1019 : I would let you answer this question.
[GitHub] [hudi] rubenssoto commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena
rubenssoto commented on issue #1981: URL: https://github.com/apache/hudi/issues/1981#issuecomment-678824833 @umehrot2 @vinothchandar Could the Path Filter improvements be achieved by updating the Hudi lib in Presto? EMR Presto is 0.232, and these improvements were made in 0.233.
[GitHub] [hudi] rubenssoto opened a new issue #2013: [SUPPORT] MoR tables SparkDataSource Incremental Queries
rubenssoto opened a new issue #2013: URL: https://github.com/apache/hudi/issues/2013 Hi Guys, I have a table that could be updated at any point in time, so I will try MoR tables. This table would be a source for my Redshift DW, so I need a way to pull the data incrementally. I saw that the Spark datasource only queries MoR tables in batch mode, so full support for Hudi in the Spark datasource, and full support for Hudi as a Spark Structured Streaming source, would be good. I found some JIRA tickets on this topic. https://issues.apache.org/jira/projects/HUDI/issues/HUDI-920?filter=allopenissues https://issues.apache.org/jira/projects/HUDI/issues/HUDI-1109?filter=allopenissues
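For comparison, incremental pulls already work on copy-on-write tables via the Spark datasource; a sketch using 0.6.x option names, assuming a spark-shell session and an existing Hudi table at `basePath`:

```scala
val basePath = "s3://bucket/my_table"  // assumed existing Hudi COW table

// Pull only records written after the given commit instant.
val incDF = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", "20200820000000").
  load(basePath)
```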
[GitHub] [hudi] sathyaprakashg commented on pull request #2012: HUDI-1129 Deltastreamer Add support for schema evolution
sathyaprakashg commented on pull request #2012: URL: https://github.com/apache/hudi/pull/2012#issuecomment-678806335 @bvaradar @sbernauer
[GitHub] [hudi] sathyaprakashg opened a new pull request #2012: HUDI-1129 Deltastreamer Add support for schema evolution
sathyaprakashg opened a new pull request #2012: URL: https://github.com/apache/hudi/pull/2012 ## What is the purpose of the pull request When the schema is evolved but the producer is still producing events using an older version of the schema, Hudi DeltaStreamer fails. This fix makes sure DeltaStreamer works fine with schema evolution. Related issues #1845 #1971 #1972 ## Brief change log - Update the avro-to-spark conversion method `AvroConversionHelper.createConverterToRow` to handle the scenario where the provided schema has more fields than the data (the producer is still sending events with the old schema) - Introduce a new schema provider class called `SchemaBasedSchemaProvider`. This is used to set the schema based on the schema of the data. Currently, `HoodieAvroUtils.avroToBytes` uses the schema of the data to convert to bytes, but `HoodieAvroUtils.bytesToAvro` uses the provided schema. Since the two may not always match, this results in an error. By using the data's schema via the new schema provider, we can ensure the same schema is used for converting avro to bytes and bytes back to avro. ## Verify this pull request This change added tests and can be verified as follows: - *Added unit test to verify schema evolution* Thanks @sbernauer for the unit test ## Committer checklist - [x] Has a corresponding JIRA in PR title & commit - [x] Commit message is descriptive of the change - [x] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
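Underneath, this is the standard Avro writer/reader-schema contract, not Hudi-specific code: bytes must be decoded with the exact schema they were written with, optionally resolved against a newer reader schema. A minimal self-contained sketch using plain Avro APIs (the `Event` schemas and field names are invented for illustration):

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.generic.{GenericDatumReader, GenericDatumWriter, GenericRecord, GenericRecordBuilder}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// Old (producer) schema, and a newer reader schema with one extra nullable field.
val oldSchema: Schema =
  SchemaBuilder.record("Event").fields().requiredString("key").endRecord()
val newSchema: Schema =
  SchemaBuilder.record("Event").fields().requiredString("key").optionalString("note").endRecord()

val record: GenericRecord = new GenericRecordBuilder(oldSchema).set("key", "k1").build()

// Encode with the producer's (old) schema.
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get.binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](oldSchema).write(record, encoder)
encoder.flush()

// Decode with writer = oldSchema resolved against reader = newSchema; decoding
// with newSchema alone would misread bytes laid out by the old schema.
val reader = new GenericDatumReader[GenericRecord](oldSchema, newSchema)
val decoded = reader.read(null, DecoderFactory.get.binaryDecoder(out.toByteArray, null))
// decoded.get("note") is null: the field missing from the old data takes its default.
```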
[jira] [Updated] (HUDI-1103) Improve the code format of Delete data demo in Quick-Start Guide
[ https://issues.apache.org/jira/browse/HUDI-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangxianghu updated HUDI-1103: -- Parent: HUDI-1215 Issue Type: Sub-task (was: Task) > Improve the code format of Delete data demo in Quick-Start Guide > Key: HUDI-1103 > URL: https://issues.apache.org/jira/browse/HUDI-1103 > Project: Apache Hudi > Issue Type: Sub-task > Components: Docs > Reporter: wangxianghu > Assignee: Trevorzhang > Priority: Minor > Fix For: 0.6.1 > Currently, the delete data demo code is not runnable in spark-shell > {code:java} > scala> val df = spark > df: org.apache.spark.sql.SparkSession = > org.apache.spark.sql.SparkSession@74e7d97bscala> .read > :1: error: illegal start of definition > .read > ^scala> .json(spark.sparkContext.parallelize(deletes, 2)) > :1: error: illegal start of definition > .json(spark.sparkContext.parallelize(deletes, 2)) > ^ > {code} > This dot symbol should be at the end of the line or put a "\" at the end
[jira] [Closed] (HUDI-1150) Fix unable to parse input partition field :1 exception when using TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang closed HUDI-1150. -- Resolution: Fixed Fixed via master branch: 35b21855da209c812e006c1afff3d940d5ac2a18 > Fix unable to parse input partition field :1 exception when using TimestampBasedKeyGenerator > Key: HUDI-1150 > URL: https://issues.apache.org/jira/browse/HUDI-1150 > Project: Apache Hudi > Issue Type: Bug > Reporter: wangxianghu > Assignee: wangxianghu > Priority: Major > Labels: pull-request-available > Fix For: 0.6.1 > Scenario to reproduce: > # use TimestampBasedKeyGenerator > # set hoodie.deltastreamer.keygen.timebased.timestamp.type = DATE_STRING > # partitionpath field value is null > When the partitionpath field value is null, TimestampBasedKeyGenerator will set it to 1L, which cannot be parsed correctly. > {code:java} > User class threw exception: java.util.concurrent.ExecutionException: > org.apache.hudi.exception.HoodieException: Job aborted due to stage failure: > Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 1.0 (TID 4, prod-t3-data-lake-007, executor 6): > org.apache.hudi.exception.HoodieDeltaStreamerException: Unable to parse input > partition field :1 > at > org.apache.hudi.keygen.TimestampBasedKeyGenerator.getPartitionPath(TimestampBasedKeyGenerator.java:156) > at > org.apache.hudi.keygen.CustomKeyGenerator.getPartitionPath(CustomKeyGenerator.java:108) > at > org.apache.hudi.keygen.CustomKeyGenerator.getKey(CustomKeyGenerator.java:78) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$9fce03f0$1(DeltaSync.java:343) > at > org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:394) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1334) > at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334) > at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1334) > at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1364) > at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1364) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.RuntimeException: > hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit is not > specified but scalar it supplied as time value > at > org.apache.hudi.keygen.TimestampBasedKeyGenerator.convertLongTimeToMillis(TimestampBasedKeyGenerator.java:163) > at > org.apache.hudi.keygen.TimestampBasedKeyGenerator.getPartitionPath(TimestampBasedKeyGenerator.java:138) > ... 29 more > {code}
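For reference, the configs involved: with `timestamp.type = DATE_STRING`, a null partition value falls back to a scalar (`1L`), which then requires the `...scalar.time.unit` property from the stack trace to be set, hence the error. A Scala sketch with illustrative values (the property names are the ones TimestampBasedKeyGenerator reads; the date formats here are examples only):

```scala
// Illustrative DATE_STRING setup for TimestampBasedKeyGenerator.
val props = new java.util.Properties()
props.setProperty("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING")
props.setProperty("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd HH:mm:ss")
props.setProperty("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd")
```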
[hudi] branch master updated: [HUDI-1150] Fix unable to parse input partition field :1 exception when using TimestampBasedKeyGenerator(#1920)
This is an automated email from the ASF dual-hosted git repository. vinoyang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 35b2185 [HUDI-1150] Fix unable to parse input partition field :1 exception when using TimestampBasedKeyGenerator(#1920) 35b2185 is described below commit 35b21855da209c812e006c1afff3d940d5ac2a18 Author: Mathieu AuthorDate: Sun Aug 23 19:56:50 2020 +0800 [HUDI-1150] Fix unable to parse input partition field :1 exception when using TimestampBasedKeyGenerator(#1920) --- .../main/java/org/apache/hudi/DataSourceUtils.java | 6 ++-- .../apache/hudi/keygen/RowKeyGeneratorHelper.java | 2 +- .../hudi/keygen/TimestampBasedKeyGenerator.java| 38 +--- ...rser.java => AbstractHoodieDateTimeParser.java} | 40 +- .../keygen/parser/HoodieDateTimeParserImpl.java| 17 +++-- .../keygen/TestTimestampBasedKeyGenerator.java | 39 +++-- 6 files changed, 109 insertions(+), 33 deletions(-) diff --git a/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java b/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java index ea2cc5c..19316d5 100644 --- a/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java +++ b/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java @@ -39,7 +39,7 @@ import org.apache.hudi.hive.HiveSyncConfig; import org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor; import org.apache.hudi.index.HoodieIndex; import org.apache.hudi.keygen.KeyGenerator; -import org.apache.hudi.keygen.parser.HoodieDateTimeParser; +import org.apache.hudi.keygen.parser.AbstractHoodieDateTimeParser; import org.apache.hudi.table.BulkInsertPartitioner; import org.apache.avro.LogicalTypes; @@ -172,9 +172,9 @@ public class DataSourceUtils { /** * Create a date time parser class for TimestampBasedKeyGenerator, passing in any configs needed. 
*/ - public static HoodieDateTimeParser createDateTimeParser(TypedProperties props, String parserClass) throws IOException { + public static AbstractHoodieDateTimeParser createDateTimeParser(TypedProperties props, String parserClass) throws IOException { try { - return (HoodieDateTimeParser) ReflectionUtils.loadClass(parserClass, props); + return (AbstractHoodieDateTimeParser) ReflectionUtils.loadClass(parserClass, props); } catch (Throwable e) { throw new IOException("Could not load date time parser class " + parserClass, e); } diff --git a/hudi-spark/src/main/java/org/apache/hudi/keygen/RowKeyGeneratorHelper.java b/hudi-spark/src/main/java/org/apache/hudi/keygen/RowKeyGeneratorHelper.java index 02b8492..4c05489 100644 --- a/hudi-spark/src/main/java/org/apache/hudi/keygen/RowKeyGeneratorHelper.java +++ b/hudi-spark/src/main/java/org/apache/hudi/keygen/RowKeyGeneratorHelper.java @@ -146,7 +146,7 @@ public class RowKeyGeneratorHelper { } valueToProcess = (Row) valueToProcess.get(positions.get(index)); } else { // last index -if (valueToProcess.getAs(positions.get(index)).toString().isEmpty()) { +if (null != valueToProcess.getAs(positions.get(index)) && valueToProcess.getAs(positions.get(index)).toString().isEmpty()) { toReturn = EMPTY_RECORDKEY_PLACEHOLDER; break; } diff --git a/hudi-spark/src/main/java/org/apache/hudi/keygen/TimestampBasedKeyGenerator.java b/hudi-spark/src/main/java/org/apache/hudi/keygen/TimestampBasedKeyGenerator.java index 25a52fe..97a7d2e 100644 --- a/hudi-spark/src/main/java/org/apache/hudi/keygen/TimestampBasedKeyGenerator.java +++ b/hudi-spark/src/main/java/org/apache/hudi/keygen/TimestampBasedKeyGenerator.java @@ -26,7 +26,7 @@ import org.apache.hudi.common.util.Option; import org.apache.hudi.exception.HoodieDeltaStreamerException; import org.apache.hudi.exception.HoodieException; import org.apache.hudi.exception.HoodieNotSupportedException; -import org.apache.hudi.keygen.parser.HoodieDateTimeParser; +import org.apache.hudi.keygen.parser.AbstractHoodieDateTimeParser; import org.apache.hudi.keygen.parser.HoodieDateTimeParserImpl; import org.apache.avro.generic.GenericRecord; @@ -41,6 +41,7 @@ import java.io.Serializable; import java.io.UnsupportedEncodingException; import java.net.URLEncoder; import java.nio.charset.StandardCharsets; +import java.util.TimeZone; import java.util.concurrent.TimeUnit; import static java.util.concurrent.TimeUnit.MILLISECONDS; @@ -63,10 +64,11 @@ public class TimestampBasedKeyGenerator extends SimpleKeyGenerator { private final String outputDateFormat; private transient Option inputFormatter; private transient DateTimeFormatter partitionFormatter; - private final HoodieDateTimeParser parser; + private final AbstractHoodieDateTimeParser parser; // TimeZone detailed settings reference // https://docs.oracle.com/javase/8/doc
[GitHub] [hudi] yanghua merged pull request #1920: [HUDI-1150] Fix unable to parse input partition field :1 exception when using TimestampBasedKeyGenerator
yanghua merged pull request #1920: URL: https://github.com/apache/hudi/pull/1920
[jira] [Commented] (HUDI-1215) Ensure all commands in quick start are copy pastable
[ https://issues.apache.org/jira/browse/HUDI-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182665#comment-17182665 ] sivabalan narayanan commented on HUDI-1215: --- Sure [~wangxianghu], sounds good. > Ensure all commands in quick start are copy pastable > Key: HUDI-1215 > URL: https://issues.apache.org/jira/browse/HUDI-1215 > Project: Apache Hudi > Issue Type: Bug > Components: Docs > Affects Versions: 0.6.1 > Reporter: sivabalan narayanan > Assignee: sivabalan narayanan > Priority: Major > I see that delete commands are not directly copy pastable. Fix all such commands in quick start.
[jira] [Assigned] (HUDI-1215) Ensure all commands in quick start are copy pastable
[ https://issues.apache.org/jira/browse/HUDI-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-1215: - Assignee: wangxianghu (was: sivabalan narayanan) > Ensure all commands in quick start are copy pastable > Key: HUDI-1215 > URL: https://issues.apache.org/jira/browse/HUDI-1215 > Project: Apache Hudi > Issue Type: Bug > Components: Docs > Affects Versions: 0.6.1 > Reporter: sivabalan narayanan > Assignee: wangxianghu > Priority: Major > I see that delete commands are not directly copy pastable. Fix all such commands in quick start.
[GitHub] [hudi] bhasudha opened a new pull request #2011: [DOC] Change reference from `Presto` to `PrestoDB`
bhasudha opened a new pull request #2011: URL: https://github.com/apache/hudi/pull/2011
svn commit: r41075 - in /release/hudi/hudi-0.6.0: ./ hudi-0.6.0.src.tgz hudi-0.6.0.src.tgz.asc hudi-0.6.0.src.tgz.sha512
Author: bhavanisudha Date: Sun Aug 23 08:02:57 2020 New Revision: 41075 Log: Apache Hudi 0.6.0 source release Added: release/hudi/hudi-0.6.0/ release/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz (with props) release/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz.asc release/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz.sha512 Added: release/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz == Binary file - no diff available. Propchange: release/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz -- svn:mime-type = application/octet-stream Added: release/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz.asc == --- release/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz.asc (added) +++ release/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz.asc Sun Aug 23 08:02:57 2020 @@ -0,0 +1,16 @@ +-BEGIN PGP SIGNATURE- + +iQIzBAABCAAdFiEEf2bNTOmQmDooRnIpMiTyAOH8IXIFAl9CIDMACgkQMiTyAOH8 +IXJpJw//b5kQILuHPgiU/z0JXJkNpH9hs/OwjhUPP30lq9doEkCZ/DU/ZMP34has +JWdYl3Qjin3OGpFoFWKXocxqovO8ACKP5Fo+ktqP5lAVgjZ/W9WXctCaRG/li3VR +QhHOYeMho3s+hK2DOitexw4+PdCRFtVQ5vjSY9UpuvdzxZ5cXrj13wQ3b4N/pMnA +tPTXzVj2UetVZaWQ59A72yWF9MZFeMuI/cRP1DJhVAGw8MNbgSDmZH+5H5avCvj+ +1ycwuTFcutP+6Fe4Acer5MysxaccGRuTbrODMuKjAhIqbo0pxjQ2UCOKDRdHvGB3 +4p1nun3+7gqoTfTqPJ5jbnvCGKGD777S8MysXBxzCyySneeL5Hn/QQxV2Fm7/Wkd +ZrDtp669UkPA4o4MjqxrYpdbV4WkDI4Nggi2ITg6dKznQxSlwnpP+evPCc+rh0+S +Av52nG35cudpseBPfCplonEI+dWJLjyf9O0cju2x2J2XIzIXjMhZ3IdG6Z4cL+n9 +40wdpGizbSdqf1RfC1UTTfndENilmLNbIhfFWhfBWJrXCFaINPUeXrheMQI5pMVC +k8FXN6ol+9XVMJuXElpsO5s3HornM7+OKm71WEwmIDqX0iRkqu0DGz0NonZSHIhS +4EQMbhzzmNczWtpLQ4HyJtkeRj9TaW7csF6gufKw5PhHsSnbKJ4= +=JyJe +-END PGP SIGNATURE- Added: release/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz.sha512 == --- release/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz.sha512 (added) +++ release/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz.sha512 Sun Aug 23 08:02:57 2020 @@ -0,0 +1 @@ +80255cf9b62c548eebe6306d39acf04f66113482552e7acb653e225644ae4f1ae8892af1a737262ac737737dc9ca4da7117d5a9f05377c79c86e90ee11e7d89a hudi-0.6.0.src.tgz
[jira] [Comment Edited] (HUDI-1215) Ensure all commands in quick start are copy pastable
[ https://issues.apache.org/jira/browse/HUDI-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182617#comment-17182617 ] wangxianghu edited comment on HUDI-1215 at 8/23/20, 8:00 AM: - Hi [~shivnarayan], this issue contains HUDI-1103, may I make 1103 a sub-task of this one? BTW, may I take this issue? :) was (Author: wangxianghu): Hi [~shivnarayan], this issue contains HUDI-1103 (https://issues.apache.org/jira/browse/HUDI-1103), may I make 1103 a sub-task of this one? BTW, may I take this issue? :) > Ensure all commands in quick start are copy pastable > Key: HUDI-1215 > URL: https://issues.apache.org/jira/browse/HUDI-1215 > Project: Apache Hudi > Issue Type: Bug > Components: Docs > Affects Versions: 0.6.1 > Reporter: sivabalan narayanan > Assignee: sivabalan narayanan > Priority: Major > I see that delete commands are not directly copy pastable. Fix all such commands in quick start.
[jira] [Commented] (HUDI-1215) Ensure all commands in quick start are copy pastable
[ https://issues.apache.org/jira/browse/HUDI-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182617#comment-17182617 ] wangxianghu commented on HUDI-1215: --- Hi [~shivnarayan], this issue contains HUDI-1103 (https://issues.apache.org/jira/browse/HUDI-1103), may I make 1103 a sub-task of this one? BTW, may I take this issue? :) > Ensure all commands in quick start are copy pastable > Key: HUDI-1215 > URL: https://issues.apache.org/jira/browse/HUDI-1215 > Project: Apache Hudi > Issue Type: Bug > Components: Docs > Affects Versions: 0.6.1 > Reporter: sivabalan narayanan > Assignee: sivabalan narayanan > Priority: Major > I see that delete commands are not directly copy pastable. Fix all such commands in quick start.
[jira] [Assigned] (HUDI-1216) Create chinese version of pyspark quickstart example
[ https://issues.apache.org/jira/browse/HUDI-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang reassigned HUDI-1216: -- Assignee: wangxianghu (was: vinoyang) > Create chinese version of pyspark quickstart example > Key: HUDI-1216 > URL: https://issues.apache.org/jira/browse/HUDI-1216 > Project: Apache Hudi > Issue Type: Improvement > Components: docs-chinese > Reporter: Balaji Varadarajan > Assignee: wangxianghu > Priority: Major > Fix For: 0.6.1 > The quickstart page in English (from version 0.5.3 onwards) has a pyspark example, but the Chinese version does not.
svn commit: r41074 - in /dev/hudi/hudi-0.6.0: ./ hudi-0.6.0.src.tgz hudi-0.6.0.src.tgz.asc hudi-0.6.0.src.tgz.sha512
Author: bhavanisudha Date: Sun Aug 23 07:24:29 2020 New Revision: 41074 Log: Staging source releases for release-0.6.0 Added: dev/hudi/hudi-0.6.0/ dev/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz (with props) dev/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz.asc dev/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz.sha512 Added: dev/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz == Binary file - no diff available. Propchange: dev/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz -- svn:mime-type = application/octet-stream Added: dev/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz.asc == --- dev/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz.asc (added) +++ dev/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz.asc Sun Aug 23 07:24:29 2020 @@ -0,0 +1,16 @@ +-BEGIN PGP SIGNATURE- + +iQIzBAABCAAdFiEEf2bNTOmQmDooRnIpMiTyAOH8IXIFAl9CCq4ACgkQMiTyAOH8 +IXKRLA/9E50KbJqMwj7/TsJb93RKauBPj2kcc75F+ZE7Hy6Iypt1++rQ5E22a+FZ +huOjCOsmKBCNMwkpc4NQdGz4iRrEYnQiCjTdNqQRFGA7n8hcXJLKbFSs0AhPR4qJ +F0kWafpVtyt71s2MacPt44VgO3yfRswUmWzKGOeX1hef91fWI4O6JuJEIoeordE9 +KlI1GIckh5L3WyeFnd4EFX7Jc4joaDi4NNJLE+3Hg730lJgZHvXUwatWxPpb0Ccm +WrzUWSZkjkj8hnHHljAMJmbXpOh8zi2IUvxQoiuuv6KhC2GEY/fhF9scchoTdqs2 +dIxlNhVD0y5j0ZSJydZGQLEhC6btD2Encvu1FB+wT/w380izqote1/YbGKsNOh/f +9p6Oenioo3Gqfd6OtsKSaPNNsaNN2PlvSmHdJMlLbLyljYNDjJ24c/QGE1/c+NDa +KDJF3Lj56OhESNR251FHJDVCA6mRTroboCR2VTdlg5QBMlgDaZyyJ1u1atMT2lAF +1ZFNJCV8Q/Y5ospzqVaii9eSZuTiHJh3UEfLZuhvsU2pF4it3ew3G2j8dugE4Kxf +3dGvzIEZNTyI3fBqrTc0Q+/I7ZWiyEUbMiOnp1e6lbjsN0d9QWkLHlw6gaNwdR6Z +NoR/UhH9Y+0gX4/GqnuG86xxdCCuh2KhO6L2wEtH0K/r+BgKKyc= +=Iuob +-END PGP SIGNATURE- Added: dev/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz.sha512 == --- dev/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz.sha512 (added) +++ dev/hudi/hudi-0.6.0/hudi-0.6.0.src.tgz.sha512 Sun Aug 23 07:24:29 2020 @@ -0,0 +1 @@ +f9c37064631d6c0e6d2bb143f639dfd03b1ca46e882643b5570d5f8819c2805a4f9dd91cb6ced1cdc625aeebdf94b69dd7bc0133444652a2f5f54358f5e43053 hudi-0.6.0.src.tgz