[GitHub] [hudi] Liulietong commented on a change in pull request #2584: [Hudi-1583]: Fix bug that Hudi will skip remaining log files if there is logFile with zero size in logFileList when merge on read

2021-02-25 Thread GitBox


Liulietong commented on a change in pull request #2584:
URL: https://github.com/apache/hudi/pull/2584#discussion_r583434679



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatReader.java
##
@@ -104,7 +104,7 @@ public boolean hasNext() {
 throw new HoodieIOException("unable to initialize read with log file 
", io);
   }
   LOG.info("Moving to the next reader for logfile " + 
currentReader.getLogFile());
-  return this.currentReader.hasNext();
+  return this.currentReader.hasNext() || hasNext();

Review comment:
   1. When 'spark.speculation' is enabled, two tasks try to append to the same 
log file; the second one creates a new logFile because it cannot obtain the 
lease on the file. The second task leaves behind a zero-size log file when the 
first task succeeds.
   2. Good idea
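   For readers following the thread, here is a minimal, self-contained sketch of 
the recursive idea behind the changed line; it is illustrative only and not the 
actual `HoodieLogFormatReader` code:
   ```java
   import java.util.Iterator;
   import java.util.List;
   import java.util.NoSuchElementException;

   // Sketch of a reader-of-readers whose hasNext() skips exhausted (e.g. zero-size)
   // inner readers instead of stopping at the first empty one.
   class ChainedReader<T> implements Iterator<T> {

     private final List<Iterator<T>> readers;
     private int pos = 0;

     ChainedReader(List<Iterator<T>> readers) {
       this.readers = readers;
     }

     @Override
     public boolean hasNext() {
       if (pos < readers.size() && readers.get(pos).hasNext()) {
         return true;
       }
       if (pos + 1 < readers.size()) {
         pos++;              // move past an empty reader, e.g. a zero-size log file
         return hasNext();   // recurse so the remaining readers are still visited
       }
       return false;
     }

     @Override
     public T next() {
       if (!hasNext()) {
         throw new NoSuchElementException();
       }
       return readers.get(pos).next();
     }
   }
   ```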





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Liulietong commented on a change in pull request #2584: [Hudi-1583]: Fix bug that Hudi will skip remaining log files if there is logFile with zero size in logFileList when merge on read

2021-02-25 Thread GitBox


Liulietong commented on a change in pull request #2584:
URL: https://github.com/apache/hudi/pull/2584#discussion_r583434679



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatReader.java
##
@@ -104,7 +104,7 @@ public boolean hasNext() {
 throw new HoodieIOException("unable to initialize read with log file 
", io);
   }
   LOG.info("Moving to the next reader for logfile " + 
currentReader.getLogFile());
-  return this.currentReader.hasNext();
+  return this.currentReader.hasNext() || hasNext();

Review comment:
   1. When 'spark.speculation' is enabled, two tasks try to append to the same 
log file; the second one creates a new logFile because it cannot obtain the 
lease on the file. The second task leaves behind a zero-size log file when the 
first task succeeds.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2607: [HUDI-1643] Hudi observability - framework to report stats from execu…

2021-02-25 Thread GitBox


codecov-io edited a comment on pull request #2607:
URL: https://github.com/apache/hudi/pull/2607#issuecomment-786454326


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2607?src=pr=h1) Report
   > Merging 
[#2607](https://codecov.io/gh/apache/hudi/pull/2607?src=pr=desc) (ab93c26) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/022df0d1b134422f7b6f305cd7ec04b25caa23f0?el=desc)
 (022df0d) will **increase** coverage by `18.28%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2607/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2607?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master     #2607       +/-   ##
   ===============================================
   + Coverage     51.26%    69.54%    +18.28%
   + Complexity     3241       363      -2878
   ===============================================
     Files           438        53       -385
     Lines         20126      1944     -18182
     Branches       2079       235      -1844
   ===============================================
   - Hits          10318      1352      -8966
   + Misses         8955       458      -8497
   + Partials        853       134       -719
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.54% <ø> (+0.10%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2607?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...nal/HoodieBulkInsertDataInternalWriterFactory.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3BhcmsyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2ludGVybmFsL0hvb2RpZUJ1bGtJbnNlcnREYXRhSW50ZXJuYWxXcml0ZXJGYWN0b3J5LmphdmE=)
 | | | |
   | 
[...in/java/org/apache/hudi/common/model/BaseFile.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0Jhc2VGaWxlLmphdmE=)
 | | | |
   | 
[...di/common/table/timeline/HoodieActiveTimeline.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL0hvb2RpZUFjdGl2ZVRpbWVsaW5lLmphdmE=)
 | | | |
   | 
[...oop/realtime/HoodieParquetRealtimeInputFormat.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3JlYWx0aW1lL0hvb2RpZVBhcnF1ZXRSZWFsdGltZUlucHV0Rm9ybWF0LmphdmE=)
 | | | |
   | 
[...mmon/table/log/HoodieUnMergedLogRecordScanner.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVVbk1lcmdlZExvZ1JlY29yZFNjYW5uZXIuamF2YQ==)
 | | | |
   | 
[...e/timeline/versioning/clean/CleanPlanMigrator.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL3ZlcnNpb25pbmcvY2xlYW4vQ2xlYW5QbGFuTWlncmF0b3IuamF2YQ==)
 | | | |
   | 
[...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZVNwYXJrU3FsV3JpdGVyLnNjYWxh)
 | | | |
   | 
[...common/table/view/AbstractTableFileSystemView.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvQWJzdHJhY3RUYWJsZUZpbGVTeXN0ZW1WaWV3LmphdmE=)
 | | | |
   | 
[.../hadoop/utils/HoodieRealtimeRecordReaderUtils.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3V0aWxzL0hvb2RpZVJlYWx0aW1lUmVjb3JkUmVhZGVyVXRpbHMuamF2YQ==)
 | | | |
   | 
[...util/jvm/OpenJ9MemoryLayoutSpecification32bit.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvanZtL09wZW5KOU1lbW9yeUxheW91dFNwZWNpZmljYXRpb24zMmJpdC5qYXZh)
 | | | |
   | ... and [376 
more](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree-more) | |
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and 

[GitHub] [hudi] codecov-io commented on pull request #2607: [HUDI-1643] Hudi observability - framework to report stats from execu…

2021-02-25 Thread GitBox


codecov-io commented on pull request #2607:
URL: https://github.com/apache/hudi/pull/2607#issuecomment-786454326


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2607?src=pr=h1) Report
   > Merging 
[#2607](https://codecov.io/gh/apache/hudi/pull/2607?src=pr=desc) (ab93c26) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/022df0d1b134422f7b6f305cd7ec04b25caa23f0?el=desc)
 (022df0d) will **decrease** coverage by `41.64%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2607/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2607?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master     #2607       +/-   ##
   ===============================================
   - Coverage     51.26%     9.61%    -41.65%
   + Complexity     3241        48      -3193
   ===============================================
     Files           438        53       -385
     Lines         20126      1944     -18182
     Branches       2079       235      -1844
   ===============================================
   - Hits          10318       187     -10131
   + Misses         8955      1744      -7211
   + Partials        853        13       -840
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.61% <ø> (-59.83%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2607?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
 | `0.00% <0.00%> 

[GitHub] [hudi] t0il3ts0ap edited a comment on issue #2589: [SUPPORT] Issue with adding column while running deltastreamer with kafka source.

2021-02-25 Thread GitBox


t0il3ts0ap edited a comment on issue #2589:
URL: https://github.com/apache/hudi/issues/2589#issuecomment-786029402


   @satishkotha Ran again on fresh table, still same issue. 
   
   SparkSubmit:
   ```
   spark-submit 
   --master yarn 
   --packages 
org.apache.spark:spark-avro_2.12:3.0.1,org.apache.hudi:hudi-utilities-bundle_2.12:0.7.0
 
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
   --conf spark.scheduler.mode=FAIR 
   --conf spark.task.maxFailures=5 
   --conf spark.rdd.compress=true 
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
   --conf spark.shuffle.service.enabled=true 
   --conf spark.sql.hive.convertMetastoreParquet=false 
   --conf spark.executor.heartbeatInterval=120s 
   --conf spark.network.timeout=600s 
   --conf spark.yarn.max.executor.failures=5 
   --conf spark.sql.catalogImplementation=hive 
   --conf 
"spark.driver.extraJavaOptions=-Dlog4j.configuration=/home/hadoop/log4j-spark.properties"
 
   --conf 
"spark.executor.extraJavaOptions=-Dlog4j.configuration=/home/hadoop/log4j-spark.properties"
 
   --deploy-mode client 
s3://navi-emr-poc/delta-streamer-test/jars/deltastreamer-addons-1.0-SNAPSHOT.jar
 
   --enable-sync 
   --hoodie-conf 
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
 
   --hoodie-conf hoodie.parquet.compression.codec=snappy 
   --hoodie-conf 
partition.assignment.strategy=org.apache.kafka.clients.consumer.RangeAssignor 
   --hoodie-conf 
hoodie.deltastreamer.schemaprovider.spark_avro_post_processor.enable=false 
   --hoodie-conf auto.offset.reset=latest 
   --hoodie-conf hoodie.avro.schema.validate=true
   --table-type MERGE_ON_READ 
   --source-class org.apache.hudi.utilities.sources.AvroKafkaSource 
   --schemaprovider-class 
org.apache.hudi.utilities.schema.NullTargetSchemaRegistryProvider  
   --props 
s3://navi-emr-poc/delta-streamer-test/config/kafka-source-nonprod.properties 
   --transformer-class com.navi.transform.DebeziumTransformer 
   --continuous 
   --hoodie-conf 
hoodie.deltastreamer.schemaprovider.registry.url=https://dev-dataplatform-schema-registry.np.navi-tech.in/subjects/to_be_deleted_service.public.accounts-value/versions/latest
 
   --hoodie-conf hoodie.datasource.hive_sync.database=to_be_deleted_service 
   --hoodie-conf hoodie.datasource.hive_sync.table=accounts 
   --hoodie-conf hoodie.datasource.write.recordkey.field=id 
   --hoodie-conf hoodie.datasource.write.precombine.field=__lsn 
   --hoodie-conf hoodie.datasource.write.partitionpath.field='' 
   --hoodie-conf 
hoodie.deltastreamer.source.kafka.topic=to_be_deleted_service.public.accounts 
   --hoodie-conf group.id=delta-streamer-to_be_deleted_service-accounts 
   --source-ordering-field __lsn 
   --target-base-path s3://navi-emr-poc/raw-data/to_be_deleted_service/accounts 
   --target-table accounts
   ```
   
   Transformer Code:
   ```
   import org.apache.hudi.common.config.TypedProperties; // org.apache.hudi.common.util.TypedProperties on older Hudi releases
   import org.apache.hudi.utilities.transform.Transformer;
   import org.apache.spark.api.java.JavaSparkContext;
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   import org.apache.spark.sql.types.DataTypes;

   // Assumes a logger field named "log" is available (e.g. via Lombok's @Slf4j).
   public class DebeziumTransformer implements Transformer {

     public Dataset<Row> apply(JavaSparkContext javaSparkContext, SparkSession sparkSession,
         Dataset<Row> dataset, TypedProperties typedProperties) {

       Dataset<Row> transformedDataset = dataset
           .withColumn("__deleted", dataset.col("__deleted").cast(DataTypes.BooleanType))
           .withColumnRenamed("__deleted", "_hoodie_is_deleted")
           .drop("__op", "__source_ts_ms");

       log.info("TRANSFORMER SCHEMA STARTS");
       transformedDataset.printSchema();
       transformedDataset.show();
       log.info("TRANSFORMER SCHEMA ENDS");
       return transformedDataset;
     }
   }
   ```
   
   When I add the column, Debezium updates the schema registry instantaneously 
and new records start flowing. It's possible that DeltaStreamer gets records 
with the new schema before it even hits the schema registry.
   
   ```
   Caused by: org.apache.avro.AvroTypeException: Found 
hoodie.source.hoodie_source, expecting hoodie.source.hoodie_source, missing 
required field test
   ```
   
   Attaching logs:
   [alog.txt](https://github.com/apache/hudi/files/6044367/alog.txt)
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] liujinhui1994 commented on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp

2021-02-25 Thread GitBox


liujinhui1994 commented on pull request #2438:
URL: https://github.com/apache/hudi/pull/2438#issuecomment-786452099


   I will add the unit test; please review it after that.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1643) [Hudi Observability] Framework for reporting stats from executors

2021-02-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1643:
-
Labels: pull-request-available  (was: )

> [Hudi Observability] Framework for reporting stats from executors
> -
>
> Key: HUDI-1643
> URL: https://issues.apache.org/jira/browse/HUDI-1643
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Balajee Nagasubramaniam
>Priority: Major
>  Labels: pull-request-available
>
> Hudi Observability framework to report stats from executors, using the 
> distributed registry.
>- Hudi Write Stage Performance stats.
>- Bloom Index stage stats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] yanghua commented on a change in pull request #2596: [HUDI-1636] Support Builder Pattern To Build Table Properties For Hoo…

2021-02-25 Thread GitBox


yanghua commented on a change in pull request #2596:
URL: https://github.com/apache/hudi/pull/2596#discussion_r583407438



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
##
@@ -258,4 +260,142 @@ public String getArchivelogFolder() {
   public Properties getProperties() {
 return props;
   }
+
+  public static PropertyBuilder propertyBuilder() {
+return new PropertyBuilder();
+  }
+
+  public static class PropertyBuilder {
+
+private HoodieTableType tableType;
+

Review comment:
   If we do not add comments, let us remove the empty line to make the code 
more compact. wdyt?

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
##
@@ -258,4 +260,142 @@ public String getArchivelogFolder() {
   public Properties getProperties() {
 return props;
   }
+
+  public static PropertyBuilder propertyBuilder() {
+return new PropertyBuilder();
+  }
+
+  public static class PropertyBuilder {
+
+private HoodieTableType tableType;
+
+private String tableName;
+
+private String archiveLogFolder;
+
+private String payloadClassName;
+
+private Integer timelineLayoutVersion;
+
+private String baseFileFormat;
+
+private String preCombineField;
+
+private String bootstrapIndexClass;
+
+private String bootstrapBasePath;
+
+private PropertyBuilder() {
+
+}
+
+public PropertyBuilder setTableType(HoodieTableType tableType) {
+  this.tableType = tableType;
+  return this;
+}
+
+public PropertyBuilder setTableType(String tableType) {
+  return setTableType(HoodieTableType.valueOf(tableType));
+}
+
+public PropertyBuilder setTableName(String tableName) {
+  this.tableName = tableName;
+  return this;
+}
+
+public PropertyBuilder setArchiveLogFolder(String archiveLogFolder) {
+  this.archiveLogFolder = archiveLogFolder;
+  return this;
+}
+
+public PropertyBuilder setPayloadClassName(String payloadClassName) {
+  this.payloadClassName = payloadClassName;
+  return this;
+}
+
+public PropertyBuilder setPayloadClass(Class payloadClass) {
+  return setPayloadClassName(payloadClass.getName());
+}
+
+public PropertyBuilder setTimelineLayoutVersion(Integer 
timelineLayoutVersion) {
+  this.timelineLayoutVersion = timelineLayoutVersion;
+  return this;
+}
+
+public PropertyBuilder setBaseFileFormat(String baseFileFormat) {
+  this.baseFileFormat = baseFileFormat;
+  return this;
+}
+
+public PropertyBuilder setPreCombineField(String preCombineField) {
+  this.preCombineField = preCombineField;
+  return this;
+}
+
+public PropertyBuilder setBootstrapIndexClass(String bootstrapIndexClass) {
+  this.bootstrapIndexClass = bootstrapIndexClass;
+  return this;
+}
+
+public PropertyBuilder setBootstrapBasePath(String bootstrapBasePath) {
+  this.bootstrapBasePath = bootstrapBasePath;
+  return this;
+}
+
+public PropertyBuilder fromMetaClient(HoodieTableMetaClient metaClient) {
+  return setTableType(metaClient.getTableType())
+.setTableName(metaClient.getTableConfig().getTableName())
+.setArchiveLogFolder(metaClient.getArchivePath())
+.setPayloadClassName(metaClient.getTableConfig().getPayloadClass());
+}
+
+public Properties build() {

Review comment:
   It seems we do not call this method outside of the class. Is the major 
purpose of the inner class to build a `HoodieTableMetaClient` object? If so, 
renaming it to a table meta client builder sounds more reasonable.

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
##
@@ -258,4 +260,142 @@ public String getArchivelogFolder() {
   public Properties getProperties() {
 return props;
   }
+
+  public static PropertyBuilder propertyBuilder() {
+return new PropertyBuilder();
+  }
+
+  public static class PropertyBuilder {
+
+private HoodieTableType tableType;
+
+private String tableName;
+
+private String archiveLogFolder;
+
+private String payloadClassName;
+
+private Integer timelineLayoutVersion;
+
+private String baseFileFormat;
+
+private String preCombineField;
+
+private String bootstrapIndexClass;
+
+private String bootstrapBasePath;
+
+private PropertyBuilder() {
+
+}
+
+public PropertyBuilder setTableType(HoodieTableType tableType) {
+  this.tableType = tableType;
+  return this;
+}
+
+public PropertyBuilder setTableType(String tableType) {
+  return setTableType(HoodieTableType.valueOf(tableType));
+}
+
+public PropertyBuilder setTableName(String tableName) {
+  this.tableName = tableName;
+  return this;
+}
+
+public PropertyBuilder setArchiveLogFolder(String archiveLogFolder) {
+  this.archiveLogFolder = archiveLogFolder;
+  

[GitHub] [hudi] nbalajee opened a new pull request #2607: [HUDI-1643] Hudi observability - framework to report stats from execu…

2021-02-25 Thread GitBox


nbalajee opened a new pull request #2607:
URL: https://github.com/apache/hudi/pull/2607


   …tors
   
   ## What is the purpose of the pull request
   Framework for collecting Hudi Observability stats from the executors.
   
   ## Brief change log
   
   - Using the distributed registry, report stats from the executors to the 
driver, to be published using the Graphite reporter (a generic sketch of this 
pattern appears after the checklist below).
   - Report Hudi Write stage performance stats.
   - Report Hudi BloomIndex stage stats.
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
 -  Added a unit testcase testObservabilityMetricsOnCOW
 -  Manually verified the change by running a job locally.
   
   ## Committer checklist
   
- [x] Has a corresponding JIRA in PR title & commit

- [x] Commit message is descriptive of the change

- [x] CI is green

- [x] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
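
   Not part of this PR, but as a generic illustration of the reporting pattern 
described in the change log above (executor-side counters aggregated on the 
driver before being handed to a metrics reporter), here is a hedged sketch 
using a plain Spark accumulator instead of Hudi's distributed registry:
   ```java
   import org.apache.spark.api.java.JavaSparkContext;
   import org.apache.spark.sql.SparkSession;
   import org.apache.spark.util.LongAccumulator;

   import java.util.Arrays;

   // Generic sketch only: NOT the PR's implementation. It shows the general idea of
   // bumping counters on executors and reading the aggregated value on the driver,
   // which could then be handed to a reporter such as Graphite.
   public class ExecutorStatsSketch {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .appName("executor-stats-sketch").master("local[2]").getOrCreate();
       JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

       LongAccumulator recordsWritten = jsc.sc().longAccumulator("records.written");

       jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5), 2).foreach(rec -> {
         // runs on executors; each task bumps the shared counter
         recordsWritten.add(1);
       });

       // back on the driver: value() reflects all executor-side updates
       System.out.println("records written = " + recordsWritten.value());
       spark.stop();
     }
   }
   ```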



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on a change in pull request #2541: [HUDI-1587] Add latency and freshness support

2021-02-25 Thread GitBox


xushiyan commented on a change in pull request #2541:
URL: https://github.com/apache/hudi/pull/2541#discussion_r583403910



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/DateTimeUtils.java
##
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+import java.time.Instant;
+import java.time.format.DateTimeParseException;
+
+public class DateTimeUtils {
+
+  /**
+   * Parse input String to a {@link java.time.Instant}.
+   * @param s Input String should be Epoch time in millisecond or ISO-8601 
format.
+   */
+  public static Instant parseDateTime(String s) throws DateTimeParseException {

Review comment:
   @vinothchandar I don't quite get it... did you mean putting it in another 
package, or moving the method to another util class?
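
   For context, a minimal sketch of what a parser accepting either epoch time in 
milliseconds or an ISO-8601 timestamp could look like; this is illustrative only 
and may differ from the PR's actual implementation:
   ```java
   import java.time.Instant;
   import java.time.format.DateTimeParseException;

   public class DateTimeUtilsSketch {

     // Accepts either epoch milliseconds (all digits) or an ISO-8601 instant
     // such as "2021-02-25T00:00:00Z".
     public static Instant parseDateTime(String s) throws DateTimeParseException {
       if (s.chars().allMatch(Character::isDigit)) {
         return Instant.ofEpochMilli(Long.parseLong(s));
       }
       return Instant.parse(s); // throws DateTimeParseException on malformed input
     }

     public static void main(String[] args) {
       System.out.println(parseDateTime("1614211200000"));        // epoch millis
       System.out.println(parseDateTime("2021-02-25T00:00:00Z")); // ISO-8601
     }
   }
   ```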





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch asf-site updated: Travis CI build asf-site

2021-02-25 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new c5d50f0  Travis CI build asf-site
c5d50f0 is described below

commit c5d50f09884f2336dd4d512a87213dcc2c5e57b8
Author: CI 
AuthorDate: Fri Feb 26 06:05:12 2021 +

Travis CI build asf-site
---
 content/docs/powered_by.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/powered_by.html b/content/docs/powered_by.html
index 44b80df..a92db54 100644
--- a/content/docs/powered_by.html
+++ b/content/docs/powered_by.html
@@ -483,7 +483,7 @@ December 2019, AWS re:Invent 2019, Las Vegas, NV, USA
 https://www.meetup.com/UberEvents/events/274924537/;>“Meetup 
talk by Nishith Agarwal” - Uber Data Platforms Meetup, Dec 2020
   
   
-https://www.slideshare.net/NishithAgarwal3/hudi-architecture-fundamentals-and-capabilities;>“Apache
 Hudi learning series: Understanding Hudi internals By Abhishek Modi, Balajee 
Nagasubramaniam, Prashant Wason, Satish Kotha, Nishith Agarwal” - Uber 
Meetup, Virtual, Feb 2021
+https://www.slideshare.net/NishithAgarwal3/hudi-architecture-fundamentals-and-capabilities;>“Apache
 Hudi learning series: Understanding Hudi internals” - By Abhishek Modi, 
Balajee Nagasubramaniam, Prashant Wason, Satish Kotha, Nishith Agarwal, Feb 
2021, Uber Meetup
   
 
 



[jira] [Created] (HUDI-1643) [Hudi Observability] Framework for reporting stats from executors

2021-02-25 Thread Balajee Nagasubramaniam (Jira)
Balajee Nagasubramaniam created HUDI-1643:
-

 Summary: [Hudi Observability] Framework for reporting stats from 
executors
 Key: HUDI-1643
 URL: https://issues.apache.org/jira/browse/HUDI-1643
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Common Core
Reporter: Balajee Nagasubramaniam


Hudi Observability framework to report stats from executors, using the 
distributed registry.
   - Hudi Write Stage Performance stats.
   - Bloom Index stage stats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[hudi] branch asf-site updated: [MINOR] Fixing slideshare link (#2606)

2021-02-25 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 1dd7a41  [MINOR] Fixing slideshare link (#2606)
1dd7a41 is described below

commit 1dd7a4164972a733bdb49e3e9f7fdbbddac571d3
Author: n3nash 
AuthorDate: Thu Feb 25 22:02:59 2021 -0800

[MINOR] Fixing slideshare link (#2606)
---
 docs/_docs/1_4_powered_by.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/_docs/1_4_powered_by.md b/docs/_docs/1_4_powered_by.md
index ada616e..9f6694b 100644
--- a/docs/_docs/1_4_powered_by.md
+++ b/docs/_docs/1_4_powered_by.md
@@ -145,7 +145,7 @@ Meanwhile, we build a set of data access standards based on 
Hudi, which provides
 
 21. ["Meetup talk by Nishith 
Agarwal"](https://www.meetup.com/UberEvents/events/274924537/) - Uber Data 
Platforms Meetup, Dec 2020
 
-22. ["Apache Hudi learning series: Understanding Hudi internals By Abhishek 
Modi, Balajee Nagasubramaniam, Prashant Wason, Satish Kotha, Nishith 
Agarwal"]("https://www.slideshare.net/NishithAgarwal3/hudi-architecture-fundamentals-and-capabilities;)
 - Uber Meetup, Virtual, Feb 2021
+22. ["Apache Hudi learning series: Understanding Hudi 
internals"](https://www.slideshare.net/NishithAgarwal3/hudi-architecture-fundamentals-and-capabilities)
 - By Abhishek Modi, Balajee Nagasubramaniam, Prashant Wason, Satish Kotha, 
Nishith Agarwal, Feb 2021, Uber Meetup
 
 ## Articles
 



[GitHub] [hudi] vinothchandar merged pull request #2606: [MINOR] Fixing slideshare link

2021-02-25 Thread GitBox


vinothchandar merged pull request #2606:
URL: https://github.com/apache/hudi/pull/2606


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on a change in pull request #2584: [Hudi-1583]: Fix bug that Hudi will skip remaining log files if there is logFile with zero size in logFileList when merge on read

2021-02-25 Thread GitBox


satishkotha commented on a change in pull request #2584:
URL: https://github.com/apache/hudi/pull/2584#discussion_r583398027



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatReader.java
##
@@ -104,7 +104,7 @@ public boolean hasNext() {
 throw new HoodieIOException("unable to initialize read with log file 
", io);
   }
   LOG.info("Moving to the next reader for logfile " + 
currentReader.getLogFile());
-  return this.currentReader.hasNext();
+  return this.currentReader.hasNext() || hasNext();

Review comment:
   can we just call hasNext() here to simplify? 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2596: [HUDI-1636] Support Builder Pattern To Build Table Properties For Hoo…

2021-02-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2596:
URL: https://github.com/apache/hudi/pull/2596#discussion_r583385132



##
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/TableCommand.java
##
@@ -106,10 +106,13 @@ public String createTable(
   throw new IllegalStateException("Table already existing in path : " + 
path);
 }
 
-final HoodieTableType tableType = HoodieTableType.valueOf(tableTypeStr);
-HoodieTableMetaClient.initTableType(HoodieCLI.conf, path, tableType, name, 
archiveFolder,
-payloadClass, layoutVersion);
-
+new HoodieTableConfig.PropertyBuilder()

Review comment:
   Hi @yanghua, I have updated the PR. Please take a look again when you 
have time. Thanks~





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on pull request #2606: [MINOR] Fixing slideshare link

2021-02-25 Thread GitBox


n3nash commented on pull request #2606:
URL: https://github.com/apache/hudi/pull/2606#issuecomment-786408716


   Verified locally that it works.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash opened a new pull request #2606: [MINOR] Fixing slideshare link

2021-02-25 Thread GitBox


n3nash opened a new pull request #2606:
URL: https://github.com/apache/hudi/pull/2606


   Fixing broken link
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on a change in pull request #2600: [HUDI-1638] Some improvements to BucketAssignFunction

2021-02-25 Thread GitBox


danny0405 commented on a change in pull request #2600:
URL: https://github.com/apache/hudi/pull/2600#discussion_r583369329



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/operator/partitioner/BucketAssignFunction.java
##
@@ -136,15 +137,10 @@ public void open(Configuration parameters) throws 
Exception {
 new SerializableConfiguration(this.hadoopConf),
 new FlinkTaskContextSupplier(getRuntimeContext()));
 this.bucketAssigner = new BucketAssigner(context, writeConfig);
-List allPartitionPaths = FSUtils.getAllPartitionPaths(this.context,
-this.conf.getString(FlinkOptions.PATH), false, false, false);
-final int parallelism = getRuntimeContext().getNumberOfParallelSubtasks();
-final int maxParallelism = 
getRuntimeContext().getMaxNumberOfParallelSubtasks();
-final int taskID = getRuntimeContext().getIndexOfThisSubtask();
-// reference: org.apache.flink.streaming.api.datastream.KeyedStream
-this.initialPartitionsToLoad = allPartitionPaths.stream()
-.filter(partition -> 
KeyGroupRangeAssignment.assignKeyToParallelOperator(partition, maxParallelism, 
parallelism) == taskID)
-.collect(Collectors.toList());
+
+// initialize and check the partitions load state
+loadInitialPartitions();
+checkPartitionsLoaded();

Review comment:
   Sure, feel free to promote it, thanks ~





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hk-lrzy commented on a change in pull request #2600: [HUDI-1638] Some improvements to BucketAssignFunction

2021-02-25 Thread GitBox


hk-lrzy commented on a change in pull request #2600:
URL: https://github.com/apache/hudi/pull/2600#discussion_r583366344



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/operator/partitioner/BucketAssignFunction.java
##
@@ -136,15 +137,10 @@ public void open(Configuration parameters) throws 
Exception {
 new SerializableConfiguration(this.hadoopConf),
 new FlinkTaskContextSupplier(getRuntimeContext()));
 this.bucketAssigner = new BucketAssigner(context, writeConfig);
-List allPartitionPaths = FSUtils.getAllPartitionPaths(this.context,
-this.conf.getString(FlinkOptions.PATH), false, false, false);
-final int parallelism = getRuntimeContext().getNumberOfParallelSubtasks();
-final int maxParallelism = 
getRuntimeContext().getMaxNumberOfParallelSubtasks();
-final int taskID = getRuntimeContext().getIndexOfThisSubtask();
-// reference: org.apache.flink.streaming.api.datastream.KeyedStream
-this.initialPartitionsToLoad = allPartitionPaths.stream()
-.filter(partition -> 
KeyGroupRangeAssignment.assignKeyToParallelOperator(partition, maxParallelism, 
parallelism) == taskID)
-.collect(Collectors.toList());
+
+// initialize and check the partitions load state
+loadInitialPartitions();
+checkPartitionsLoaded();

Review comment:
   I have some doubts about this: because the current key of the key context 
has not been set yet, Flink's keyed state cannot be accessed in the open 
method. Should we move this method to processElement? If possible, I can 
submit a patch. Thanks.
   
   ```
private void checkKeyNamespacePreconditions(K key, N namespace) {
Preconditions.checkNotNull(key, "No key set. This method should 
not be called outside of a keyed context.");
Preconditions.checkNotNull(namespace, "Provided namespace is 
null.");
}
   ```
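
   A hedged sketch of the lazy-initialization idea suggested above (hypothetical 
class and field names, not the actual `BucketAssignFunction`): defer the 
keyed-state access until the first element arrives, when the key context has 
been set by the framework:
   ```java
   import org.apache.flink.configuration.Configuration;
   import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
   import org.apache.flink.util.Collector;

   // Illustrative only: keyed state cannot be touched in open() because no key
   // context is set yet, so the one-time loading is deferred to processElement.
   public class LazyInitFunction extends KeyedProcessFunction<String, String, String> {

     private transient boolean initialized;

     @Override
     public void open(Configuration parameters) {
       // only non-keyed setup here (configs, assigners, ...)
       initialized = false;
     }

     @Override
     public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
       if (!initialized) {
         // the current key has been set by the framework, so keyed state is accessible
         loadInitialPartitions();
         initialized = true;
       }
       out.collect(value);
     }

     private void loadInitialPartitions() {
       // placeholder for the state-backed partition loading discussed in this PR
     }
   }
   ```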





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hk-lrzy closed pull request #2604: [hudi-1639][hudi-flink] fix BucketAssigner npe

2021-02-25 Thread GitBox


hk-lrzy closed pull request #2604:
URL: https://github.com/apache/hudi/pull/2604


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch asf-site updated: Travis CI build asf-site

2021-02-25 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 1f101fe  Travis CI build asf-site
1f101fe is described below

commit 1f101fe71756e5210bcffd63d5c3e243829ce1cd
Author: CI 
AuthorDate: Fri Feb 26 03:14:49 2021 +

Travis CI build asf-site
---
 content/docs/powered_by.html | 4 
 1 file changed, 4 insertions(+)

diff --git a/content/docs/powered_by.html b/content/docs/powered_by.html
index 3f88d68..44b80df 100644
--- a/content/docs/powered_by.html
+++ b/content/docs/powered_by.html
@@ -482,6 +482,9 @@ December 2019, AWS re:Invent 2019, Las Vegas, NV, USA
   
 https://www.meetup.com/UberEvents/events/274924537/;>“Meetup 
talk by Nishith Agarwal” - Uber Data Platforms Meetup, Dec 2020
   
+  
+https://www.slideshare.net/NishithAgarwal3/hudi-architecture-fundamentals-and-capabilities;>“Apache
 Hudi learning series: Understanding Hudi internals By Abhishek Modi, Balajee 
Nagasubramaniam, Prashant Wason, Satish Kotha, Nishith Agarwal” - Uber 
Meetup, Virtual, Feb 2021
+  
 
 
 Articles
@@ -505,6 +508,7 @@ December 2019, AWS re:Invent 2019, Las Vegas, NV, USA
   https://www.analyticsinsight.net/can-big-data-solutions-be-affordable/;>“Can
 Big Data Solutions Be Affordable?”
   https://www.alluxio.io/blog/building-high-performance-data-lake-using-apache-hudi-and-alluxio-at-t3go/;>“Building
 High-Performance Data Lake Using Apache Hudi and Alluxio at T3Go”
   https://towardsdatascience.com/data-lake-change-data-capture-cdc-using-apache-hudi-on-amazon-emr-part-2-process-65e4662d7b4b;>“Data
 Lake Change Capture using Apache Hudi  Amazon AMS/EMR Part 2”
+  https://eng.uber.com/apache-hudi-graduation/;>“Building a large 
scale transactional data lake at Uber using Apache Hudi” - Engineering Blog 
By Nishith Agarwal
 
 
 Powered by



[GitHub] [hudi] codejoyan commented on issue #2592: [SUPPORT] Does latest versions of Hudi (0.7.0, 0.6.0) work with Spark 2.3.0 when reading orc files?

2021-02-25 Thread GitBox


codejoyan commented on issue #2592:
URL: https://github.com/apache/hudi/issues/2592#issuecomment-786379571


   This is the Spark version of the cluster being used at work, so I will have 
to use Spark 2.3 until there is an upgrade. Since the documentation says 
**Hudi works with Spark-2.x**, I was wondering whether there is some 
workaround for this.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch asf-site updated: [HUDI 1642] Adding Hudi Learning series presentation & Uber eng blog links (#2602)

2021-02-25 Thread lamberken
This is an automated email from the ASF dual-hosted git repository.

lamberken pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new eeb146a  [HUDI 1642] Adding Hudi Learning series presentation & Uber 
eng blog links (#2602)
eeb146a is described below

commit eeb146a8fa3a368cd329de175e8c803c46116826
Author: n3nash 
AuthorDate: Thu Feb 25 19:06:06 2021 -0800

[HUDI 1642] Adding Hudi Learning series presentation & Uber eng blog links 
(#2602)
---
 docs/_docs/1_4_powered_by.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/_docs/1_4_powered_by.md b/docs/_docs/1_4_powered_by.md
index 6bac956..ada616e 100644
--- a/docs/_docs/1_4_powered_by.md
+++ b/docs/_docs/1_4_powered_by.md
@@ -145,6 +145,8 @@ Meanwhile, we build a set of data access standards based on 
Hudi, which provides
 
 21. ["Meetup talk by Nishith 
Agarwal"](https://www.meetup.com/UberEvents/events/274924537/) - Uber Data 
Platforms Meetup, Dec 2020
 
+22. ["Apache Hudi learning series: Understanding Hudi internals By Abhishek 
Modi, Balajee Nagasubramaniam, Prashant Wason, Satish Kotha, Nishith 
Agarwal"]("https://www.slideshare.net/NishithAgarwal3/hudi-architecture-fundamentals-and-capabilities;)
 - Uber Meetup, Virtual, Feb 2021
+
 ## Articles
 
 You can check out [our blog pages](https://hudi.apache.org/blog.html) for 
content written by our committers/contributors.
@@ -165,6 +167,7 @@ You can check out [our blog 
pages](https://hudi.apache.org/blog.html) for conten
 14. ["Can Big Data Solutions Be 
Affordable?"](https://www.analyticsinsight.net/can-big-data-solutions-be-affordable/)
 15. ["Building High-Performance Data Lake Using Apache Hudi and Alluxio at 
T3Go"](https://www.alluxio.io/blog/building-high-performance-data-lake-using-apache-hudi-and-alluxio-at-t3go/)
 16. ["Data Lake Change Capture using Apache Hudi & Amazon AMS/EMR Part 
2"](https://towardsdatascience.com/data-lake-change-data-capture-cdc-using-apache-hudi-on-amazon-emr-part-2-process-65e4662d7b4b)
+17. ["Building a large scale transactional data lake at Uber using Apache 
Hudi"](https://eng.uber.com/apache-hudi-graduation/) - Engineering Blog By 
Nishith Agarwal
 
 ## Powered by
 



[GitHub] [hudi] lamber-ken merged pull request #2602: [HUDI 1642] Adding Hudi Learning series presentation & Uber eng blog links

2021-02-25 Thread GitBox


lamber-ken merged pull request #2602:
URL: https://github.com/apache/hudi/pull/2602


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lamber-ken edited a comment on pull request #2602: [HUDI 1642] Adding Hudi Learning series presentation & Uber eng blog links

2021-02-25 Thread GitBox


lamber-ken edited a comment on pull request #2602:
URL: https://github.com/apache/hudi/pull/2602#issuecomment-786378892


   Thanks @n3nash  LGTM



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lamber-ken commented on pull request #2602: [HUDI 1642] Adding Hudi Learning series presentation & Uber eng blog links

2021-02-25 Thread GitBox


lamber-ken commented on pull request #2602:
URL: https://github.com/apache/hudi/pull/2602#issuecomment-786378892


   Thanks @n3nash  



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lamber-ken commented on a change in pull request #2602: [HUDI 1642] Adding Hudi Learning series presentation & Uber eng blog links

2021-02-25 Thread GitBox


lamber-ken commented on a change in pull request #2602:
URL: https://github.com/apache/hudi/pull/2602#discussion_r583352619



##
File path: content/docs/0.5.3-powered_by.html
##
@@ -462,13 +462,17 @@ Talks  
Presentations
   
 https://drive.google.com/open?id=1Pk_WdFxfEZxMMfAOn0R8-m3ALkcN6G9e;>“Building
 a near real-time, high-performance data warehouse based on Apache Hudi and 
Apache Kylin” - By ShaoFeng Shi March 2020, Apache Hudi  Apache Kylin 
Online Meetup, China
   
+  
+https://drive.google.com/file/d/1K-WsQAQf-C96pLRTpM4Ha1v7wRf1pZ9S/view?usp=sharing;>“Apache
 Hudi learning series: Understanding Hudi internals” - By Abhishek Modi, 
Balajee Nagasubramaniam, Prashant Wason, Satish Kotha, Nishith Agarwal Feb 
2021, Uber Meetup, Virtual

Review comment:
    





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2596: [HUDI-1636] Support Builder Pattern To Build Table Properties For Hoo…

2021-02-25 Thread GitBox


codecov-io edited a comment on pull request #2596:
URL: https://github.com/apache/hudi/pull/2596#issuecomment-784717451


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2596?src=pr=h1) Report
   > Merging 
[#2596](https://codecov.io/gh/apache/hudi/pull/2596?src=pr=desc) (c71fe74) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/43a0776c7c88a5f7beac6c8853db7e341810635a?el=desc)
 (43a0776) will **increase** coverage by `18.41%`.
   > The diff coverage is `69.56%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2596/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2596?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master     #2596       +/-   ##
   ===============================================
   + Coverage     51.14%    69.55%    +18.41%
   + Complexity     3215       363      -2852
   ===============================================
     Files           438        53       -385
     Lines         20041      1961     -18080
     Branches       2064       235      -1829
   ===============================================
   - Hits          10250      1364      -8886
   + Misses         8946       463      -8483
   + Partials        845       134       -711
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.55% <69.56%> (+0.09%)` | `0.00 <0.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2596?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `70.00% <50.00%> (ø)` | `52.00 <0.00> (+2.00)` | |
   | 
[...udi/utilities/deltastreamer/BootstrapExecutor.java](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvQm9vdHN0cmFwRXhlY3V0b3IuamF2YQ==)
 | `82.35% <100.00%> (+2.80%)` | `6.00 <0.00> (ø)` | |
   | 
[...hudi/utilities/sources/helpers/KafkaOffsetGen.java](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvaGVscGVycy9LYWZrYU9mZnNldEdlbi5qYXZh)
 | `85.84% <0.00%> (-2.94%)` | `20.00% <0.00%> (+4.00%)` | :arrow_down: |
   | 
[...apache/hudi/timeline/service/handlers/Handler.java](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3RpbWVsaW5lL3NlcnZpY2UvaGFuZGxlcnMvSGFuZGxlci5qYXZh)
 | | | |
   | 
[.../spark/sql/hudi/streaming/HoodieStreamSource.scala](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9zcGFyay9zcWwvaHVkaS9zdHJlYW1pbmcvSG9vZGllU3RyZWFtU291cmNlLnNjYWxh)
 | | | |
   | 
[.../java/org/apache/hudi/common/util/HoodieTimer.java](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvSG9vZGllVGltZXIuamF2YQ==)
 | | | |
   | 
[...i/common/util/collection/ExternalSpillableMap.java](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvY29sbGVjdGlvbi9FeHRlcm5hbFNwaWxsYWJsZU1hcC5qYXZh)
 | | | |
   | 
[...g/apache/hudi/common/table/log/LogReaderUtils.java](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Mb2dSZWFkZXJVdGlscy5qYXZh)
 | | | |
   | 
[...n/java/org/apache/hudi/common/HoodieCleanStat.java](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL0hvb2RpZUNsZWFuU3RhdC5qYXZh)
 | | | |
   | 
[...del/OverwriteNonDefaultsWithLatestAvroPayload.java](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL092ZXJ3cml0ZU5vbkRlZmF1bHRzV2l0aExhdGVzdEF2cm9QYXlsb2FkLmphdmE=)
 | | | |
   | ... and [374 
more](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree-more) | |
   



This is an automated message from the Apache Git 

[GitHub] [hudi] garyli1019 commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]

2021-02-25 Thread GitBox


garyli1019 commented on issue #2498:
URL: https://github.com/apache/hudi/issues/2498#issuecomment-786364265


   I am seeing the same problem when the compiled Spark distribution is 
different from the runtime Spark distribution. Compiling the Hudi jar against 
the runtime Spark distribution should fix this problem. @green2k @andormarkus 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on pull request #2584: [Hudi-1583]: Fix bug that Hudi will skip remaining log files if there is logFile with zero size in logFileList when merge on read.

2021-02-25 Thread GitBox


garyli1019 commented on pull request #2584:
URL: https://github.com/apache/hudi/pull/2584#issuecomment-786361769


   hi @satishkotha , this PR seems related to #2583 , would you take a look?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2596: [HUDI-1636] Support Builder Pattern To Build Table Properties For Hoo…

2021-02-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2596:
URL: https://github.com/apache/hudi/pull/2596#discussion_r583334728



##
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/TableCommand.java
##
@@ -106,10 +106,13 @@ public String createTable(
   throw new IllegalStateException("Table already existing in path : " + 
path);
 }
 
-final HoodieTableType tableType = HoodieTableType.valueOf(tableTypeStr);
-HoodieTableMetaClient.initTableType(HoodieCLI.conf, path, tableType, name, 
archiveFolder,
-payloadClass, layoutVersion);
-
+new HoodieTableConfig.PropertyBuilder()

Review comment:
   Sounds good! I will give it a try.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #2596: [HUDI-1636] Support Builder Pattern To Build Table Properties For Hoo…

2021-02-25 Thread GitBox


yanghua commented on a change in pull request #2596:
URL: https://github.com/apache/hudi/pull/2596#discussion_r583313615



##
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/TableCommand.java
##
@@ -106,10 +106,13 @@ public String createTable(
   throw new IllegalStateException("Table already existing in path : " + 
path);
 }
 
-final HoodieTableType tableType = HoodieTableType.valueOf(tableTypeStr);
-HoodieTableMetaClient.initTableType(HoodieCLI.conf, path, tableType, name, 
archiveFolder,
-payloadClass, layoutVersion);
-
+new HoodieTableConfig.PropertyBuilder()

Review comment:
   WDYT about adding a static method named e.g. 
`HoodieTableConfig.propertyBuilder()` so that we can make the code more 
readable?
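
   For illustration, a call site using such a factory could then read roughly 
like this (setter names taken from the `PropertyBuilder` shown earlier in this 
thread; the argument values are hypothetical):
   ```java
   // Illustrative only: builds the table properties fluently via the suggested
   // static factory; the values below are placeholders.
   Properties tableProps = HoodieTableConfig.propertyBuilder()
       .setTableType("MERGE_ON_READ")
       .setTableName("my_table")
       .setArchiveLogFolder("archived")
       .setPayloadClassName("org.apache.hudi.common.model.HoodieAvroPayload")
       .setTimelineLayoutVersion(1)
       .build();
   ```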





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash merged pull request #2565: [HUDI-1611] Added a configuration to allow specific directories to be filtered out during Metadata Table bootstrap.

2021-02-25 Thread GitBox


n3nash merged pull request #2565:
URL: https://github.com/apache/hudi/pull/2565


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [HUDI-1611] Added a configuration to allow specific directories to be filtered out during Metadata Table bootstrap. (#2565)

2021-02-25 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 022df0d  [HUDI-1611] Added a configuration to allow specific 
directories to be filtered out during Metadata Table bootstrap. (#2565)
022df0d is described below

commit 022df0d1b134422f7b6f305cd7ec04b25caa23f0
Author: Prashant Wason 
AuthorDate: Thu Feb 25 16:52:28 2021 -0800

[HUDI-1611] Added a configuration to allow specific directories to be 
filtered out during Metadata Table bootstrap. (#2565)
---
 .../metadata/HoodieBackedTableMetadataWriter.java |  6 ++
 .../hudi/metadata/TestHoodieBackedMetadata.java   | 19 +--
 .../hudi/common/config/HoodieMetadataConfig.java  | 15 +++
 3 files changed, 38 insertions(+), 2 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index 003ec7d..5aae7b7 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -318,6 +318,7 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableMeta
 Map> partitionToFileStatus = new HashMap<>();
 final int fileListingParallelism = 
metadataWriteConfig.getFileListingParallelism();
 SerializableConfiguration conf = new 
SerializableConfiguration(datasetMetaClient.getHadoopConf());
+final String dirFilterRegex = 
datasetWriteConfig.getMetadataConfig().getDirectoryFilterRegex();
 
 while (!pathsToList.isEmpty()) {
   int listingParallelism = Math.min(fileListingParallelism, 
pathsToList.size());
@@ -331,6 +332,11 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableMeta
   // If the listing reveals a directory, add it to queue. If the listing 
reveals a hoodie partition, add it to
   // the results.
   dirToFileListing.forEach(p -> {
+if (!dirFilterRegex.isEmpty() && 
p.getLeft().getName().matches(dirFilterRegex)) {
+  LOG.info("Ignoring directory " + p.getLeft() + " which matches the 
filter regex " + dirFilterRegex);
+  return;
+}
+
 List filesInDir = Arrays.stream(p.getRight()).parallel()
 .filter(fs -> 
!fs.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE))
 .collect(Collectors.toList());
diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/metadata/TestHoodieBackedMetadata.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/metadata/TestHoodieBackedMetadata.java
index 3697ec1..4fa0bc8 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/metadata/TestHoodieBackedMetadata.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/metadata/TestHoodieBackedMetadata.java
@@ -148,14 +148,22 @@ public class TestHoodieBackedMetadata extends 
HoodieClientTestHarness {
 final String nonPartitionDirectory = 
HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS[0] + "-nonpartition";
 Files.createDirectories(Paths.get(basePath, nonPartitionDirectory));
 
+// Three directories which are partitions but will be ignored due to filter
+final String filterDirRegex = ".*-filterDir\\d|\\..*";
+final String filteredDirectoryOne = 
HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS[0] + "-filterDir1";
+final String filteredDirectoryTwo = 
HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS[0] + "-filterDir2";
+final String filteredDirectoryThree = ".backups";
+
 // Create some commits
 HoodieTestTable testTable = HoodieTestTable.of(metaClient);
-testTable.withPartitionMetaFiles("p1", "p2")
+testTable.withPartitionMetaFiles("p1", "p2", filteredDirectoryOne, 
filteredDirectoryTwo, filteredDirectoryThree)
 .addCommit("001").withBaseFilesInPartition("p1", 
10).withBaseFilesInPartition("p2", 10, 10)
 .addCommit("002").withBaseFilesInPartition("p1", 
10).withBaseFilesInPartition("p2", 10, 10, 10)
 .addInflightCommit("003").withBaseFilesInPartition("p1", 
10).withBaseFilesInPartition("p2", 10);
 
-try (SparkRDDWriteClient client = new SparkRDDWriteClient(engineContext, 
getWriteConfig(true, true))) {
+final HoodieWriteConfig writeConfig = getWriteConfigBuilder(true, true, 
false)
+
.withMetadataConfig(HoodieMetadataConfig.newBuilder().enable(true).withDirectoryFilterRegex(filterDirRegex).build()).build();
+try (SparkRDDWriteClient client = new SparkRDDWriteClient(engineContext, 
writeConfig)) {
   client.startCommitWithTime("005");
 
   List partitions = 

[hudi] branch master updated: Fixing README for hudi test suite long running job (#2578)

2021-02-25 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 9f5e8cc  Fixing README for hudi test suite long running job (#2578)
9f5e8cc is described below

commit 9f5e8cc7c3789fd3658f52f960fb5cbc8e4efce9
Author: Sivabalan Narayanan 
AuthorDate: Thu Feb 25 19:50:18 2021 -0500

Fixing README for hudi test suite long running job (#2578)
---
 hudi-integ-test/README.md | 92 +--
 1 file changed, 58 insertions(+), 34 deletions(-)

diff --git a/hudi-integ-test/README.md b/hudi-integ-test/README.md
index ff64ed1..06de263 100644
--- a/hudi-integ-test/README.md
+++ b/hudi-integ-test/README.md
@@ -270,20 +270,31 @@ spark-submit \
 --compact-scheduling-minshare 1
 ``` 
 
-For long running test suite, validation has to be done differently. Idea is to 
run same dag in a repeated manner. 
-Hence "ValidateDatasetNode" is introduced which will read entire input data 
and compare it with hudi contents both via 
-spark datasource and hive table via spark sql engine.
+## Running long running test suite in Local Docker environment
+
+For long running test suite, validation has to be done differently. Idea is to 
run same dag in a repeated manner for 
+N iterations. Hence "ValidateDatasetNode" is introduced which will read entire 
input data and compare it with hudi 
+contents both via spark datasource and hive table via spark sql engine. Hive 
validation is configurable. 
 
 If you have "ValidateDatasetNode" in your dag, do not replace hive jars as 
instructed above. Spark sql engine does not 
-go well w/ hive2* jars. So, after running docker setup, just copy 
test.properties and your dag of interest and you are 
-good to go ahead. 
+go well w/ hive2* jars. So, after running docker setup, follow the below 
steps. 
+```
+docker cp 
packaging/hudi-integ-test-bundle/target/hudi-integ-test-bundle-0.8.0-SNAPSHOT.jar
 adhoc-2:/opt/
+docker cp demo/config/test-suite/test.properties adhoc-2:/opt/
+```
+Also copy your dag of interest to adhoc-2:/opt/
+```
+docker cp demo/config/test-suite/complex-dag-cow.yaml adhoc-2:/opt/
+```
 
 For repeated runs, two additional configs need to be set. "dag_rounds" and 
"dag_intermittent_delay_mins". 
-This means that your dag will be repeated for N times w/ a delay of Y mins 
between each round.
+This means that your dag will be repeated for N times w/ a delay of Y mins 
between each round. Note: complex-dag-cow.yaml
+already has all these configs set. So no changes required just to try it out. 
+
+Also, ValidateDatasetNode can be configured in two ways. Either with 
"delete_input_data" set to true or without 
+setting the config. When "delete_input_data" is set for ValidateDatasetNode, 
once validation is complete, entire input 
+data will be deleted. So, suggestion is to use this ValidateDatasetNode as the 
last node in the dag with "delete_input_data". 
 
-Also, ValidateDatasetNode can be configured in two ways. Either with 
"delete_input_data: true" set or not set. 
-When "delete_input_data" is set for ValidateDatasetNode, once validation is 
complete, entire input data will be deleted. 
-So, suggestion is to use this ValidateDatasetNode as the last node in the dag 
with "delete_input_data". 
 Example dag: 
 ```
  Insert
@@ -294,7 +305,7 @@ Example dag:
 If above dag is run with "dag_rounds" = 10 and "dag_intermittent_delay_mins" = 
10, then this dag will run for 10 times 
 with 10 mins delay between every run. At the end of every run, records written 
as part of this round will be validated. 
 At the end of each validation, all contents of input are deleted.   
-For eg: incase of above dag, 
+To illustrate each round 
 ```
 Round1: 
 insert => inputPath/batch1
@@ -323,6 +334,12 @@ every cycle.
 
 Lets see an example where you don't set "delete_input_data" as part of 
Validation. 
 ```
+ Insert
+ Upsert
+ ValidateDatasetNode 
+```
+Here is the illustration of each round
+```
 Round1: 
 insert => inputPath/batch1
 upsert -> inputPath/batch2
@@ -382,27 +399,14 @@ Above dag was just an example for illustration purposes. 
But you can make it com
 Validate w/o deleting
 Upsert
 Validate w/ deletion
-```
-With this dag, you can set the two additional configs "dag_rounds" and 
"dag_intermittent_delay_mins" and have a long 
-running test suite. 
-
 ```
-dag_rounds: 1
-dag_intermittent_delay_mins: 10
-dag_content:
-Insert
-Upsert
-Delete
-Validate w/o deleting
-Insert
-Rollback
-Validate w/o deleting
-Upsert
-Validate w/ deletion
 
+Once you have copied the jar, test.properties and your dag to adhoc-2:/opt/, 
you can run the following command to execute 
+the test suite job. 
 ```
-
-Sample COW command with repeated runs. 
+docker exec -it adhoc-2 /bin/bash
+```
+Sample 

[GitHub] [hudi] n3nash merged pull request #2578: [MINOR] Fixing Hudi Test suite readme for long running job

2021-02-25 Thread GitBox


n3nash merged pull request #2578:
URL: https://github.com/apache/hudi/pull/2578


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1642) Add Links to Uber engineering blog and meet up slides

2021-02-25 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-1642:
-

 Summary: Add Links to Uber engineering blog and meet up slides
 Key: HUDI-1642
 URL: https://issues.apache.org/jira/browse/HUDI-1642
 Project: Apache Hudi
  Issue Type: Task
  Components: Docs
Reporter: Nishith Agarwal
Assignee: Nishith Agarwal






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] n3nash commented on a change in pull request #2602: Adding Hudi Learning series presentation & Uber eng blog links

2021-02-25 Thread GitBox


n3nash commented on a change in pull request #2602:
URL: https://github.com/apache/hudi/pull/2602#discussion_r583286829



##
File path: content/docs/0.5.3-powered_by.html
##
@@ -462,13 +462,17 @@ Talks  
Presentations
   
 https://drive.google.com/open?id=1Pk_WdFxfEZxMMfAOn0R8-m3ALkcN6G9e;>“Building
 a near real-time, high-performance data warehouse based on Apache Hudi and 
Apache Kylin” - By ShaoFeng Shi March 2020, Apache Hudi  Apache Kylin 
Online Meetup, China
   
+  
+https://drive.google.com/file/d/1K-WsQAQf-C96pLRTpM4Ha1v7wRf1pZ9S/view?usp=sharing;>“Apache
 Hudi learning series: Understanding Hudi internals” - By Abhishek Modi, 
Balajee Nagasubramaniam, Prashant Wason, Satish Kotha, Nishith Agarwal Feb 
2021, Uber Meetup, Virtual

Review comment:
   Thanks @lamber-ken for pointing that out. I've made those changes; can you take a look?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] afeldman1 closed issue #2399: [SUPPORT] Hudi deletes not being properly commited

2021-02-25 Thread GitBox


afeldman1 closed issue #2399:
URL: https://github.com/apache/hudi/issues/2399


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] afeldman1 commented on issue #2399: [SUPPORT] Hudi deletes not being properly commited

2021-02-25 Thread GitBox


afeldman1 commented on issue #2399:
URL: https://github.com/apache/hudi/issues/2399#issuecomment-786204481


   Apologies for the delayed response, and thank you to @bvaradar for the initial hint. The issue turned out to be caused not by the keys but by another one of the configuration properties. When the table was originally created, a field was specified for the "hoodie.datasource.write.precombine.field" property. When doing the delete operation, it did not seem to make sense to have to specify this field, so I left it out of the config. It seems that if the table was created with this property set, all future operations on it are required to specify the same value for it. I'm not sure this requirement necessarily makes sense?
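   
   Purely as an illustration (this is not the exact job from this issue; the table path, table name, key field and precombine field below are made-up placeholders), a delete write that re-supplies the same precombine field the table was created with could look roughly like this:
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   import org.apache.spark.sql.SparkSession;
   
   public class HudiDeleteExample {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .appName("hudi-delete-example")
           .master("local[2]")
           .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
           .getOrCreate();
   
       String basePath = "s3://my-bucket/path/to/table"; // placeholder
   
       // Rows whose record keys should be deleted (selection logic is illustrative).
       Dataset<Row> keysToDelete = spark.read().format("hudi")
           .load(basePath)
           .where("to_delete = true");
   
       keysToDelete.write().format("hudi")
           .option("hoodie.table.name", "my_table")
           .option("hoodie.datasource.write.operation", "delete")
           .option("hoodie.datasource.write.recordkey.field", "id")
           // Same value the table was created with; leaving it out triggered the
           // behaviour described above.
           .option("hoodie.datasource.write.precombine.field", "updated_at")
           .mode(SaveMode.Append)
           .save(basePath);
   
       spark.stop();
     }
   }
   ```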
   
   In terms of @nsivabalan's questions:
   This was not related to hudi versions.
   This was found before releasing this piece to production, but seems to 
always act the same way.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] kpurella commented on issue #2240: [SUPPORT] Performance Issue : HUDI MOR ,UPSERT Job running forever

2021-02-25 Thread GitBox


kpurella commented on issue #2240:
URL: https://github.com/apache/hudi/issues/2240#issuecomment-786183675


   @vinothchandar Sure, I will!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] toninis commented on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2021-02-25 Thread GitBox


toninis commented on issue #2149:
URL: https://github.com/apache/hudi/issues/2149#issuecomment-786183618


   @vinothchandar 
   Sorry I took so long to respond. It worked and compiled successfully; I had probably missed something at the time.
   
   Thanks for your response at the time.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1641) Issue for Integrating Hudi with Kafka using Avro Schema

2021-02-25 Thread PRASHANT BHOSALE (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PRASHANT BHOSALE updated HUDI-1641:
---
Description: 
I am trying to integrate Hudi with a Kafka topic.
Steps followed:
 # Created a Kafka topic in Confluent with the schema defined in the schema registry.
 # Using kafka-avro-console-producer, I am trying to produce data.
 # Running Hudi Delta Streamer in continuous mode to consume the data.

Infrastructure :
 # AWS EMR
 # Spark 2.4.4
 # Hudi Utility ( Tried with 0.6.0 and 0.7.0 )
 # Avro ( Tried avro-1.8.2, avro-1.9.2 and avro-1.10.0 )


{code:java}
21/02/24 13:02:08 ERROR TaskResultGetter: Exception while getting task result
org.apache.spark.SparkException: Error reading attempting to read avro data -- 
encountered an unknown fingerprint: 103427103938146401, not sure what schema to 
use.  This could happen if you registered additional schemas after starting 
your spark context.
at 
org.apache.spark.serializer.GenericAvroSerializer$$anonfun$4.apply(GenericAvroSerializer.scala:141)
at 
org.apache.spark.serializer.GenericAvroSerializer$$anonfun$4.apply(GenericAvroSerializer.scala:138)
at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:79)
at 
org.apache.spark.serializer.GenericAvroSerializer.deserializeDatum(GenericAvroSerializer.scala:137)
at 
org.apache.spark.serializer.GenericAvroSerializer.read(GenericAvroSerializer.scala:162)
at 
org.apache.spark.serializer.GenericAvroSerializer.read(GenericAvroSerializer.scala:47)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:391)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:302)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
at 
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:371)
at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:88)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:72)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
21/02/24 13:02:08 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all 
completed, from pool
21/02/24 13:02:08 INFO YarnScheduler: Cancelling stage 1
21/02/24 13:02:08 INFO YarnScheduler: Killing all running tasks in stage 1: 
Stage cancelled
21/02/24 13:02:08 INFO DAGScheduler: ResultStage 1 (isEmpty at 
DeltaSync.java:380) failed in 1.415 s due to Job aborted due to stage failure: 
Exception while getting task result: org.apache.spark.SparkException: Error 
reading attempting to read avro data -- encountered an unknown fingerprint: 
103427103938146401, not sure what schema to use.  This could happen if you 
registered additional schemas after starting your spark context.
21/02/24 13:02:08 INFO DAGScheduler: Job 5 failed: isEmpty at 
DeltaSync.java:380, took 1.422265 s
21/02/24 13:02:08 ERROR HoodieDeltaStreamer: Shutting down delta-sync due to 
exception
org.apache.spark.SparkException: Job aborted due to stage failure: Exception 
while getting task result: org.apache.spark.SparkException: Error reading 
attempting to read avro data -- encountered an unknown fingerprint: 
103427103938146401, not sure what schema to use.  This could happen if you 
registered additional schemas after starting your spark context.
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
at scala.Option.foreach(Option.scala:257)
at 

[GitHub] [hudi] t0il3ts0ap edited a comment on issue #2589: [SUPPORT] Issue with adding column while running deltastreamer with kafka source.

2021-02-25 Thread GitBox


t0il3ts0ap edited a comment on issue #2589:
URL: https://github.com/apache/hudi/issues/2589#issuecomment-786029402


   @satishkotha Ran again on a fresh table, still the same issue. 
   
   SparkSubmit:
   ```
   spark-submit 
   --master yarn 
   --packages 
org.apache.spark:spark-avro_2.12:3.0.1,org.apache.hudi:hudi-utilities-bundle_2.12:0.7.0
 
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
   --conf spark.scheduler.mode=FAIR 
   --conf spark.task.maxFailures=5 
   --conf spark.rdd.compress=true 
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
   --conf spark.shuffle.service.enabled=true 
   --conf spark.sql.hive.convertMetastoreParquet=false 
   --conf spark.executor.heartbeatInterval=120s 
   --conf spark.network.timeout=600s 
   --conf spark.yarn.max.executor.failures=5 
   --conf spark.sql.catalogImplementation=hive 
   --conf 
"spark.driver.extraJavaOptions=-Dlog4j.configuration=/home/hadoop/log4j-spark.properties"
 
   --conf 
"spark.executor.extraJavaOptions=-Dlog4j.configuration=/home/hadoop/log4j-spark.properties"
 
   --deploy-mode client 
s3://navi-emr-poc/delta-streamer-test/jars/deltastreamer-addons-1.0-SNAPSHOT.jar
 
   --enable-sync 
   --hoodie-conf 
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
 
   --hoodie-conf hoodie.parquet.compression.codec=snappy 
   --hoodie-conf 
partition.assignment.strategy=org.apache.kafka.clients.consumer.RangeAssignor 
   --hoodie-conf 
hoodie.deltastreamer.schemaprovider.spark_avro_post_processor.enable=false 
   --hoodie-conf auto.offset.reset=latest 
   --hoodie-conf hoodie.avro.schema.validate=true
   --table-type MERGE_ON_READ 
   --source-class org.apache.hudi.utilities.sources.AvroKafkaSource 
   --schemaprovider-class 
org.apache.hudi.utilities.schema.NullTargetSchemaRegistryProvider  
   --props 
s3://navi-emr-poc/delta-streamer-test/config/kafka-source-nonprod.properties 
   --transformer-class com.navi.transform.DebeziumTransformer 
   --continuous 
   --hoodie-conf 
hoodie.deltastreamer.schemaprovider.registry.url=https://dev-dataplatform-schema-registry.np.navi-tech.in/subjects/to_be_deleted_service.public.accounts-value/versions/latest
 
   --hoodie-conf hoodie.datasource.hive_sync.database=to_be_deleted_service 
   --hoodie-conf hoodie.datasource.hive_sync.table=accounts 
   --hoodie-conf hoodie.datasource.write.recordkey.field=id 
   --hoodie-conf hoodie.datasource.write.precombine.field=__lsn 
   --hoodie-conf hoodie.datasource.write.partitionpath.field='' 
   --hoodie-conf 
hoodie.deltastreamer.source.kafka.topic=to_be_deleted_service.public.accounts 
   --hoodie-conf group.id=delta-streamer-to_be_deleted_service-accounts 
   --source-ordering-field __lsn 
   --target-base-path s3://navi-emr-poc/raw-data/to_be_deleted_service/accounts 
   --target-table accounts
   ```
   
   Transformer Code:
   ```
   public class DebeziumTransformer implements Transformer {
   
   public Dataset<Row> apply(JavaSparkContext javaSparkContext, 
SparkSession sparkSession,
   Dataset<Row> dataset, TypedProperties typedProperties) {
   
   Dataset<Row> transformedDataset = dataset
   .withColumn("__deleted", 
dataset.col("__deleted").cast(DataTypes.BooleanType))
   .withColumnRenamed("__deleted", "_hoodie_is_deleted")
   .drop("__op", "__source_ts_ms");
   
   log.info("TRANSFORMER SCHEMA STARTS");
   transformedDataset.printSchema();
   transformedDataset.show();
   log.info("TRANSFORMER SCHEMA ENDS");
   return transformedDataset;
   }
   }
   ```
   
   When I add the column, debezium updates the schema registry instantaneously and new records start flowing. It's possible that deltastreamer gets the new-schema records before even hitting the schema registry.
   
   Attaching logs:
   [alog.txt](https://github.com/apache/hudi/files/6044367/alog.txt)
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] t0il3ts0ap commented on issue #2589: [SUPPORT] Issue with adding column while running deltastreamer with kafka source.

2021-02-25 Thread GitBox


t0il3ts0ap commented on issue #2589:
URL: https://github.com/apache/hudi/issues/2589#issuecomment-786029402


   Ran again on a fresh table, still the same issue. 
   
   SparkSubmit:
   ```
   spark-submit 
   --master yarn 
   --packages 
org.apache.spark:spark-avro_2.12:3.0.1,org.apache.hudi:hudi-utilities-bundle_2.12:0.7.0
 
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
   --conf spark.scheduler.mode=FAIR 
   --conf spark.task.maxFailures=5 
   --conf spark.rdd.compress=true 
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
   --conf spark.shuffle.service.enabled=true 
   --conf spark.sql.hive.convertMetastoreParquet=false 
   --conf spark.executor.heartbeatInterval=120s 
   --conf spark.network.timeout=600s 
   --conf spark.yarn.max.executor.failures=5 
   --conf spark.sql.catalogImplementation=hive 
   --conf 
"spark.driver.extraJavaOptions=-Dlog4j.configuration=/home/hadoop/log4j-spark.properties"
 
   --conf 
"spark.executor.extraJavaOptions=-Dlog4j.configuration=/home/hadoop/log4j-spark.properties"
 
   --deploy-mode client 
s3://navi-emr-poc/delta-streamer-test/jars/deltastreamer-addons-1.0-SNAPSHOT.jar
 
   --enable-sync 
   --hoodie-conf 
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
 
   --hoodie-conf hoodie.parquet.compression.codec=snappy 
   --hoodie-conf 
partition.assignment.strategy=org.apache.kafka.clients.consumer.RangeAssignor 
   --hoodie-conf 
hoodie.deltastreamer.schemaprovider.spark_avro_post_processor.enable=false 
   --hoodie-conf auto.offset.reset=latest 
   --hoodie-conf hoodie.avro.schema.validate=true
   --table-type MERGE_ON_READ 
   --source-class org.apache.hudi.utilities.sources.AvroKafkaSource 
   --schemaprovider-class 
org.apache.hudi.utilities.schema.NullTargetSchemaRegistryProvider  
   --props 
s3://navi-emr-poc/delta-streamer-test/config/kafka-source-nonprod.properties 
   --transformer-class com.navi.transform.DebeziumTransformer 
   --continuous 
   --hoodie-conf 
hoodie.deltastreamer.schemaprovider.registry.url=https://dev-dataplatform-schema-registry.np.navi-tech.in/subjects/to_be_deleted_service.public.accounts-value/versions/latest
 
   --hoodie-conf hoodie.datasource.hive_sync.database=to_be_deleted_service 
   --hoodie-conf hoodie.datasource.hive_sync.table=accounts 
   --hoodie-conf hoodie.datasource.write.recordkey.field=id 
   --hoodie-conf hoodie.datasource.write.precombine.field=__lsn 
   --hoodie-conf hoodie.datasource.write.partitionpath.field='' 
   --hoodie-conf 
hoodie.deltastreamer.source.kafka.topic=to_be_deleted_service.public.accounts 
   --hoodie-conf group.id=delta-streamer-to_be_deleted_service-accounts 
   --source-ordering-field __lsn 
   --target-base-path s3://navi-emr-poc/raw-data/to_be_deleted_service/accounts 
   --target-table accounts
   ```
   
   Transformer Code:
   ```
   public class DebeziumTransformer implements Transformer {
   
   public Dataset<Row> apply(JavaSparkContext javaSparkContext, 
SparkSession sparkSession,
   Dataset<Row> dataset, TypedProperties typedProperties) {
   
   Dataset<Row> transformedDataset = dataset
   .withColumn("__deleted", 
dataset.col("__deleted").cast(DataTypes.BooleanType))
   .withColumnRenamed("__deleted", "_hoodie_is_deleted")
   .drop("__op", "__source_ts_ms");
   
   log.info("TRANSFORMER SCHEMA STARTS");
   transformedDataset.printSchema();
   transformedDataset.show();
   log.info("TRANSFORMER SCHEMA ENDS");
   return transformedDataset;
   }
   }
   ```
   
   When I add the column, debezium updates the schema registry instantaneously and new records start flowing. It's possible that deltastreamer gets the new-schema records before even hitting the schema registry.
   
   Attaching logs:
   [alog.txt](https://github.com/apache/hudi/files/6044367/alog.txt)
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (HUDI-1269) Make whether the failure of connect hive affects hudi ingest process configurable

2021-02-25 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan resolved HUDI-1269.
---
Fix Version/s: 0.8.0
   Resolution: Fixed

> Make whether the failure of connect hive affects hudi ingest process 
> configurable
> -
>
> Key: HUDI-1269
> URL: https://issues.apache.org/jira/browse/HUDI-1269
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Hive Integration
>Reporter: wangxianghu#1
>Assignee: liujinhui
>Priority: Minor
>  Labels: pull-request-available, user-support-issues
> Fix For: 0.8.0
>
>
> Currently, in an ETL pipeline (e.g. kafka -> hudi -> hive), if the hudi-to-hive
> step fails, the job keeps running.
> I think we can add a switch to control the job behavior (fail or keep running)
> when kafka-to-hudi succeeds but hudi-to-hive fails, leaving the choice to the
> user, since ingesting data into hudi and syncing to hive form one complete task
> in some scenarios.
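
For reference, a minimal sketch of turning the switch on from the Spark datasource. Only the key "hoodie.datasource.hive_sync.ignore_exceptions" comes from the linked PR; the table name, fields and path below are illustrative placeholders, and "df" is assumed to be an existing Dataset of rows to write:

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class HiveSyncIgnoreExceptionsExample {

  // Keep the Hudi ingest going even if the Hive sync step fails.
  public static void writeWithTolerantHiveSync(Dataset<Row> df, String basePath) {
    df.write().format("hudi")
        .option("hoodie.table.name", "my_table")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.datasource.hive_sync.enable", "true")
        .option("hoodie.datasource.hive_sync.database", "default")
        .option("hoodie.datasource.hive_sync.table", "my_table")
        // The new switch: do not fail the write if Hive sync throws.
        .option("hoodie.datasource.hive_sync.ignore_exceptions", "true")
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
{code}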



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[hudi] branch master updated: [HUDI-1269] Make whether the failure of connect hive affects hudi ingest process configurable (#2443)

2021-02-25 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 8c2197a  [HUDI-1269] Make whether the failure of connect hive affects 
hudi ingest process configurable (#2443)
8c2197a is described below

commit 8c2197ae5e9c139e488a33f5a507b79bfa2f6f27
Author: liujinhui <965147...@qq.com>
AuthorDate: Thu Feb 25 23:09:32 2021 +0800

[HUDI-1269] Make whether the failure of connect hive affects hudi ingest 
process configurable (#2443)

Co-authored-by: Sivabalan Narayanan 
---
 .../main/java/org/apache/hudi/DataSourceUtils.java |  2 +
 .../scala/org/apache/hudi/DataSourceOptions.scala  |  2 +
 .../org/apache/hudi/HoodieSparkSqlWriter.scala |  1 +
 .../java/org/apache/hudi/hive/HiveSyncConfig.java  |  4 ++
 .../java/org/apache/hudi/hive/HiveSyncTool.java| 74 +-
 .../org/apache/hudi/hive/TestHiveSyncTool.java | 22 +++
 6 files changed, 76 insertions(+), 29 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
 
b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
index 18c51e3..632a155 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
@@ -293,6 +293,8 @@ public class DataSourceUtils {
 DataSourceWriteOptions.DEFAULT_HIVE_USE_JDBC_OPT_VAL()));
 hiveSyncConfig.autoCreateDatabase = 
Boolean.valueOf(props.getString(DataSourceWriteOptions.HIVE_AUTO_CREATE_DATABASE_OPT_KEY(),
 DataSourceWriteOptions.DEFAULT_HIVE_AUTO_CREATE_DATABASE_OPT_KEY()));
+hiveSyncConfig.ignoreExceptions = 
Boolean.valueOf(props.getString(DataSourceWriteOptions.HIVE_IGNORE_EXCEPTIONS_OPT_KEY(),
+DataSourceWriteOptions.DEFAULT_HIVE_IGNORE_EXCEPTIONS_OPT_KEY()));
 hiveSyncConfig.skipROSuffix = 
Boolean.valueOf(props.getString(DataSourceWriteOptions.HIVE_SKIP_RO_SUFFIX(),
 DataSourceWriteOptions.DEFAULT_HIVE_SKIP_RO_SUFFIX_VAL()));
 hiveSyncConfig.supportTimestamp = 
Boolean.valueOf(props.getString(DataSourceWriteOptions.HIVE_SUPPORT_TIMESTAMP(),
diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
index 965b35c..4b8e97c 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
@@ -347,6 +347,7 @@ object DataSourceWriteOptions {
   val HIVE_USE_PRE_APACHE_INPUT_FORMAT_OPT_KEY = 
"hoodie.datasource.hive_sync.use_pre_apache_input_format"
   val HIVE_USE_JDBC_OPT_KEY = "hoodie.datasource.hive_sync.use_jdbc"
   val HIVE_AUTO_CREATE_DATABASE_OPT_KEY = 
"hoodie.datasource.hive_sync.auto_create_database"
+  val HIVE_IGNORE_EXCEPTIONS_OPT_KEY = 
"hoodie.datasource.hive_sync.ignore_exceptions"
   val HIVE_SKIP_RO_SUFFIX = "hoodie.datasource.hive_sync.skip_ro_suffix"
   val HIVE_SUPPORT_TIMESTAMP = "hoodie.datasource.hive_sync.support_timestamp"
 
@@ -365,6 +366,7 @@ object DataSourceWriteOptions {
   val DEFAULT_USE_PRE_APACHE_INPUT_FORMAT_OPT_VAL = "false"
   val DEFAULT_HIVE_USE_JDBC_OPT_VAL = "true"
   val DEFAULT_HIVE_AUTO_CREATE_DATABASE_OPT_KEY = "true"
+  val DEFAULT_HIVE_IGNORE_EXCEPTIONS_OPT_KEY = "false"
   val DEFAULT_HIVE_SKIP_RO_SUFFIX_VAL = "false"
   val DEFAULT_HIVE_SUPPORT_TIMESTAMP = "false"
 
diff --git 
a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 
b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index f5ba6c8..ef28191 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@@ -374,6 +374,7 @@ private[hudi] object HoodieSparkSqlWriter {
 hiveSyncConfig.useJdbc = parameters(HIVE_USE_JDBC_OPT_KEY).toBoolean
 hiveSyncConfig.useFileListingFromMetadata = 
parameters(HoodieMetadataConfig.METADATA_ENABLE_PROP).toBoolean
 hiveSyncConfig.verifyMetadataFileListing = 
parameters(HoodieMetadataConfig.METADATA_VALIDATE_PROP).toBoolean
+hiveSyncConfig.ignoreExceptions = 
parameters.get(HIVE_IGNORE_EXCEPTIONS_OPT_KEY).exists(r => r.toBoolean)
 hiveSyncConfig.supportTimestamp = 
parameters.get(HIVE_SUPPORT_TIMESTAMP).exists(r => r.toBoolean)
 hiveSyncConfig.autoCreateDatabase = 
parameters.get(HIVE_AUTO_CREATE_DATABASE_OPT_KEY).exists(r => r.toBoolean)
 hiveSyncConfig.decodePartition = 
parameters.getOrElse(URL_ENCODE_PARTITIONING_OPT_KEY,
diff --git 

[jira] [Updated] (HUDI-1269) Make whether the failure of connect hive affects hudi ingest process configurable

2021-02-25 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1269:
--
Status: Open  (was: New)

> Make whether the failure of connect hive affects hudi ingest process 
> configurable
> -
>
> Key: HUDI-1269
> URL: https://issues.apache.org/jira/browse/HUDI-1269
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Hive Integration
>Reporter: wangxianghu#1
>Assignee: liujinhui
>Priority: Minor
>  Labels: pull-request-available, user-support-issues
>
> Currently, in an ETL pipeline (e.g. kafka -> hudi -> hive), if the hudi-to-hive
> step fails, the job keeps running.
> I think we can add a switch to control the job behavior (fail or keep running)
> when kafka-to-hudi succeeds but hudi-to-hive fails, leaving the choice to the
> user, since ingesting data into hudi and syncing to hive form one complete task
> in some scenarios.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1269) Make whether the failure of connect hive affects hudi ingest process configurable

2021-02-25 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1269:
--
Status: In Progress  (was: Open)

> Make whether the failure of connect hive affects hudi ingest process 
> configurable
> -
>
> Key: HUDI-1269
> URL: https://issues.apache.org/jira/browse/HUDI-1269
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Hive Integration
>Reporter: wangxianghu#1
>Assignee: liujinhui
>Priority: Minor
>  Labels: pull-request-available, user-support-issues
>
> Currently, in an ETL pipeline (e.g. kafka -> hudi -> hive), if the hudi-to-hive
> step fails, the job keeps running.
> I think we can add a switch to control the job behavior (fail or keep running)
> when kafka-to-hudi succeeds but hudi-to-hive fails, leaving the choice to the
> user, since ingesting data into hudi and syncing to hive form one complete task
> in some scenarios.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan merged pull request #2443: [HUDI-1269] Make whether the failure of connect hive affects hudi ingest process configurable

2021-02-25 Thread GitBox


nsivabalan merged pull request #2443:
URL: https://github.com/apache/hudi/pull/2443


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Rap70r commented on issue #2586: [SUPPORT] - How to guarantee snapshot isolation when reading Hudi tables in S3?

2021-02-25 Thread GitBox


Rap70r commented on issue #2586:
URL: https://github.com/apache/hudi/issues/2586#issuecomment-785941366


   Hi nsivabalan,
   
   Thank you for your reply.
   
   * Incremental updates include both inserts and updates. Mostly updates.
   * We can try increasing the retained version count to a higher value to give readers more time.
   * We would prefer sticking with COPY_ON_WRITE for now.
   
   I was wondering if we should look into table caching in Spark: 
https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-cache-cache-table.html
   
   Since this would cache the entire table to disk/memory, readers could work off the cached copy. The only downside I can think of is the space usage. Are there any other disadvantages to using cache and persist?
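   
   To make the idea concrete, a rough sketch of what I mean (the path is a placeholder, and I'm assuming a plain snapshot read; older Hudi versions may need partition glob paths on the load):
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   import org.apache.spark.storage.StorageLevel;
   
   public class HudiSnapshotCacheExample {
   
     // Snapshot-read the table once, then pin it so later queries in this application
     // work off a stable cached copy instead of re-listing S3.
     public static Dataset<Row> loadAndCache(SparkSession spark, String basePath) {
       Dataset<Row> snapshot = spark.read().format("hudi").load(basePath);
       snapshot.persist(StorageLevel.MEMORY_AND_DISK());
       snapshot.count(); // materialize the cache eagerly
       return snapshot;
     }
   }
   ```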
   
   Also, we're looking into improving reader speed in combination with increasing the retained version count. When reading a Hudi dataset on S3, does the number of partitions affect reader speed? For example, if the table is partitioned into 200 folders versus 1000 folders (by choosing different partition columns), would that affect the speed of reading the table with a snapshot query: https://hudi.apache.org/docs/querying_data.html#spark-snap-query
   
   Thank you



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rubenssoto commented on issue #2588: [SUPPORT] Cannot create hive connection

2021-02-25 Thread GitBox


rubenssoto commented on issue #2588:
URL: https://github.com/apache/hudi/issues/2588#issuecomment-785933427


   @bvaradar I think it is a Hive issue; I'm trying to increase the Hive heap size and I hope it helps.
   
   I process the tables in threads, so I have almost 20 Hive connections open.
   
   Do you have any experience with Hudi and Hive? Hudi probably executes simple queries to verify the table schema.
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #2448: [SUPPORT] deltacommit for client 172.16.116.102 already exists

2021-02-25 Thread GitBox


bvaradar commented on issue #2448:
URL: https://github.com/apache/hudi/issues/2448#issuecomment-785932137


   @root18039532923 : Please look at 
https://hudi.apache.org/blog/async-compaction-deployment-model/ for running 
async compactions



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #2555: [SUPPORT] Trying and Understanding Clustering

2021-02-25 Thread GitBox


bvaradar commented on issue #2555:
URL: https://github.com/apache/hudi/issues/2555#issuecomment-785930358


   For bulk_insert, you need to tune the write parallelism to control the number and size of output files. Please see 
https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-Whatperformance/ingestlatencycanIexpectforHudiwriting
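   
   As a rough, illustrative sketch only (the table name, fields, path and parallelism number below are placeholders to adapt, with "df" an existing Dataset of input rows):
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   
   public class BulkInsertParallelismExample {
   
     // Size the bulk_insert shuffle parallelism so each output file lands near the
     // desired size (roughly: total input size / target file size).
     public static void bulkInsert(Dataset<Row> df, String basePath) {
       df.write().format("hudi")
           .option("hoodie.table.name", "my_table")
           .option("hoodie.datasource.write.recordkey.field", "id")
           .option("hoodie.datasource.write.precombine.field", "ts")
           .option("hoodie.datasource.write.operation", "bulk_insert")
           .option("hoodie.bulkinsert.shuffle.parallelism", "200")
           .mode(SaveMode.Overwrite)
           .save(basePath);
     }
   }
   ```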



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #2588: [SUPPORT] Cannot create hive connection

2021-02-25 Thread GitBox


bvaradar commented on issue #2588:
URL: https://github.com/apache/hudi/issues/2588#issuecomment-785928870


   @rubenssoto : The stack trace does not contain Hudi in it, so I don't know how to help in this regard. Regarding the high CPU load on the Hive server, are you also running Hive queries apart from the HMS integration?
   
   @n3nash @nsivabalan : Any other ideas?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1641) Issue for Integrating Hudi with Kafka using Avro Schema

2021-02-25 Thread PRASHANT BHOSALE (Jira)
PRASHANT BHOSALE created HUDI-1641:
--

 Summary: Issue for Integrating Hudi with Kafka using Avro Schema
 Key: HUDI-1641
 URL: https://issues.apache.org/jira/browse/HUDI-1641
 Project: Apache Hudi
  Issue Type: Bug
  Components: DeltaStreamer, Spark Integration, Utilities
Reporter: PRASHANT BHOSALE
 Fix For: 0.7.0


I am trying to integrate Hudi with a Kafka topic.

Steps followed:
 # Created a Kafka topic in Confluent with the schema defined.
 # Using kafka-avro-console-producer, I am trying to produce data.
 # I am running Hudi Delta Streamer in continuous mode.

I am getting the below error : 
{code:java}
21/02/25 13:48:14 ERROR TaskResultGetter: Exception while getting task 
result21/02/25 13:48:14 ERROR TaskResultGetter: Exception while getting task 
resultorg.apache.spark.SparkException: Error reading attempting to read avro 
data -- encountered an unknown fingerprint: 103427103938146401, not sure what 
schema to use.  This could happen if you registered additional schemas after 
starting your spark context. at 
org.apache.spark.serializer.GenericAvroSerializer$$anonfun$4.apply(GenericAvroSerializer.scala:141)
 at 
org.apache.spark.serializer.GenericAvroSerializer$$anonfun$4.apply(GenericAvroSerializer.scala:138)
 at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:79) at 
org.apache.spark.serializer.GenericAvroSerializer.deserializeDatum(GenericAvroSerializer.scala:137)
 at 
org.apache.spark.serializer.GenericAvroSerializer.read(GenericAvroSerializer.scala:162)
 at 
org.apache.spark.serializer.GenericAvroSerializer.read(GenericAvroSerializer.scala:47)
 at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731) at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:391)
 at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:302)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813) at 
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:371)
 at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:88) at 
org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:72)
 at 
org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
 at 
org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945) at 
org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)21/02/25 13:48:14 INFO YarnScheduler: 
Removed TaskSet 14.0, whose tasks have all completed, from pool 21/02/25 
13:48:14 INFO YarnScheduler: Cancelling stage 1421/02/25 13:48:14 INFO 
YarnScheduler: Killing all running tasks in stage 14: Stage cancelled21/02/25 
13:48:14 INFO DAGScheduler: ResultStage 14 (isEmpty at DeltaSync.java:380) 
failed in 0.696 s due to Job aborted due to stage failure: Exception while 
getting task result: org.apache.spark.SparkException: Error reading attempting 
to read avro data -- encountered an unknown fingerprint: 103427103938146401, 
not sure what schema to use.  This could happen if you registered additional 
schemas after starting your spark context.21/02/25 13:48:14 INFO DAGScheduler: 
Job 8 failed: isEmpty at DeltaSync.java:380, took 0.704193 s21/02/25 13:48:14 
ERROR HoodieDeltaStreamer: Shutting down delta-sync due to 
exceptionorg.apache.spark.SparkException: Job aborted due to stage failure: 
Exception while getting task result: org.apache.spark.SparkException: Error 
reading attempting to read avro data -- encountered an unknown fingerprint: 
103427103938146401, not sure what schema to use.  This could happen if you 
registered additional schemas after starting your spark context. at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
 at 

[GitHub] [hudi] bvaradar commented on a change in pull request #2520: [HUDI-1446] Support skip bootstrapIndex's init in abstract fs view init

2021-02-25 Thread GitBox


bvaradar commented on a change in pull request #2520:
URL: https://github.com/apache/hudi/pull/2520#discussion_r582853801



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/DefaultBootstrapIndex.java
##
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.bootstrap.index;
+
+import org.apache.hudi.common.model.BootstrapFileMapping;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+
+import java.util.List;
+
+/**
+ * Default Bootstrap Index , which is a emtpy implement and not do anything.
+ */
+public class DefaultBootstrapIndex extends BootstrapIndex {

Review comment:
   Rename to NoOpBootstrapIndex





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #2605: [SUPPORT] How to reload a writeConfig from a existed hudi path ?

2021-02-25 Thread GitBox


bvaradar commented on issue #2605:
URL: https://github.com/apache/hudi/issues/2605#issuecomment-785906565


   Ideally, the Spark datasource should provide that option (something like optionFromFile(...)). Not sure if there is anything like that.
   
   Created: https://issues.apache.org/jira/browse/HUDI-1640?filter=-2



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1640) Implement Spark Datasource option to read hudi configs from properties file

2021-02-25 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290922#comment-17290922
 ] 

Balaji Varadarajan commented on HUDI-1640:
--

[~shivnarayan]: Can you vet this and add it to the work queue?

> Implement Spark Datasource option to read hudi configs from properties file
> ---
>
> Key: HUDI-1640
> URL: https://issues.apache.org/jira/browse/HUDI-1640
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Priority: Major
>
> Provide config option like "hoodie.datasource.props.file" to load all the 
> options from a file.
>  
> GH: https://github.com/apache/hudi/issues/2605



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1640) Implement Spark Datasource option to read hudi configs from properties file

2021-02-25 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1640:


 Summary: Implement Spark Datasource option to read hudi configs 
from properties file
 Key: HUDI-1640
 URL: https://issues.apache.org/jira/browse/HUDI-1640
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Spark Integration
Reporter: Balaji Varadarajan


Provide config option like "hoodie.datasource.props.file" to load all the 
options from a file.

 

GH: https://github.com/apache/hudi/issues/2605
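
For context, a sketch of the manual workaround such an option would replace: loading "hoodie.*" options from a local properties file and applying them to the writer. The file path and the assumption that "df" is an existing Dataset of rows are illustrative only:

{code:java}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

import org.apache.spark.sql.DataFrameWriter;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class PropsFileWorkaround {

  // Read key=value pairs from a properties file and pass each one as a writer option.
  public static void writeWithProps(Dataset<Row> df, String basePath) throws Exception {
    Properties props = new Properties();
    try (InputStream in = Files.newInputStream(Paths.get("/tmp/hudi-write.properties"))) {
      props.load(in);
    }

    DataFrameWriter<Row> writer = df.write().format("hudi").mode(SaveMode.Append);
    for (String key : props.stringPropertyNames()) {
      writer = writer.option(key, props.getProperty(key));
    }
    writer.save(basePath);
  }
}
{code}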



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] bvaradar commented on issue #2592: [SUPPORT] Does latest versions of Hudi (0.7.0, 0.6.0) work with Spark 2.3.0 when reading orc files?

2021-02-25 Thread GitBox


bvaradar commented on issue #2592:
URL: https://github.com/apache/hudi/issues/2592#issuecomment-785896778


   I was unable to get spark-2.3.0 working in my setup. But with spark-2.4.4, this works fine as below. Can you use a spark-2.4.x version? spark-2.3 seems too old, though.
   
   `21/02/25 05:14:26 WARN NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
   Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
   Spark context Web UI available at http://ip-192-168-1-81.ec2.internal:4040
   Spark context available as 'sc' (master = local[*], app id = 
local-1614258873363).
   Spark session available as 'spark'.
   Welcome to
   __
/ __/__  ___ _/ /__
   _\ \/ _ \/ _ `/ __/  '_/
  /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
 /_/
   
   Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 
1.8.0_251)
   Type in expressions to have them evaluated.
   Type :help for more information.
   
   scala> import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.DataSourceWriteOptions
   
   scala> import org.apache.hudi.QuickstartUtils.getQuickstartWriteConfigs
   import org.apache.hudi.QuickstartUtils.getQuickstartWriteConfigs
   
   scala> import org.apache.hudi.config.{HoodieCompactionConfig, 
HoodieStorageConfig, HoodieWriteConfig}
   import org.apache.hudi.config.{HoodieCompactionConfig, HoodieStorageConfig, 
HoodieWriteConfig}
   
   scala> import org.apache.spark.sql.{SaveMode, SparkSession}
   import org.apache.spark.sql.{SaveMode, SparkSession}
   
   scala> import org.apache.spark.sql.functions.{col, concat, lit}
   import org.apache.spark.sql.functions.{col, concat, lit}
   
   scala>
   
   scala> val inputDF = spark.read.format("csv").option("header", 
"true").load("file:///tmp/input.csv")
   inputDF: org.apache.spark.sql.DataFrame = [col_1: string, col_2: string ... 
12 more fields]
   
   scala> val formattedDF = inputDF.selectExpr("col_1", "cast(col_2 as integer) 
col_2",
|   "cast(col_3 as short) col_3", "col_4", "col_5", "cast(col_6 as 
byte) col_6", "cast(col_7 as decimal(9,2)) col_7",
|   "cast(col_8 as decimal(9,2)) col_8", "cast(col_9 as timestamp) 
col_9", "col_10", "cast(col_11 as timestamp) col_11",
|   "col_12", "cntry_cd", "cast(bus_dt as date) bus_dt")
   formattedDF: org.apache.spark.sql.DataFrame = [col_1: string, col_2: int ... 
12 more fields]
   
   scala> formattedDF.printSchema()
   root
|-- col_1: string (nullable = true)
|-- col_2: integer (nullable = true)
|-- col_3: short (nullable = true)
|-- col_4: string (nullable = true)
|-- col_5: string (nullable = true)
|-- col_6: byte (nullable = true)
|-- col_7: decimal(9,2) (nullable = true)
|-- col_8: decimal(9,2) (nullable = true)
|-- col_9: timestamp (nullable = true)
|-- col_10: string (nullable = true)
|-- col_11: timestamp (nullable = true)
|-- col_12: string (nullable = true)
|-- cntry_cd: string (nullable = true)
|-- bus_dt: date (nullable = true)
   
   
   scala> formattedDF.show
   
++-+-+-+-+-+-+-++---++---++--+
   |   col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8| 
  col_9| col_10|  col_11| col_12|cntry_cd|bus_dt|
   
++-+-+-+-+-+-+-++---++---++--+
   |7IN00716079317820...|  716|3|   AB|  INR| null| 0.00| 1.00|2021-02-14 
20:23:...|useridjsb91|2021-02-14 20:23:...|useridjsb91|  IN|2021-02-01|
   |7IN00716079317820...|  716|2|   AB|  INR| null| 0.00| 1.00|2021-02-14 
20:23:...|useridjsb91|2021-02-14 20:23:...|useridjsb91|  IN|2021-02-01|
   |7IN00716079317820...|  716|1|   AB|  INR| null| 0.00| 1.00|2021-02-14 
20:23:...|useridjsb91|2021-02-14 20:23:...|useridjsb91|  IN|2021-02-01|
   |AU700716079381819...| 5700|5|   AB|  INR| null| 0.00| 1.00|2021-02-14 
20:06:...|useridjsb91|2021-02-14 20:06:...|useridjsb91|  IN|2021-02-02|
   |AU700716079381819...| 5700|6|   AB|  INR| null| 4.00| 1.97|2021-02-14 
20:06:...|useridjsb91|2021-02-14 20:06:...|useridjsb91|  IN|2021-02-02|
   |AU700716079381819...| 5700|4|   AB|  INR| null| 0.00| 1.00|2021-02-14 
20:06:...|useridjsb91|2021-02-14 20:06:...|useridjsb91|  AU|2021-02-01|
   |AU700716079381819...| 5700|3|   AB|  INR| null| 0.00| 1.00|2021-02-14 
20:06:...|useridjsb91|2021-02-14 20:06:...|useridjsb91|  AU|2021-02-01|
   |AU700716079381819...| 5700|1|   AB|  INR| null| 0.00| 1.00|2021-02-14 
20:06:...|useridjsb91|2021-02-14 20:06:...|useridjsb91|  

[GitHub] [hudi] codecov-io edited a comment on pull request #2443: [HUDI-1269] Make whether the failure of connect hive affects hudi ingest process configurable

2021-02-25 Thread GitBox


codecov-io edited a comment on pull request #2443:
URL: https://github.com/apache/hudi/pull/2443#issuecomment-760147630


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2443?src=pr=h1) Report
   > Merging 
[#2443](https://codecov.io/gh/apache/hudi/pull/2443?src=pr=desc) (7baf5de) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/b0010bf3b449a9d2e01955b0746b795e22e577db?el=desc)
 (b0010bf) will **increase** coverage by `0.11%`.
   > The diff coverage is `76.66%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2443/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2443?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#2443  +/-   ##
   
   + Coverage 51.15%   51.26%   +0.11% 
   - Complexity 3212 3234  +22 
   
 Files   436  438   +2 
 Lines 1998720107 +120 
 Branches   2057 2073  +16 
   
   + Hits  1022410308  +84 
   - Misses 8922 8947  +25 
   - Partials841  852  +11 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `36.87% <ø> (-0.04%)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `51.35% <ø> (-0.04%)` | `0.00 <ø> (ø)` | |
   | hudiflink | `46.85% <ø> (+1.41%)` | `0.00 <ø> (ø)` | |
   | hudihadoopmr | `33.16% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `69.71% <0.00%> (-0.03%)` | `0.00 <0.00> (ø)` | |
   | hudisync | `49.62% <79.31%> (+1.00%)` | `0.00 <8.00> (ø)` | |
   | huditimelineservice | `66.49% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.46% <ø> (+0.07%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2443?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2443/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZVNwYXJrU3FsV3JpdGVyLnNjYWxh)
 | `49.82% <0.00%> (+<0.01%)` | `0.00 <0.00> (ø)` | |
   | 
[...c/main/java/org/apache/hudi/hive/HiveSyncTool.java](https://codecov.io/gh/apache/hudi/pull/2443/diff?src=pr=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvSGl2ZVN5bmNUb29sLmphdmE=)
 | `70.37% <77.77%> (+1.29%)` | `17.00 <8.00> (+3.00)` | |
   | 
[...main/java/org/apache/hudi/hive/HiveSyncConfig.java](https://codecov.io/gh/apache/hudi/pull/2443/diff?src=pr=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvSGl2ZVN5bmNDb25maWcuamF2YQ==)
 | `97.72% <100.00%> (+0.10%)` | `2.00 <0.00> (ø)` | |
   | 
[...java/org/apache/hudi/common/util/CleanerUtils.java](https://codecov.io/gh/apache/hudi/pull/2443/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvQ2xlYW5lclV0aWxzLmphdmE=)
 | `47.72% <0.00%> (-15.91%)` | `6.00% <0.00%> (ø%)` | |
   | 
[...udi/operator/partitioner/BucketAssignFunction.java](https://codecov.io/gh/apache/hudi/pull/2443/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9wYXJ0aXRpb25lci9CdWNrZXRBc3NpZ25GdW5jdGlvbi5qYXZh)
 | `84.78% <0.00%> (-7.53%)` | `21.00% <0.00%> (+13.00%)` | :arrow_down: |
   | 
[...e/hudi/common/table/log/HoodieLogFormatWriter.java](https://codecov.io/gh/apache/hudi/pull/2443/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRXcml0ZXIuamF2YQ==)
 | `78.12% <0.00%> (-1.57%)` | `26.00% <0.00%> (ø%)` | |
   | 
[...c/main/java/org/apache/hudi/common/fs/FSUtils.java](https://codecov.io/gh/apache/hudi/pull/2443/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL0ZTVXRpbHMuamF2YQ==)
 | `49.31% <0.00%> (-0.46%)` | `61.00% <0.00%> (ø%)` | |
   | 
[...rg/apache/hudi/cli/commands/SavepointsCommand.java](https://codecov.io/gh/apache/hudi/pull/2443/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL1NhdmVwb2ludHNDb21tYW5kLmphdmE=)
 | `13.84% <0.00%> (-0.44%)` | `3.00% <0.00%> (ø%)` | |
   | 
[...ain/java/org/apache/hudi/avro/HoodieAvroUtils.java](https://codecov.io/gh/apache/hudi/pull/2443/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvYXZyby9Ib29kaWVBdnJvVXRpbHMuamF2YQ==)
 | `55.66% <0.00%> (-0.44%)` | `38.00% <0.00%> (ø%)` | |
   | 

[GitHub] [hudi] codecov-io edited a comment on pull request #2378: [HUDI-1491] Support partition pruning for MOR snapshot query

2021-02-25 Thread GitBox


codecov-io edited a comment on pull request #2378:
URL: https://github.com/apache/hudi/pull/2378#issuecomment-751218636


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2378?src=pr=h1) Report
   > Merging 
[#2378](https://codecov.io/gh/apache/hudi/pull/2378?src=pr=desc) (ef2107f) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/97864a48c1979ca2b3f0579cf26bba81fba7e46c?el=desc)
 (97864a4) will **decrease** coverage by `41.57%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2378/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2378?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2378       +/-   ##
   =============================================
   - Coverage     51.19%    9.61%    -41.58%
   + Complexity     3226       48      -3178
   =============================================
     Files           438       53       -385
     Lines         20089     1944     -18145
     Branches       2068      235      -1833
   =============================================
   - Hits          10285      187     -10098
   + Misses         8958     1744      -7214
   + Partials        846       13       -833
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.61% <ø> (-59.85%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2378?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
 | `0.00% <0.00%> 

[jira] [Resolved] (HUDI-1367) Make delastreamer transition from dfsSouce to kafkasouce

2021-02-25 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan resolved HUDI-1367.
---
Fix Version/s: 0.8.0
   Resolution: Fixed

> Make delastreamer transition from dfsSouce to kafkasouce
> 
>
> Key: HUDI-1367
> URL: https://issues.apache.org/jira/browse/HUDI-1367
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> Currently, after using a DFS source to write to Hudi, if you want to use a Kafka 
> source to continue writing, you need to specify the Kafka checkpoint value. I 
> will make the program automatically get the latest or earliest offset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1367) Make delastreamer transition from dfsSouce to kafkasouce

2021-02-25 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reopened HUDI-1367:
---

> Make delastreamer transition from dfsSouce to kafkasouce
> 
>
> Key: HUDI-1367
> URL: https://issues.apache.org/jira/browse/HUDI-1367
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available
>
> Currently, after using a DFS source to write to Hudi, if you want to use a Kafka 
> source to continue writing, you need to specify the Kafka checkpoint value. I 
> will make the program automatically get the latest or earliest offset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1367) Make delastreamer transition from dfsSouce to kafkasouce

2021-02-25 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1367:
--
Status: Patch Available  (was: In Progress)

> Make delastreamer transition from dfsSouce to kafkasouce
> 
>
> Key: HUDI-1367
> URL: https://issues.apache.org/jira/browse/HUDI-1367
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available
>
> Currently, after using a DFS source to write to Hudi, if you want to use a Kafka 
> source to continue writing, you need to specify the Kafka checkpoint value. I 
> will make the program automatically get the latest or earliest offset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1367) Make delastreamer transition from dfsSouce to kafkasouce

2021-02-25 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1367:
--
Status: Closed  (was: Patch Available)

> Make delastreamer transition from dfsSouce to kafkasouce
> 
>
> Key: HUDI-1367
> URL: https://issues.apache.org/jira/browse/HUDI-1367
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available
>
> Currently, after using a DFS source to write to Hudi, if you want to use a Kafka 
> source to continue writing, you need to specify the Kafka checkpoint value. I 
> will make the program automatically get the latest or earliest offset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[hudi] branch master updated: [HUDI-1367] Make deltaStreamer transition from dfsSouce to kafkasouce (#2227)

2021-02-25 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 617cc24  [HUDI-1367] Make deltaStreamer transition from dfsSouce to 
kafkasouce (#2227)
617cc24 is described below

commit 617cc24ad1a28196b872df5663e9e0f48cd7f0fa
Author: liujinhui <965147...@qq.com>
AuthorDate: Thu Feb 25 20:08:13 2021 +0800

[HUDI-1367] Make deltaStreamer transition from dfsSouce to kafkasouce 
(#2227)


Co-authored-by: Sivabalan Narayanan 
---
 .../utilities/sources/helpers/KafkaOffsetGen.java  |  52 ++-
 .../functional/TestHoodieDeltaStreamer.java| 151 +++--
 .../TestHoodieMultiTableDeltaStreamer.java |  21 +--
 .../hudi/utilities/sources/TestKafkaSource.java|   2 +-
 .../utilities/testutils/UtilitiesTestBase.java |  17 ++-
 5 files changed, 213 insertions(+), 30 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
index fc7ba79..e37ec0a 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
@@ -21,6 +21,7 @@ package org.apache.hudi.utilities.sources.helpers;
 import org.apache.hudi.DataSourceUtils;
 import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieDeltaStreamerException;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieNotSupportedException;
 
@@ -40,6 +41,8 @@ import java.util.HashSet;
 import java.util.List;
 import java.util.Map;
 import java.util.Set;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
 import java.util.stream.Collectors;
 
 /**
@@ -49,6 +52,12 @@ public class KafkaOffsetGen {
 
   private static final Logger LOG = LogManager.getLogger(KafkaOffsetGen.class);
 
+  /**
+   * kafka checkpoint Pattern.
+   * Format: topic_name,partition_num:offset,partition_num:offset,
+   */
+  private final Pattern pattern = Pattern.compile(".*,.*:.*");
+
   public static class CheckpointUtils {
 
 /**
@@ -148,7 +157,8 @@ public class KafkaOffsetGen {
 
 private static final String KAFKA_TOPIC_NAME = 
"hoodie.deltastreamer.source.kafka.topic";
 private static final String MAX_EVENTS_FROM_KAFKA_SOURCE_PROP = 
"hoodie.deltastreamer.kafka.source.maxEvents";
-private static final KafkaResetOffsetStrategies DEFAULT_AUTO_RESET_OFFSET 
= KafkaResetOffsetStrategies.LATEST;
+private static final String KAFKA_AUTO_RESET_OFFSETS = 
"hoodie.deltastreamer.source.kafka.auto.reset.offsets";
+private static final KafkaResetOffsetStrategies 
DEFAULT_KAFKA_AUTO_RESET_OFFSETS = KafkaResetOffsetStrategies.LATEST;
 public static final long DEFAULT_MAX_EVENTS_FROM_KAFKA_SOURCE = 500;
 public static long maxEventsFromKafkaSource = 
DEFAULT_MAX_EVENTS_FROM_KAFKA_SOURCE;
   }
@@ -156,15 +166,29 @@ public class KafkaOffsetGen {
   private final HashMap kafkaParams;
   private final TypedProperties props;
   protected final String topicName;
+  private KafkaResetOffsetStrategies autoResetValue;
 
   public KafkaOffsetGen(TypedProperties props) {
 this.props = props;
+
 kafkaParams = new HashMap<>();
 for (Object prop : props.keySet()) {
   kafkaParams.put(prop.toString(), props.get(prop.toString()));
 }
 DataSourceUtils.checkRequiredProperties(props, 
Collections.singletonList(Config.KAFKA_TOPIC_NAME));
 topicName = props.getString(Config.KAFKA_TOPIC_NAME);
+String kafkaAutoResetOffsetsStr = 
props.getString(Config.KAFKA_AUTO_RESET_OFFSETS, 
Config.DEFAULT_KAFKA_AUTO_RESET_OFFSETS.name());
+boolean found = false;
+for (KafkaResetOffsetStrategies entry: 
KafkaResetOffsetStrategies.values()) {
+  if (entry.name().toLowerCase().equals(kafkaAutoResetOffsetsStr)) {
+found = true;
+autoResetValue = entry;
+break;
+  }
+}
+if (!found) {
+  throw new HoodieDeltaStreamerException(Config.KAFKA_AUTO_RESET_OFFSETS + 
" config set to unknown value " + kafkaAutoResetOffsetsStr);
+}
   }
 
   public OffsetRange[] getNextOffsetRanges(Option lastCheckpointStr, 
long sourceLimit, HoodieDeltaStreamerMetrics metrics) {
@@ -186,8 +210,6 @@ public class KafkaOffsetGen {
 fromOffsets = checkupValidOffsets(consumer, lastCheckpointStr, 
topicPartitions);
 
metrics.updateDeltaStreamerKafkaDelayCountMetrics(delayOffsetCalculation(lastCheckpointStr,
 topicPartitions, consumer));
   } else {
-KafkaResetOffsetStrategies autoResetValue = KafkaResetOffsetStrategies
-.valueOf(props.getString("auto.offset.reset", 
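
A minimal, self-contained Java sketch of the behaviour this commit adds (the property name and checkpoint format are taken from the diff above; the enum and helper below are simplified stand-ins, not the actual Hudi classes):

```java
import java.util.Properties;
import java.util.regex.Pattern;

// Simplified stand-in for the config resolution added to KafkaOffsetGen in this commit.
public class KafkaAutoResetOffsetsSketch {

  // Mirrors the KafkaResetOffsetStrategies enum referenced in the diff.
  enum KafkaResetOffsetStrategies { LATEST, EARLIEST }

  private static final String KAFKA_AUTO_RESET_OFFSETS =
      "hoodie.deltastreamer.source.kafka.auto.reset.offsets";

  // Checkpoint format documented in the diff: topic_name,partition_num:offset,partition_num:offset,...
  private static final Pattern CHECKPOINT_PATTERN = Pattern.compile(".*,.*:.*");

  static KafkaResetOffsetStrategies resolveAutoReset(Properties props) {
    String configured = props.getProperty(KAFKA_AUTO_RESET_OFFSETS,
        KafkaResetOffsetStrategies.LATEST.name());
    for (KafkaResetOffsetStrategies strategy : KafkaResetOffsetStrategies.values()) {
      if (strategy.name().equalsIgnoreCase(configured)) {
        return strategy;
      }
    }
    // The real code throws HoodieDeltaStreamerException; a plain exception keeps this sketch standalone.
    throw new IllegalArgumentException(
        KAFKA_AUTO_RESET_OFFSETS + " config set to unknown value " + configured);
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty(KAFKA_AUTO_RESET_OFFSETS, "earliest");
    System.out.println("resolved strategy: " + resolveAutoReset(props));
    System.out.println("valid checkpoint string: "
        + CHECKPOINT_PATTERN.matcher("impressions,0:20,1:30").matches());
  }
}
```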

[GitHub] [hudi] nsivabalan merged pull request #2227: [HUDI-1367] Make deltaStreamer transition from dfsSouce to kafkasouce

2021-02-25 Thread GitBox


nsivabalan merged pull request #2227:
URL: https://github.com/apache/hudi/pull/2227


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2519: [HUDI-1573] Spark Sql Writer support Multi preCmp Field

2021-02-25 Thread GitBox


codecov-io edited a comment on pull request #2519:
URL: https://github.com/apache/hudi/pull/2519#issuecomment-771782258


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2519?src=pr=h1) Report
   > Merging 
[#2519](https://codecov.io/gh/apache/hudi/pull/2519?src=pr=desc) (6f0fd84) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/06dc7c7fd8a867a1e1da90f7dc19b0cc2da69bba?el=desc)
 (06dc7c7) will **increase** coverage by `18.14%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2519/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2519?src=pr=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master     #2519       +/-   ##
   ==============================================
   + Coverage     51.22%    69.36%    +18.14%
   + Complexity     3230       356      -2874
   ==============================================
     Files           438        53       -385
     Lines         20093      1929     -18164
     Branches       2069       230      -1839
   ==============================================
   - Hits          10292      1338      -8954
   + Misses         8954       458      -8496
   + Partials        847       133       -714
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.36% <ø> (-0.16%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2519?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=)
 | `64.53% <0.00%> (-1.17%)` | `32.00% <0.00%> (ø%)` | |
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `70.00% <0.00%> (-0.36%)` | `50.00% <0.00%> (-1.00%)` | |
   | 
[.../apache/hudi/metadata/HoodieTableMetadataUtil.java](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvSG9vZGllVGFibGVNZXRhZGF0YVV0aWwuamF2YQ==)
 | | | |
   | 
[...a/org/apache/hudi/common/util/CompactionUtils.java](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvQ29tcGFjdGlvblV0aWxzLmphdmE=)
 | | | |
   | 
[...e/hudi/common/table/log/HoodieLogFormatReader.java](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRSZWFkZXIuamF2YQ==)
 | | | |
   | 
[.../apache/hudi/common/table/log/HoodieLogFormat.java](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXQuamF2YQ==)
 | | | |
   | 
[...main/scala/org/apache/hudi/HoodieWriterUtils.scala](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZVdyaXRlclV0aWxzLnNjYWxh)
 | | | |
   | 
[.../org/apache/hudi/common/model/BaseAvroPayload.java](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0Jhc2VBdnJvUGF5bG9hZC5qYXZh)
 | | | |
   | 
[...di/source/JsonStringToHoodieRecordMapFunction.java](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zb3VyY2UvSnNvblN0cmluZ1RvSG9vZGllUmVjb3JkTWFwRnVuY3Rpb24uamF2YQ==)
 | | | |
   | 
[...ava/org/apache/hudi/payload/AWSDmsAvroPayload.java](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvcGF5bG9hZC9BV1NEbXNBdnJvUGF5bG9hZC5qYXZh)
 | | | |
   | ... and [374 
more](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree-more) | |
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


[GitHub] [hudi] rakeshramakrishnan edited a comment on issue #2439: [SUPPORT] Unable to sync with external hive metastore via metastore uris in the thrift protocol

2021-02-25 Thread GitBox


rakeshramakrishnan edited a comment on issue #2439:
URL: https://github.com/apache/hudi/issues/2439#issuecomment-785795722


   @nsivabalan : There are no errors; however, through Hudi the connection is made to the local Hive metastore (from Spark), not to the external Hive metastore.
   
   But without Hudi, the Spark catalog fetches Hive tables from the external metastore:
   ```
   spark = SparkSession.builder \
   .appName("test-hudi-hive-sync") \
   .enableHiveSupport() \
   .config("hive.metastore.uris", metastore_uri) \
   .getOrCreate()
   
   print("Before {}".format(spark.catalog.listTables())) --> returns tables 
from `metastore_uri`
   ```
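
   For what it's worth, a hedged Java sketch of one way to point Hudi's hive sync at the external metastore (table name, field names, and paths below are placeholders, and routing through `use_jdbc=false` so the sync picks up `hive.metastore.uris` from the session's Hadoop conf is an assumption, not a verified fix for this issue):

   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   import org.apache.spark.sql.SparkSession;

   public class HudiHiveSyncSketch {
     public static void main(String[] args) {
       String metastoreUri = "thrift://external-hms:9083";  // placeholder URI

       SparkSession spark = SparkSession.builder()
           .appName("test-hudi-hive-sync")
           .enableHiveSupport()
           .config("hive.metastore.uris", metastoreUri)
           .getOrCreate();

       Dataset<Row> df = spark.read().json("/tmp/input");   // placeholder source data

       df.write().format("hudi")
           .option("hoodie.table.name", "my_table")
           .option("hoodie.datasource.write.recordkey.field", "id")      // placeholder field
           .option("hoodie.datasource.write.precombine.field", "ts")     // placeholder field
           .option("hoodie.datasource.write.partitionpath.field", "dt")  // placeholder field
           .option("hoodie.datasource.hive_sync.enable", "true")
           // Assumption: with use_jdbc=false the sync goes through the Hive metastore client,
           // which reads hive.metastore.uris from the Hadoop conf set on the session above.
           .option("hoodie.datasource.hive_sync.use_jdbc", "false")
           .option("hoodie.datasource.hive_sync.database", "default")
           .option("hoodie.datasource.hive_sync.table", "my_table")
           .option("hoodie.datasource.hive_sync.partition_fields", "dt")
           .mode(SaveMode.Append)
           .save("/tmp/hudi/my_table");
     }
   }
   ```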



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rakeshramakrishnan commented on issue #2439: [SUPPORT] Unable to sync with external hive metastore via metastore uris in the thrift protocol

2021-02-25 Thread GitBox


rakeshramakrishnan commented on issue #2439:
URL: https://github.com/apache/hudi/issues/2439#issuecomment-785795722


   @nsivabalan : There are no errors; however, through Hudi the connection is made to the local Hive metastore (from Spark), not to the external Hive metastore.
   
   But without Hudi, the Spark catalog fetches Hive tables from the external metastore:
   ```
   spark = SparkSession.builder \
   .appName("test-hudi-hive-sync") \
   .enableHiveSupport() \
   .config("hive.metastore.uris", metastore_uri) \
   .getOrCreate()
   
   print("Before {}".format(spark.catalog.listTables()))
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Xoln commented on a change in pull request #2520: [HUDI-1446] Support skip bootstrapIndex's init in abstract fs view init

2021-02-25 Thread GitBox


Xoln commented on a change in pull request #2520:
URL: https://github.com/apache/hudi/pull/2520#discussion_r582717002



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/DefaultBootstrapIndex.java
##
@@ -0,0 +1,61 @@
+package org.apache.hudi.common.bootstrap.index;
+
+import org.apache.hudi.common.model.BootstrapFileMapping;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+
+import java.util.*;
+
+public class DefaultBootstrapIndex extends BootstrapIndex {

Review comment:
   Yes, you are right. When AbstractTableFileSystemView is initialized, hoodie.bootstrap.index.class is read from the default config and the index is instantiated. I think the default should be an index that performs no operations and has no dependencies, because some code paths do not check whether the index is null.
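
   A rough null-object sketch of that idea (the interface and method names below are made up for illustration; they are not the real BootstrapIndex API):

   ```java
   import java.util.Collections;
   import java.util.List;

   // Hypothetical lookup interface standing in for the bootstrap index.
   interface BootstrapMappingLookup {
     boolean isPresent();
     List<String> mappedPartitions();
   }

   // Null object: every method is safe to call and reports "no bootstrap data",
   // so callers never need an explicit null check.
   final class NoOpBootstrapMappingLookup implements BootstrapMappingLookup {
     @Override
     public boolean isPresent() {
       return false;
     }

     @Override
     public List<String> mappedPartitions() {
       return Collections.emptyList();   // empty, never null
     }
   }

   public class NullObjectIndexSketch {
     public static void main(String[] args) {
       BootstrapMappingLookup index = new NoOpBootstrapMappingLookup();
       // No null check required before using the index.
       System.out.println("bootstrapped? " + index.isPresent()
           + ", partitions: " + index.mappedPartitions());
     }
   }
   ```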





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io commented on pull request #2604: [hudi-1639][hudi-flink] fix BucketAssigner npe

2021-02-25 Thread GitBox


codecov-io commented on pull request #2604:
URL: https://github.com/apache/hudi/pull/2604#issuecomment-785704660


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2604?src=pr=h1) Report
   > Merging 
[#2604](https://codecov.io/gh/apache/hudi/pull/2604?src=pr=desc) (fed6575) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/06dc7c7fd8a867a1e1da90f7dc19b0cc2da69bba?el=desc)
 (06dc7c7) will **decrease** coverage by `41.52%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2604/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2604?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2604       +/-   ##
   =============================================
   - Coverage     51.22%    9.69%    -41.53%
   + Complexity     3230       48      -3182
   =============================================
     Files           438       53       -385
     Lines         20093     1929     -18164
     Branches       2069      230      -1839
   =============================================
   - Hits          10292      187     -10105
   + Misses         8954     1729      -7225
   + Partials        847       13       -834
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.69% <ø> (-59.83%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2604?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
 | `0.00% <0.00%>