[GitHub] [hudi] bvaradar commented on issue #1875: EMR + Spark Batch job + HUDI + Hive external Metastore (MySQL RDS Instance) failed with No Suitable Driver

2020-07-24 Thread GitBox


bvaradar commented on issue #1875:
URL: https://github.com/apache/hudi/issues/1875#issuecomment-663814729


   @umehrot2 @bschell @zhedoubushishi : Can you guys chime in here?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1878: [SUPPORT] Spark Structured Streaming To Hudi Sink Datasource taking much longer

2020-07-24 Thread GitBox


bvaradar commented on issue #1878:
URL: https://github.com/apache/hudi/issues/1878#issuecomment-663813751


   This is a Spark tuning issue in general. The slowness is due to memory 
pressure and the node failures it causes. In at least one of the batches, I see 
task failures (and retries) while reading from the source parquet files themselves.
   
   As the error message suggests ("Consider boosting 
spark.yarn.executor.memoryOverhead or disabling 
yarn.nodemanager.vmem-check-enabled because of YARN-4714."), you need to 
increase spark.yarn.executor.memoryOverhead. You are running 2 executors per 
machine with 8GB for each, which may not leave much headroom. If you are 
trying to compare a plain parquet write with Hudi, note that Hudi adds metadata 
fields which enable incremental pull, indexing, and other benefits. If your 
original record size is very small and comparable to the metadata overhead, and 
your setup is already close to hitting the memory limit for the parquet write, 
then you would need to give the job more resources. 
   
   On a related note, since you are using streaming to bootstrap from a fixed 
source, have you considered using bulk insert or insert (for file sizing) in 
batch mode, which would sort and write the data once? With the incremental-insert 
mode you are using, Hudi tries to grow the small files generated in the previous 
batch. This means it has to read each small file, apply the new inserts, and 
write a newer, bigger version of the file. The more iterations there are, the 
more repeated reads will happen. Hence, you would benefit from throwing more 
resources at the job for a potentially shorter time to do this migration. 
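   
   For illustration, a minimal sketch of what that batch-mode bulk insert might look like (the paths, table name, parallelism, and memoryOverhead value below are assumptions, not taken from this thread; the option keys follow the ones already used in this issue):
   
   ```python
   # Sketch only: one-shot batch bulk_insert instead of incremental streaming inserts.
   # Submit with extra overhead, e.g. --conf spark.yarn.executor.memoryOverhead=2048 (value is a guess).
   df = spark.read.parquet("s3://source-bucket/raw/")        # read the fixed source once
   
   hudi_options = {
       'hoodie.table.name': 'my_table',                      # placeholder table name
       'hoodie.datasource.write.recordkey.field': 'id',
       'hoodie.datasource.write.partitionpath.field': 'event_date',
       'hoodie.datasource.write.precombine.field': 'LineCreatedTimestamp',
       'hoodie.datasource.write.operation': 'bulk_insert',   # sort and write the data once
       'hoodie.bulkinsert.shuffle.parallelism': 200,         # size to your cluster
   }
   
   df.write \
       .format('hudi') \
       .options(**hudi_options) \
       .mode('append') \
       .save("s3://target-bucket/hudi/my_table/")            # placeholder target path
   ```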
   

   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rubenssoto commented on issue #1878: [SUPPORT] Spark Structured Streaming To Hudi Sink Datasource taking much longer

2020-07-24 Thread GitBox


rubenssoto commented on issue #1878:
URL: https://github.com/apache/hudi/issues/1878#issuecomment-663806475


   I tried resizing the cluster with 3 more nodes, so after resizing I had 4 
nodes in total with 4 cores and 16 GB of RAM each, and it didn't make any 
difference; the job remains very slow and hits memory errors.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Build failed in Jenkins: hudi-snapshot-deployment-0.5 #349

2020-07-24 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.33 KB...]

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[GitHub] [hudi] rubenssoto opened a new issue #1878: [SUPPORT] Spark Structured Streaming To Hudi Sink Datasource taking much longer

2020-07-24 Thread GitBox


rubenssoto opened a new issue #1878:
URL: https://github.com/apache/hudi/issues/1878


   Hi, how are you?
   
   I'm using EMR 5.30.1, Spark 2.4.5, Hudi 0.5.2, and my data is stored in S3.
   
   Since today I have been trying to migrate some of our production datasets to 
Apache Hudi, and I'm having problems with the first one. Could you help me, please?
   
   It is a small dataset, 26 GB distributed across 89 parquet files. I'm reading the 
data with Structured Streaming, 4 files per trigger. When I write the stream to 
regular parquet it works, but when I use Hudi it doesn't.
   
   These are my Hudi options. I tried with and without the shuffle options; I need 
files larger than 500 MB, with a max of 1000 MB:
   
   hudi_options = {
       'hoodie.table.name': tableName,
       'hoodie.datasource.write.recordkey.field': 'id',
       'hoodie.datasource.write.partitionpath.field': 'event_date',
       'hoodie.datasource.write.table.name': tableName,
       'hoodie.datasource.write.operation': 'insert',
       'hoodie.datasource.write.precombine.field': 'LineCreatedTimestamp',
       'hoodie.datasource.write.hive_style_partitioning': 'true',
       'hoodie.parquet.small.file.limit': 5,
       'hoodie.parquet.max.file.size': 8,
       'hoodie.datasource.hive_sync.enable': 'true',
       'hoodie.datasource.hive_sync.table': tableName,
       'hoodie.datasource.hive_sync.database': 'datalake_raw',
       'hoodie.datasource.hive_sync.partition_fields': 'event_date',
       'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
       'hoodie.datasource.hive_sync.jdbcurl': 'jdbc:hive2://:1',
       'hoodie.insert.shuffle.parallelism': 20,
       'hoodie.upsert.shuffle.parallelism': 20
   }
   
   My read and write functions:
   
   def read_parquet_stream(spark_session, read_folder_path, data_schema, max_files_per_trigger):
       spark = spark_session
       df = spark \
           .readStream \
           .option("maxFilesPerTrigger", max_files_per_trigger) \
           .schema(data_schema) \
           .parquet(read_folder_path)
       return df
   
   def write_hudi_dataset_stream(spark_data_frame, checkpoint_location_folder, write_folder_path, hudi_options):
       df_write_query = spark_data_frame \
           .writeStream \
           .options(**hudi_options) \
           .trigger(processingTime='20 seconds') \
           .outputMode('append') \
           .format('hudi') \
           .option("checkpointLocation", checkpoint_location_folder) \
           .start(write_folder_path)
       df_write_query.awaitTermination()
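   
   For context, a hypothetical sketch of how these two functions would be wired together (the schema, paths, and trigger size are placeholders, not taken from this report):
   
   ```python
   from pyspark.sql import SparkSession
   from pyspark.sql.types import StructType, StructField, StringType
   
   spark = SparkSession.builder.appName("hudi-stream-example").getOrCreate()
   
   # Placeholder schema; the real dataset has more columns.
   schema = StructType([StructField("id", StringType()),
                        StructField("event_date", StringType()),
                        StructField("LineCreatedTimestamp", StringType())])
   
   stream_df = read_parquet_stream(spark, "s3://bucket/source/", schema, 4)
   write_hudi_dataset_stream(stream_df, "s3://bucket/checkpoints/table/",
                             "s3://bucket/hudi/table/", hudi_options)
   ```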
   
   I caught some errors:
   
   Job aborted due to stage failure: Task 11 in stage 2.0 failed 4 times, most 
recent failure: Lost task 11.3 in stage 2.0 (TID 53, 
ip-10-0-87-171.us-west-2.compute.internal, executor 9): ExecutorLostFailure 
(executor 9 exited caused by one of the running tasks) Reason: Container killed 
by YARN for exceeding memory limits. 6.3 GB of 5.5 GB physical memory used. 
Consider boosting spark.yarn.executor.memoryOverhead or disabling 
yarn.nodemanager.vmem-check-enabled because of YARN-4714.
   My cluster is small, but the data is small too:
   master: 4 cores and 16 GB RAM
   nodes: 2 nodes with 4 cores and 16 GB each
   
   If I write the stream to regular parquet it takes 38 minutes to finish the job, 
but with Hudi more than an hour and a half has passed and the job still hasn't 
finished.
   
   Could you help me? I need to put this job in production as soon as possible.
   
   Thank you Guys!!! 
   
   
   Screenshots:
   https://user-images.githubusercontent.com/36298331/88447482-27be5000-ce0a-11ea-889a-c5f1042fbe98.png
   https://user-images.githubusercontent.com/36298331/88447495-4cb2c300-ce0a-11ea-9482-7965d7646476.png
   https://user-images.githubusercontent.com/36298331/88447510-81267f00-ce0a-11ea-9311-38a395390d6b.png
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar merged pull request #1874: [MINOR] Use HoodieActiveTimeline.COMMIT_FORMATTER

2020-07-24 Thread GitBox


vinothchandar merged pull request #1874:
URL: https://github.com/apache/hudi/pull/1874


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [MINOR] Use HoodieActiveTimeline.COMMIT_FORMATTER (#1874)

2020-07-24 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 0cb24e4  [MINOR] Use HoodieActiveTimeline.COMMIT_FORMATTER (#1874)
0cb24e4 is described below

commit 0cb24e4a2defd8e639437b6cd145a26f038ef1af
Author: Raymond Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Fri Jul 24 18:48:56 2020 -0700

[MINOR] Use HoodieActiveTimeline.COMMIT_FORMATTER (#1874)
---
 .../java/org/apache/hudi/common/fs/TestFSUtils.java   | 10 +-
 .../apache/hudi/common/model/TestHoodieWriteStat.java |  4 ++--
 .../apache/hudi/common/testutils/HoodieTestUtils.java | 19 +--
 3 files changed, 16 insertions(+), 17 deletions(-)

diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/fs/TestFSUtils.java 
b/hudi-common/src/test/java/org/apache/hudi/common/fs/TestFSUtils.java
index 0e35df5..f1d8078 100644
--- a/hudi-common/src/test/java/org/apache/hudi/common/fs/TestFSUtils.java
+++ b/hudi-common/src/test/java/org/apache/hudi/common/fs/TestFSUtils.java
@@ -37,7 +37,6 @@ import java.io.File;
 import java.io.IOException;
 import java.nio.file.Files;
 import java.nio.file.Paths;
-import java.text.SimpleDateFormat;
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.Date;
@@ -46,6 +45,7 @@ import java.util.UUID;
 import java.util.stream.Collectors;
 import java.util.stream.Stream;
 
+import static 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline.COMMIT_FORMATTER;
 import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertFalse;
 import static org.junit.jupiter.api.Assertions.assertNotNull;
@@ -72,14 +72,14 @@ public class TestFSUtils extends HoodieCommonTestHarness {
 
   @Test
   public void testMakeDataFileName() {
-String instantTime = new SimpleDateFormat("MMddHHmmss").format(new 
Date());
+String instantTime = COMMIT_FORMATTER.format(new Date());
 String fileName = UUID.randomUUID().toString();
 assertEquals(FSUtils.makeDataFileName(instantTime, TEST_WRITE_TOKEN, 
fileName), fileName + "_" + TEST_WRITE_TOKEN + "_" + instantTime + ".parquet");
   }
 
   @Test
   public void testMaskFileName() {
-String instantTime = new SimpleDateFormat("MMddHHmmss").format(new 
Date());
+String instantTime = COMMIT_FORMATTER.format(new Date());
 int taskPartitionId = 2;
 assertEquals(FSUtils.maskWithoutFileId(instantTime, taskPartitionId), "*_" 
+ taskPartitionId + "_" + instantTime + ".parquet");
   }
@@ -144,7 +144,7 @@ public class TestFSUtils extends HoodieCommonTestHarness {
 
   @Test
   public void testGetCommitTime() {
-String instantTime = new SimpleDateFormat("MMddHHmmss").format(new 
Date());
+String instantTime = COMMIT_FORMATTER.format(new Date());
 String fileName = UUID.randomUUID().toString();
 String fullFileName = FSUtils.makeDataFileName(instantTime, 
TEST_WRITE_TOKEN, fileName);
 assertEquals(instantTime, FSUtils.getCommitTime(fullFileName));
@@ -152,7 +152,7 @@ public class TestFSUtils extends HoodieCommonTestHarness {
 
   @Test
   public void testGetFileNameWithoutMeta() {
-String instantTime = new SimpleDateFormat("MMddHHmmss").format(new 
Date());
+String instantTime = COMMIT_FORMATTER.format(new Date());
 String fileName = UUID.randomUUID().toString();
 String fullFileName = FSUtils.makeDataFileName(instantTime, 
TEST_WRITE_TOKEN, fileName);
 assertEquals(fileName, FSUtils.getFileId(fullFileName));
diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/model/TestHoodieWriteStat.java
 
b/hudi-common/src/test/java/org/apache/hudi/common/model/TestHoodieWriteStat.java
index a01effa..7136ce7 100644
--- 
a/hudi-common/src/test/java/org/apache/hudi/common/model/TestHoodieWriteStat.java
+++ 
b/hudi-common/src/test/java/org/apache/hudi/common/model/TestHoodieWriteStat.java
@@ -23,10 +23,10 @@ import org.apache.hudi.common.fs.FSUtils;
 import org.apache.hadoop.fs.Path;
 import org.junit.jupiter.api.Test;
 
-import java.text.SimpleDateFormat;
 import java.util.Date;
 import java.util.UUID;
 
+import static 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline.COMMIT_FORMATTER;
 import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertNull;
 
@@ -37,7 +37,7 @@ public class TestHoodieWriteStat {
 
   @Test
   public void testSetPaths() {
-String instantTime = new SimpleDateFormat("MMddHHmmss").format(new 
Date());
+String instantTime = COMMIT_FORMATTER.format(new Date());
 String basePathString = "/data/tables/some-hoodie-table";
 String partitionPathString = "2017/12/31";
 String fileName = UUID.randomUUID().toString();
diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/testutils/HoodieTestUtils.java
 

[GitHub] [hudi] vinothchandar merged pull request #1877: [MINOR] Add Databricks File System to StorageSchemes

2020-07-24 Thread GitBox


vinothchandar merged pull request #1877:
URL: https://github.com/apache/hudi/pull/1877


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [MINOR] Add Databricks File System to StorageSchemes (#1877)

2020-07-24 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 467d097  [MINOR] Add Databricks File System to StorageSchemes (#1877)
467d097 is described below

commit 467d097dae5d38bfba8e1249ab52fe0f294e6172
Author: Gary Li 
AuthorDate: Fri Jul 24 18:47:09 2020 -0700

[MINOR] Add Databricks File System to StorageSchemes (#1877)
---
 .../src/main/java/org/apache/hudi/common/fs/StorageSchemes.java   | 4 +++-
 .../src/test/java/org/apache/hudi/common/fs/TestStorageSchemes.java   | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/fs/StorageSchemes.java 
b/hudi-common/src/main/java/org/apache/hudi/common/fs/StorageSchemes.java
index 06b92fd..3e721d1 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/fs/StorageSchemes.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/fs/StorageSchemes.java
@@ -49,7 +49,9 @@ public enum StorageSchemes {
   //ALLUXIO
   ALLUXIO("alluxio", false),
   // Tencent Cloud Object Storage
-  COSN("cosn", false);
+  COSN("cosn", false),
+  // Databricks file system
+  DBFS("dbfs", false);
 
   private String scheme;
   private boolean supportsAppend;
diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/fs/TestStorageSchemes.java 
b/hudi-common/src/test/java/org/apache/hudi/common/fs/TestStorageSchemes.java
index 48e4b75..dcb1206 100644
--- 
a/hudi-common/src/test/java/org/apache/hudi/common/fs/TestStorageSchemes.java
+++ 
b/hudi-common/src/test/java/org/apache/hudi/common/fs/TestStorageSchemes.java
@@ -42,6 +42,7 @@ public class TestStorageSchemes {
 assertTrue(StorageSchemes.isAppendSupported("viewfs"));
 assertFalse(StorageSchemes.isAppendSupported("alluxio"));
 assertFalse(StorageSchemes.isAppendSupported("cosn"));
+assertFalse(StorageSchemes.isAppendSupported("dbfs"));
 assertThrows(IllegalArgumentException.class, () -> {
   StorageSchemes.isAppendSupported("s2");
 }, "Should throw exception for unsupported schemes");



[GitHub] [hudi] satishkotha commented on issue #1872: [SUPPORT]Getting 503s from S3 during upserts

2020-07-24 Thread GitBox


satishkotha commented on issue #1872:
URL: https://github.com/apache/hudi/issues/1872#issuecomment-663788217


   Hi,
   
   The number of 'createMarkerFile' calls = (number of partitions) + (number of 
file groups) *touched* by the upsert operation. 
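   
   As a hypothetical illustration with made-up numbers:
   
   ```python
   # Illustration only: the counts below are invented, not taken from this issue.
   partitions_touched = 100     # partitions written by this upsert
   file_groups_touched = 2000   # file groups updated or created by this upsert
   
   create_marker_file_calls = partitions_touched + file_groups_touched
   print(create_marker_file_calls)  # 2100 marker files, i.e. roughly 2100 extra S3 PUT calls
   ```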
   
   How is your workload partitioned? What is 'hoodie.parquet.small.file.limit' 
set to? If you have a lot of small files, then we likely need to create a lot of 
markers (if the upsert workload is distributed across many file groups).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 opened a new pull request #1877: [MINOR] Add Databricks File System to StorageSchemes

2020-07-24 Thread GitBox


garyli1019 opened a new pull request #1877:
URL: https://github.com/apache/hudi/pull/1877


   ## What is the purpose of the pull request
   
   *Add support for the Databricks file system as a mount point on top of Azure Data 
lake*
   
   ## Brief change log
   
   Add dbfs to StorageSchemes
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1876: [HUDI-242] Support for RFC-12/Bootstrapping of external datasets

2020-07-24 Thread GitBox


vinothchandar commented on pull request #1876:
URL: https://github.com/apache/hudi/pull/1876#issuecomment-663787403


   @bvaradar @umehrot2 after many valiant efforts, I finally rebased the original 
#1678 here. I will be working on getting the code review comments addressed and 
the tests passing over the weekend. 
   I will then try to redo #1702 on top of that. 
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar opened a new pull request #1876: [HUDI-242] Support for RFC-12/Bootstrapping of external datasets

2020-07-24 Thread GitBox


vinothchandar opened a new pull request #1876:
URL: https://github.com/apache/hudi/pull/1876


- [HUDI-418] Bootstrap Index Implementation using HFile with unit-test
- [HUDI-421] FileSystem View Changes to support Bootstrap with unit-tests
- [HUDI-424] Implement Query Side Integration for querying tables 
containing bootstrap file slices
- [HUDI-423] Implement upsert functionality for handling updates to these 
bootstrap file slices
- [HUDI-421] Bootstrap Write Client with tests
- [HUDI-425] Added HoodieDeltaStreamer support
- [HUDI-899] Add a knob to change partition-path style while performing 
metadata bootstrap
- [HUDI-900] Metadata Bootstrap Key Generator needs to handle complex keys 
correctly
- [HUDI-424] Simplify Record reader implementation
- [HUDI-423] Implement upsert functionality for handling updates to these 
bootstrap file slices
- [HUDI-420] Hoodie Demo working with hive and sparkSQL. Also, Hoodie CLI 
working with bootstrap tables
   
   Co-authored-by: Mehrotra 
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1858: [WIP] [1014] Part 1: Adding Upgrade or downgrade infra

2020-07-24 Thread GitBox


vinothchandar commented on a change in pull request #1858:
URL: https://github.com/apache/hudi/pull/1858#discussion_r460343368



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
##
@@ -151,6 +154,27 @@ public HoodieTableType getTableType() {
 : Option.empty();
   }
 
+  /**
+   * @return the table version from .hoodie properties file.
+   */
+  public HoodieTableVersion getHoodieTableVersionFromPropertyFile() {
+if (props.contains(HOODIE_TABLE_VERSION_PROP_NAME)) {
+  String propValue = props.getProperty(HOODIE_TABLE_VERSION_PROP_NAME);
+  if (propValue.equals(HoodieTableVersion.ZERO_SIX_ZERO.version)) {
+return HoodieTableVersion.ZERO_SIX_ZERO;
+  }
+}
+return DEFAULT_TABLE_VERSION;
+  }
+
+  /**
+   * @return the current hoodie table version.
+   */
+  public HoodieTableVersion getCurrentHoodieTableVersion() {
+// TODO: fetch current version dynamically

Review comment:
   In `HoodieTableVersion`, or someplace, we need to have a `CURR_VERSION` 
variable that gets bumped to 0.6.1. 
   
   The more I think about this, the more I think it's better to name the versions 0, 1, 2... 
and so on, instead of release numbers. We may not bump this up every release, 
only when an upgrade/downgrade is necessary. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] FelixKJose opened a new issue #1875: EMR + Spark Batch job + HUDI + Hive external Metastore (MySQL RDS Instance) failed with No Suitable Driver

2020-07-24 Thread GitBox


FelixKJose opened a new issue #1875:
URL: https://github.com/apache/hudi/issues/1875


   Hello,
   
   I am getting the following error while using an external RDS instance as the Hive 
metastore. 
   
   **My configuration:**
   
   
   'hoodie.datasource.hive_sync.enable': 'true',
   'hoodie.datasource.hive_sync.database': 'hive_metastore',
   'hoodie.datasource.hive_sync.table': 'calculations',
   'hoodie.datasource.hive_sync.username': 'spark',
   'hoodie.datasource.hive_sync.password': 'password123',
   'hoodie.datasource.hive_sync.jdbcurl': 
'jdbc:mysql://***.us-east-1.rds.amazonaws.com:3306'
   
   Error Stacktrace:
   `org.apache.hudi.hive.HoodieHiveSyncException: Cannot create hive connection 
jdbc:mysql://**.us-east-1.rds.amazonaws.com:3306/
at 
org.apache.hudi.hive.HoodieHiveClient.createHiveConnection(HoodieHiveClient.java:559)
at 
org.apache.hudi.hive.HoodieHiveClient.(HoodieHiveClient.java:108)
at org.apache.hudi.hive.HiveSyncTool.(HiveSyncTool.java:60)
at 
org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:236)
at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:156)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:83)
at 
org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:84)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
   **Caused by: java.sql.SQLException: No suitable driver found for 
jdbc:mysql://**.us-east-1.rds.amazonaws.com:3306**
at java.sql.DriverManager.getConnection(DriverManager.java:689)
at java.sql.DriverManager.getConnection(DriverManager.java:247)
at 
org.apache.hudi.hive.HoodieHiveClient.createHiveConnection(HoodieHiveClient.java:556)
... 35 more`
   
   
   **Environment Description**
* EMR: 6.0.0
   
   * Hudi version : Custom HUDI Jar (provided by **Udit Mehrotra** for EMR 
6.0.0 with performance fixes)
   
   * Spark version : 2.4.4
   
   * Storage (HDFS/S3/GCS..) : S3
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on issue #1866: [SUPPORT]Clean up does not seem to happen on MOR table

2020-07-24 Thread GitBox


satishkotha commented on issue #1866:
URL: https://github.com/apache/hudi/issues/1866#issuecomment-663751174


   Sounds good. Please try it and let me know if you see any issues.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha edited a comment on issue #1866: [SUPPORT]Clean up does not seem to happen on MOR table

2020-07-24 Thread GitBox


satishkotha edited a comment on issue #1866:
URL: https://github.com/apache/hudi/issues/1866#issuecomment-663683323


   > Is there a possibility that commits get archived before clean job is 
resulting in a noop. I will continue to monitor.
   
   clean and archival are somewhat independent today. So this 'noop' should not 
happen.
   
   > Also can you confirm If I can run a clean job in a separate spark job 
concurrently while streaming write is happening, guess it should be fine as 
compaction runs have that ability
   
   Why are you considering a separate Spark job for clean? Are you seeing clean 
take a lot of time? You can consider running clean concurrently with the write by 
setting 'hoodie.clean.async' to true. (This runs clean in the same job, but 
concurrently with the write.) 
   
   I don't know of anyone using a separate Spark job to run clean. Theoretically, 
I think it is possible, but you may have to do some testing because it isn't 
used like this AFAIK.
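   
   A minimal sketch of the async-clean setting described above (the table name and retention value are placeholders, not from this thread):
   
   ```python
   hudi_options = {
       'hoodie.table.name': 'my_table',          # placeholder
       'hoodie.clean.async': 'true',             # clean runs concurrently with the write, in the same job
       'hoodie.cleaner.commits.retained': 10,    # example retention, adjust to your needs
       # ... other write options ...
   }
   ```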



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] luffyd edited a comment on issue #1866: [SUPPORT]Clean up does not seem to happen on MOR table

2020-07-24 Thread GitBox


luffyd edited a comment on issue #1866:
URL: https://github.com/apache/hudi/issues/1866#issuecomment-663741729


   Ok, thanks.
   No, I was not thinking of running it as a separate process continuously, but I wanted 
to execute "clean commands" from the CLI so that my streaming tests progress faster.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] luffyd commented on issue #1866: [SUPPORT]Clean up does not seem to happen on MOR table

2020-07-24 Thread GitBox


luffyd commented on issue #1866:
URL: https://github.com/apache/hudi/issues/1866#issuecomment-663741729


   Ok thanks, I will be running "clean commands" from the Hudi CLI so that my 
streaming tests progress faster.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] umehrot2 commented on issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

2020-07-24 Thread GitBox


umehrot2 commented on issue #1867:
URL: https://github.com/apache/hudi/issues/1867#issuecomment-663734319


   Also, on a side note, we always recommend using the latest EMR release, as it has 
the latest fixes and application versions. So you may want to use `emr-5.30.1` 
instead.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] umehrot2 commented on issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

2020-07-24 Thread GitBox


umehrot2 commented on issue #1867:
URL: https://github.com/apache/hudi/issues/1867#issuecomment-663732545


   @tsolanki95 have you tried using `hoodie.consistency.check.enabled`, which is 
Hudi's built-in mechanism for avoiding `eventual consistency` issues, instead?
   
   As for this particular issue with `EmrFS consistent view`: are these 
temporary errors that resolve on retrying, or is it causing the job to fail? 
Yes, disabling `fs.s3.consistent.metadata.etag.verification.enabled` could be a 
way forward if this is blocking you, while the EMR team investigates this 
issue.
   
   cc @bschell who actually worked on the eTag feature in EmrFS. Do you see any 
obvious cause for this? Otherwise, we can have them open a ticket with AWS 
EMR support and investigate from there.
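   
   For reference, a minimal sketch of enabling that consistency check on a Hudi write (the table name and path are placeholders):
   
   ```python
   hudi_options = {
       'hoodie.table.name': 'my_table',               # placeholder
       'hoodie.consistency.check.enabled': 'true',    # re-check S3 listings until written files become visible
       # ... other write options ...
   }
   
   # df is the DataFrame being written
   df.write.format('hudi').options(**hudi_options).mode('append').save('s3://bucket/path/')
   ```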
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan opened a new pull request #1874: [MINOR] Use HoodieActiveTimeline.COMMIT_FORMATTER

2020-07-24 Thread GitBox


xushiyan opened a new pull request #1874:
URL: https://github.com/apache/hudi/pull/1874


   To avoid repeating the datetime format "yyyyMMddHHmmss".
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on a change in pull request #1873: [HUDI-995] Move TestRawTripPayload and HoodieTestDataGenerator to hudi-common

2020-07-24 Thread GitBox


xushiyan commented on a change in pull request #1873:
URL: https://github.com/apache/hudi/pull/1873#discussion_r460283968



##
File path: 
hudi-common/src/test/java/org/apache/hudi/common/testutils/RawTripTestPayload.java
##
@@ -141,59 +138,4 @@ private String unCompressData(byte[] data) throws 
IOException {
 }
   }
 
-  /**
-   * A custom {@link WriteStatus} that merges passed metadata key value map to 
{@code WriteStatus.markSuccess()} and
-   * {@code WriteStatus.markFailure()}.
-   */
-  public static class MetadataMergeWriteStatus extends WriteStatus {

Review comment:
   moved this to a separate class file; it needs to stay in hudi-client





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on a change in pull request #1873: [HUDI-995] Move TestRawTripPayload and HoodieTestDataGenerator to hudi-common

2020-07-24 Thread GitBox


xushiyan commented on a change in pull request #1873:
URL: https://github.com/apache/hudi/pull/1873#discussion_r460283968



##
File path: 
hudi-common/src/test/java/org/apache/hudi/common/testutils/RawTripTestPayload.java
##
@@ -141,59 +138,4 @@ private String unCompressData(byte[] data) throws 
IOException {
 }
   }
 
-  /**
-   * A custom {@link WriteStatus} that merges passed metadata key value map to 
{@code WriteStatus.markSuccess()} and
-   * {@code WriteStatus.markFailure()}.
-   */
-  public static class MetadataMergeWriteStatus extends WriteStatus {

Review comment:
   move this to a separate class; it needs to stay in hudi-client





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-995) Organize test utils methods and classes

2020-07-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-995:

Labels: pull-request-available  (was: )

> Organize test utils methods and classes
> ---
>
> Key: HUDI-995
> URL: https://issues.apache.org/jira/browse/HUDI-995
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
>
> * Move test utils classes to hudi-common where appropriate, e.g. 
> TestRawTripPayload, HoodieDataGenerator
>  * Organize test utils into separate utils classes like `TransformUtils` for 
> transformations, `SchemaUtils` for schema loading, etc
>  *



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] xushiyan opened a new pull request #1873: [HUDI-995] Move TestRawTripPayload and HoodieTestDataGenerator to hudi-common

2020-07-24 Thread GitBox


xushiyan opened a new pull request #1873:
URL: https://github.com/apache/hudi/pull/1873


   To allow wider access to these classes
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] umehrot2 commented on issue #1847: [SUPPORT] querying MoR tables on S3 becomes slow with number of files growing

2020-07-24 Thread GitBox


umehrot2 commented on issue #1847:
URL: https://github.com/apache/hudi/issues/1847#issuecomment-663728553


   @bvaradar EMR only overrides `getLen()` if the customer has explicitly 
enabled `Client Side Encryption` using the EmrFS property `fs.s3.cse.enabled`. 
In that case, I see that EmrFS needs to make a couple of `S3 calls`. But based 
on my brief conversation with @zuyanton, he mentioned he is most likely not 
enabling this feature. I would let him confirm this, and if that's true the EMR 
team can look further into the possibility of optimizations in that code 
path.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tsolanki95 commented on issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

2020-07-24 Thread GitBox


tsolanki95 commented on issue #1867:
URL: https://github.com/apache/hudi/issues/1867#issuecomment-663698781


   This is also a field where data quality, precision, and accuracy are 
important. EMRFS consistent view helps keep us from having issues with S3 
consistency. Some of the features that Hudi provides, such as rollback 
capabilities and auditing and tracking of changes made to our table, are 
incredibly powerful for finding and isolating data quality errors and for 
rolling back and rerunning data with fixed input data/code.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tsolanki95 edited a comment on issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

2020-07-24 Thread GitBox


tsolanki95 edited a comment on issue #1867:
URL: https://github.com/apache/hudi/issues/1867#issuecomment-663688633


   @luffyd We put in consistent view as a solution earlier, based on AWS 
support, to solve issues where using Spark with S3's eventual consistency model 
caused duplicates in our data. We are now looking at moving some of our 
datasets to Hudi, but our compute resources still use EMRFS consistent view. As 
part of the transition, when some of our datasets use Hudi and some do not, it 
would be good to be able to run Spark with Hudi on EMRFS consistent view.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] luffyd commented on issue #1872: [SUPPORT]Getting 503s from S3 during upserts

2020-07-24 Thread GitBox


luffyd commented on issue #1872:
URL: https://github.com/apache/hudi/issues/1872#issuecomment-663690493


   I have noticed that slowing down ingestion worked.
   It seems like each call to "HoodieWriteHandle.createMarkerFile" results in an S3 call.
   But can you give any hints on: 
   1. how the number of calls to "HoodieWriteHandle.createMarkerFile" relates to the 
number of partitions
   2. how the number of calls to "HoodieWriteHandle.createMarkerFile" relates to the 
number of files in a partition
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] asheeshgarg commented on issue #1787: Exception During Insert

2020-07-24 Thread GitBox


asheeshgarg commented on issue #1787:
URL: https://github.com/apache/hudi/issues/1787#issuecomment-663688727


   @bvaradar I am getting the same exception. I had added the jars to the --jars 
option of spark-submit, so they are available to both the driver and the executors.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tsolanki95 edited a comment on issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

2020-07-24 Thread GitBox


tsolanki95 edited a comment on issue #1867:
URL: https://github.com/apache/hudi/issues/1867#issuecomment-663688633


   @luffyd We put in consistent view as a solution earlier, based on AWS 
support, to solve issues with using spark with S3 eventual consistency model. 
We are now looking towards changing some of our datasets to utilize hudi but 
our compute resources still utilize EMRFS consistent view. As part of the 
transition, when some of our datasets utilize hudi and some do not, it would be 
good to be able to run spark with hudi on EMRFS consistent view.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tsolanki95 commented on issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

2020-07-24 Thread GitBox


tsolanki95 commented on issue #1867:
URL: https://github.com/apache/hudi/issues/1867#issuecomment-663688633


   We put in consistent view as a solution earlier, based on AWS support, to 
solve issues with using spark with S3 eventual consistency model. We are now 
looking towards changing some of our datasets to utilize hudi but our compute 
resources still utilize EMRFS consistent view. As part of the transition, when 
some of our datasets utilize hudi and some do not, it would be good to be able 
to run spark with hudi on EMRFS consistent view.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on issue #1872: [SUPPORT]Getting 503s from S3 during upserts

2020-07-24 Thread GitBox


satishkotha commented on issue #1872:
URL: https://github.com/apache/hudi/issues/1872#issuecomment-663688088


   This is likely more of an AWS support question. A quick search shows 
https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-503-slow-down/ 
   
   Can you see if any of the solutions there work for you? You may have to slow 
down ingestion. 
   
   (I don't have a lot of experience with AWS EMR. Others in the community, 
please comment if you have worked around a similar problem before.)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha edited a comment on issue #1866: [SUPPORT]Clean up does not seem to happen on MOR table

2020-07-24 Thread GitBox


satishkotha edited a comment on issue #1866:
URL: https://github.com/apache/hudi/issues/1866#issuecomment-663683323


   > Is there a possibility that commits get archived before clean job is 
resulting in a noop. I will continue to monitor.
   
   clean and archival are somewhat independent today. So this 'noop' should not 
happen.
   
   > Also can you confirm If I can run a clean job in a separate spark job 
concurrently while streaming write is happening, guess it should be fine as 
compaction runs have that ability
   Why are you considering separate spark job for clean? Are you seeing clean 
take a lot of time? You can consider running clean concurrently with write by 
setting 'hoodie.clean.async' to true. (This runs clean in same job, but 
concurrently with write). 
   
   I don't know of anyone using separate spark job to run clean. Theoretically, 
I think it is possible. But you may have to do some testing because it isn't 
used like this afaik.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on issue #1866: [SUPPORT]Clean up does not seem to happen on MOR table

2020-07-24 Thread GitBox


satishkotha commented on issue #1866:
URL: https://github.com/apache/hudi/issues/1866#issuecomment-663683323


   > Is there a possibility that commits get archived before clean job is 
resulting in a noop. I will continue to monitor.
   clean and archival are somewhat independent. So noop should not happen.
   
   > Also can you confirm If I can run a clean job in a separate spark job 
concurrently while streaming write is happening, guess it should be fine as 
compaction runs have that ability
   Why are you considering separate spark job for clean? Are you seeing clean 
take a lot of time? You can consider running clean concurrently with write by 
setting 'hoodie.clean.async' to true. (This runs clean in same job, but 
concurrently with write). 
   
   I don't know of anyone using separate spark job to run clean. Theoretically, 
I think it is possible. But you may have to do some testing because it isn't 
used like this afaik.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha edited a comment on issue #1866: [SUPPORT]Clean up does not seem to happen on MOR table

2020-07-24 Thread GitBox


satishkotha edited a comment on issue #1866:
URL: https://github.com/apache/hudi/issues/1866#issuecomment-663683323


   > Is there a possibility that commits get archived before clean job is 
resulting in a noop. I will continue to monitor.
   
   clean and archival are somewhat independent. So noop should not happen.
   
   > Also can you confirm If I can run a clean job in a separate spark job 
concurrently while streaming write is happening, guess it should be fine as 
compaction runs have that ability
   Why are you considering separate spark job for clean? Are you seeing clean 
take a lot of time? You can consider running clean concurrently with write by 
setting 'hoodie.clean.async' to true. (This runs clean in same job, but 
concurrently with write). 
   
   I don't know of anyone using separate spark job to run clean. Theoretically, 
I think it is possible. But you may have to do some testing because it isn't 
used like this afaik.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1787: Exception During Insert

2020-07-24 Thread GitBox


bvaradar commented on issue #1787:
URL: https://github.com/apache/hudi/issues/1787#issuecomment-663677973


   @asheeshgarg : I may have accidentally deleted a comment from you. Has the issue 
been resolved?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar closed issue #1856: [SUPPORT] HiveSyncTool fails on alter table cascade

2020-07-24 Thread GitBox


bvaradar closed issue #1856:
URL: https://github.com/apache/hudi/issues/1856


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1856: [SUPPORT] HiveSyncTool fails on alter table cascade

2020-07-24 Thread GitBox


bvaradar commented on issue #1856:
URL: https://github.com/apache/hudi/issues/1856#issuecomment-663677430


   Please reopen if you need further clarifications.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1864: Spark 2.2.0 is compatible?

2020-07-24 Thread GitBox


bvaradar commented on issue #1864:
URL: https://github.com/apache/hudi/issues/1864#issuecomment-663677139


   Closing this ticket. Please reach out on Slack or open a new ticket if you 
find any issues.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar closed issue #1864: Spark 2.2.0 is compatible?

2020-07-24 Thread GitBox


bvaradar closed issue #1864:
URL: https://github.com/apache/hudi/issues/1864


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] luffyd commented on issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

2020-07-24 Thread GitBox


luffyd commented on issue #1867:
URL: https://github.com/apache/hudi/issues/1867#issuecomment-663659622


   @tsolanki95 Does this happen at read time? In my tests, I noticed the eTags 
are not in sync for the .hoodie folder.
   Also, what are your reasons for enabling consistent view when using Hudi?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] luffyd opened a new issue #1872: [SUPPORT]Getting 503s from S3 during upserts

2020-07-24 Thread GitBox


luffyd opened a new issue #1872:
URL: https://github.com/apache/hudi/issues/1872


   **_Tips before filing an issue_**
   
   - Have you gone through our 
[FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   My setup has 1000 partitions and 24 billion records, created via bulk insert. 
I am running a test with 3M (million) new records and 9M updates, so 12M upserts in total.
   
   I kept getting 503s when there were 100 partitions, so I increased the number 
of partitions to get around the S3 503 throttles. But that does not seem to be the issue.
   
   Can you help me debug this further? I am trying to reduce the amount of 
writes, but I want to understand what exactly the bottleneck is in terms of S3 
activity (most often I see a problem with GetObjectMetadataCall throttling).
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create 24B records with 1000 partitions
   2. I have 25 retries configured for S3 throttles; I was hoping it would keep 
processing (slowly) rather than throw a FATAL error
   ```
   config.set("spark.hadoop.fs.s3.maxRetries", "25")
   config.set("spark.hadoop.fs.s3.sleepTimeSeconds", "60")
   ```
   3. Have 12M upserts (1:3 insert to upsert ratio) running continuously
   
   **Expected behavior**
   
   I was expecting upsert to happen smoothly
   
   **Environment Description**
   
   * Hudi version : 0.5.3
   
   * Spark version : 2.4.4
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) :no
   
   
   **Additional context**
   
   Looking at the stack trace, my thought is that there is a lot of S3 activity 
happening to create and maintain markers. I guessed that increasing partitions 
would help, but from my observation it made things worse: the 1000-partition 
dataset is performing worse than the 100-partition dataset.
   
   **Stacktrace**
   
   ```Exception in thread "main" org.apache.spark.SparkException: Job aborted 
due to stage failure: Task 40822 in stage 53.0 failed 4 times, most recent 
failure: Lost task 40822.3 in stage 53.0 (TID 376598, 
ip-10-0-1-217.us-west-2.compute.internal, executor 69): org.apache.hudi
   .exception.HoodieUpsertException: Error upserting bucketType UPDATE for 
partition :40822
   at 
org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:253)
   at 
org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:102)
   at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
   at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
   at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
   at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
   at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
   at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:359)
   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:357)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1181)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1155)
   at 
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1090)
   at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1155)
   at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:881)
   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:308)
   at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   at org.apache.spark.scheduler.Task.run(Task.scala:123)
   at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
   at 

[jira] [Assigned] (HUDI-995) Organize test utils methods and classes

2020-07-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-995:
---

Assignee: Raymond Xu

> Organize test utils methods and classes
> ---
>
> Key: HUDI-995
> URL: https://issues.apache.org/jira/browse/HUDI-995
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>
> * Move test utils classes to hudi-common where appropriate, e.g. 
> TestRawTripPayload, HoodieDataGenerator
>  * Organize test utils into separate utils classes like `TransformUtils` for 
> transformations, `SchemaUtils` for schema loading, etc
>  *



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-995) Organize test utils methods and classes

2020-07-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-995:

Description: 
* Move test utils classes to hudi-common where appropriate, e.g. 
TestRawTripPayload, HoodieDataGenerator
 * Organize test utils into separate utils classes like `TransformUtils` for 
transformations, `SchemaUtils` for schema loading, etc
 *

  was:
* add a new module {{hudi-testutils}} and add it to all other modules as test 
dep and remove {{hudi-common}} etc from test dep list
 * selectively migrate test util classes like data gen to {{hudi-testutils}}
 * provide utils to be able generalize base file/log file style testing.


> Organize test utils methods and classes
> ---
>
> Key: HUDI-995
> URL: https://issues.apache.org/jira/browse/HUDI-995
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Priority: Major
>
> * Move test utils classes to hudi-common where appropriate, e.g. 
> TestRawTripPayload, HoodieDataGenerator
>  * Organize test utils into separate utils classes like `TransformUtils` for 
> transformations, `SchemaUtils` for schema loading, etc
>  *



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-995) Organize test utils methods and classes

2020-07-24 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-995:

Summary: Organize test utils methods and classes  (was: Add hudi-testutils 
module)

> Organize test utils methods and classes
> ---
>
> Key: HUDI-995
> URL: https://issues.apache.org/jira/browse/HUDI-995
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Priority: Major
>
> * add a new module {{hudi-testutils}} and add it to all other modules as test 
> dep and remove {{hudi-common}} etc from test dep list
>  * selectively migrate test util classes like data gen to {{hudi-testutils}}
>  * provide utils to be able generalize base file/log file style testing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] luffyd commented on issue #1866: [SUPPORT]Clean up does not seem to happen on MOR table

2020-07-24 Thread GitBox


luffyd commented on issue #1866:
URL: https://github.com/apache/hudi/issues/1866#issuecomment-663648589


   Thanks saitsh,
   I have inline turned on by default, and now I see that cleans did happen! Is there a 
possibility that commits get archived before the clean job runs, resulting in a no-op? 
I will continue to monitor.
   
   Also, can you confirm whether I can run a clean job in a separate Spark job 
concurrently while the streaming write is happening? I guess it should be fine, since 
compaction runs have that ability.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ssomuah commented on issue #1852: [SUPPORT]

2020-07-24 Thread GitBox


ssomuah commented on issue #1852:
URL: https://github.com/apache/hudi/issues/1852#issuecomment-663646201


   Hi Balaji, I think I've narrowed down my issue somewhat for my MOR table. 
   
   I started again with a fresh table and the initial commits make sense, but 
after a time I notice it's consistently trying to write 300+ files.
   
   Screenshot: https://user-images.githubusercontent.com/2061955/88417393-da14f980-cdaf-11ea-87ab-63f3aafade83.png
   
   Screenshot: https://user-images.githubusercontent.com/2061955/88417402-de411700-cdaf-11ea-85dd-c10c405851d3.png
   
   Screenshot: https://user-images.githubusercontent.com/2061955/88417424-e5682500-cdaf-11ea-9c4b-534e27d80c45.png
   
   
   The individual tasks don't take that long so I think if I could reduce the 
number of files it's trying to write it would help. 
   Screenshot: https://user-images.githubusercontent.com/2061955/88417487-fca71280-cdaf-11ea-9fc0-10a8a074501c.png
   
   
   I can also see from the CLI that, whether it's doing a compaction or a delta 
commit, I still seem to be writing the same number of files for a fraction of 
the data.
   Screenshot: https://user-images.githubusercontent.com/2061955/88417841-aa1a2600-cdb0-11ea-808f-d66595af91ea.png
   
   
   Is there something I can tune to reduce the number of files it breaks the 
data into?
   
   hoodie.logfile.max.size is 256MB
   hoodie.parquet.max.file.size is 256MB
   hoodie.parquet.compression.ratio is the default .35
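   
   For anyone following along, a rough sketch of where these sizing knobs go on a Spark 
datasource write. This is not the writer from this thread: `inputDf`, the key/partition 
fields and the target path are placeholders, the values are illustrative rather than 
recommendations, and `hoodie.parquet.small.file.limit` is included only because it also 
influences how many files get rewritten per commit.
   
   ```scala
   // Hypothetical sketch only: inputDf, key/partition fields and the path are placeholders.
   inputDf.write
     .format("org.apache.hudi")
     .option("hoodie.table.name", "my_table")
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.datasource.write.partitionpath.field", "dt")
     // target size for base (parquet) files
     .option("hoodie.parquet.max.file.size", String.valueOf(256 * 1024 * 1024))
     // files below this size are candidates to receive new inserts instead of spawning new files
     .option("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024))
     // roll over MOR log files once they reach this size
     .option("hoodie.logfile.max.size", String.valueOf(256 * 1024 * 1024))
     // estimated compression ratio used when sizing parquet files
     .option("hoodie.parquet.compression.ratio", "0.35")
     .mode("append")
     .save("s3://bucket/path/to/table")
   ```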



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1847: [SUPPORT] querying MoR tables on S3 becomes slow with number of files growing

2020-07-24 Thread GitBox


bvaradar commented on issue #1847:
URL: https://github.com/apache/hudi/issues/1847#issuecomment-663450319


   @bschell : Thanks for the information. As getLen() is used extensively on both the 
read and write sides, can you elaborate on which cases actually result in RPC calls? 
Is there an ability to cache within the implementation?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lw309637554 commented on pull request #1810: [HUDI-875] Abstract hudi-sync-common, and support hudi-hive-sync

2020-07-24 Thread GitBox


lw309637554 commented on pull request #1810:
URL: https://github.com/apache/hudi/pull/1810#issuecomment-663442107


   > couple
   
   Okay, thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1852: [SUPPORT]

2020-07-24 Thread GitBox


bvaradar commented on issue #1852:
URL: https://github.com/apache/hudi/issues/1852#issuecomment-663427905


   What do you mean by "runs serially with ingestion"? My understanding was 
that inline compaction happened in the same flow as writing so an inline 
compaction would simply slow down ingestion.
   
===> Yes, that is what I meant. Inline Compaction would run after ingestion 
but not in parallel. You can use #1752 to have it run concurrently.
   
   Does INLINE_COMPACT_NUM_DELTA_COMMITS_PROP refer to the number of commits 
retained in general, or the number of commits for a record?
   
   ==> INLINE_COMPACT_NUM_DELTA_COMMITS_PROP refers to the number of ingestion rounds 
(delta commits) between 2 compaction runs. 
   
   I see in the timeline I have several clean.requested and clean.inflight, how 
can I get these to actually complete?
   
   ==> If it is stuck in the inflight state, there could be errors when Hudi is 
trying to clean up. Please look for exceptions in the driver logs. The cleaner should 
run automatically by default, and any pending clean operations will automatically get 
picked up in the next ingestion. So, it must be failing for some reason. You can turn 
on logs to see what is happening.
   
   Is it possible to force a compaction of the existing log files.
   
   ===> Yes, by configuring INLINE_COMPACT_NUM_DELTA_COMMITS_PROP. You can set 
it to 1 to have aggressive compaction. 
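   
   For reference, a hedged sketch of how those settings map to datasource options 
(option keys as named in current releases; the DataFrame, table name and path below 
are placeholders):
   
   ```scala
   // Hypothetical sketch: df, the table name and the path are placeholders.
   df.write
     .format("org.apache.hudi")
     .option("hoodie.table.name", "my_mor_table")
     // run compaction inline, i.e. right after ingestion in the same job
     .option("hoodie.compact.inline", "true")
     // INLINE_COMPACT_NUM_DELTA_COMMITS_PROP: delta commits between compactions (1 = most aggressive)
     .option("hoodie.compact.inline.max.delta.commits", "1")
     .mode("append")
     .save("s3://bucket/path/to/table")
   ```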
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1867: [SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view

2020-07-24 Thread GitBox


bvaradar commented on issue #1867:
URL: https://github.com/apache/hudi/issues/1867#issuecomment-663414676


   @umehrot2 : Can you help answer this question? Thanks.
   Balaji.V



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1860: [SUPPORT] Issue when querying from Spark Datasource if COW table is being written to at the same time

2020-07-24 Thread GitBox


bvaradar commented on issue #1860:
URL: https://github.com/apache/hudi/issues/1860#issuecomment-663413173


   I would expect the data to be the same across query engines unless there is some 
caching or GS is not giving a consistent listing view.
   
   With Hudi's Spark datasource integration, Hudi reuses Spark's parquet data 
source implementation and merely applies a file-level path filter to pick and 
choose which files to read. You can do something like 
select(distinct("_hoodie_file_name")) in both cases to see if any file is 
getting missed. You can also run select(max("_hoodie_commit_time")) to determine 
the highest committed time and check whether it is consistent, to verify atomicity. 
Otherwise, I suggest you also run similar experiments with plain Parquet or other 
datasets. 
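   
   For reference, a quick sketch of those two checks from spark-shell (the load path 
and glob depth are placeholders and depend on how the table is partitioned):
   
   ```scala
   // Hypothetical verification queries comparing what the Spark datasource reads.
   import org.apache.spark.sql.functions.{col, countDistinct, max}
   
   val df = spark.read.format("org.apache.hudi").load("gs://bucket/path/to/table/*/*")
   
   // Files actually picked up by this reader
   df.select(countDistinct(col("_hoodie_file_name"))).show()
   df.select(col("_hoodie_file_name")).distinct().show(false)
   
   // Highest committed instant visible to this reader; compare across engines to check atomicity
   df.select(max(col("_hoodie_commit_time"))).show()
   ```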
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] sbernauer commented on issue #1845: [SUPPORT] Support for Schema evolution. Facing an error

2020-07-24 Thread GitBox


sbernauer commented on issue #1845:
URL: https://github.com/apache/hudi/issues/1845#issuecomment-663373241


   > 4. We ingest old events again (there are some upserts). ?? What schema 
is being used here?
   
   At this step I used SCHEMA_V2.
   We use DeltaStreamer in continuous mode and only restart it in step 2, where 
we provide the new SCHEMA_V2 to the DeltaStreamer.
   I tried to reproduce everything as closely as possible in my 
[DeltaStreamer-test](https://github.com/apache/hudi/pull/1844/files#diff-2c3763c5782af9c3cbc02e2935211587R476)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1123) Document the usage of user define metrics reporter

2020-07-24 Thread Zheren Yu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164179#comment-17164179
 ] 

Zheren Yu commented on HUDI-1123:
-

@leesf

Thank you for assigning 

> Document the usage of user define metrics reporter
> --
>
> Key: HUDI-1123
> URL: https://issues.apache.org/jira/browse/HUDI-1123
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: leesf
>Assignee: Zheren Yu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1124) Document the usage of Tencent COSN

2020-07-24 Thread deyzhong (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164181#comment-17164181
 ] 

deyzhong commented on HUDI-1124:


ok, I will finish the work as soon as possible.

 

> Document the usage of Tencent COSN
> --
>
> Key: HUDI-1124
> URL: https://issues.apache.org/jira/browse/HUDI-1124
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: leesf
>Assignee: deyzhong
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1124) Document the usage of Tencent COSN

2020-07-24 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164176#comment-17164176
 ] 

leesf commented on HUDI-1124:
-

[~meimile] Assign the ticket to you and feel free to open a new PR, thanks

> Document the usage of Tencent COSN
> --
>
> Key: HUDI-1124
> URL: https://issues.apache.org/jira/browse/HUDI-1124
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: leesf
>Assignee: deyzhong
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1124) Document the usage of Tencent COSN

2020-07-24 Thread leesf (Jira)
leesf created HUDI-1124:
---

 Summary: Document the usage of Tencent COSN
 Key: HUDI-1124
 URL: https://issues.apache.org/jira/browse/HUDI-1124
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: leesf
Assignee: deyzhong






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1123) Document the usage of user define metrics reporter

2020-07-24 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164174#comment-17164174
 ] 

leesf commented on HUDI-1123:
-

[~york831] . Assign the ticket to you and feel free to open a new PR, thanks

> Document the usage of user define metrics reporter
> --
>
> Key: HUDI-1123
> URL: https://issues.apache.org/jira/browse/HUDI-1123
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: leesf
>Assignee: Zheren Yu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1123) Document the usage of user define metrics reporter

2020-07-24 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf reassigned HUDI-1123:
---

Assignee: Zheren Yu

> Document the usage of user define metrics reporter
> --
>
> Key: HUDI-1123
> URL: https://issues.apache.org/jira/browse/HUDI-1123
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: leesf
>Assignee: Zheren Yu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1123) Document the usage of user define metrics reporter

2020-07-24 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-1123:

Description: (was: [~york831] . Assign the ticket to you and feel free 
to open a new PR, thanks)

> Document the usage of user define metrics reporter
> --
>
> Key: HUDI-1123
> URL: https://issues.apache.org/jira/browse/HUDI-1123
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: leesf
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1123) Document the usage of user define metrics reporter

2020-07-24 Thread leesf (Jira)
leesf created HUDI-1123:
---

 Summary: Document the usage of user define metrics reporter
 Key: HUDI-1123
 URL: https://issues.apache.org/jira/browse/HUDI-1123
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: leesf


[~york831] . Assign the ticket to you and feel free to open a new PR, thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1113) Support user defined metrics reporter

2020-07-24 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf closed HUDI-1113.
---

> Support user defined metrics reporter
> -
>
> Key: HUDI-1113
> URL: https://issues.apache.org/jira/browse/HUDI-1113
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Zheren Yu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Now the metrics reporter only supports Datadog, JMX, and Graphite; once users want to
> add their own metrics it becomes difficult (our team is using New Relic). Also, not
> everyone wants dependencies they don't need added to the Hudi components. So I suggest
> that a user-defined metrics reporter would make it easier to monitor metrics everywhere.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1113) Support user defined metrics reporter

2020-07-24 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf resolved HUDI-1113.
-
Resolution: Fixed

> Support user defined metrics reporter
> -
>
> Key: HUDI-1113
> URL: https://issues.apache.org/jira/browse/HUDI-1113
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Zheren Yu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Now the metrics reporter only supports Datadog, JMX, and Graphite; once users want to
> add their own metrics it becomes difficult (our team is using New Relic). Also, not
> everyone wants dependencies they don't need added to the Hudi components. So I suggest
> that a user-defined metrics reporter would make it easier to monitor metrics everywhere.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1113) Support user defined metrics reporter

2020-07-24 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-1113:

Fix Version/s: (was: 0.5.3)
   0.60

> Support user defined metrics reporter
> -
>
> Key: HUDI-1113
> URL: https://issues.apache.org/jira/browse/HUDI-1113
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Zheren Yu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.60
>
>
> Now the metrics reporter only supports Datadog, JMX, and Graphite; once users want to
> add their own metrics it becomes difficult (our team is using New Relic). Also, not
> everyone wants dependencies they don't need added to the Hudi components. So I suggest
> that a user-defined metrics reporter would make it easier to monitor metrics everywhere.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1113) Support user defined metrics reporter

2020-07-24 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-1113:

Fix Version/s: (was: 0.60)
   0.6.0

> Support user defined metrics reporter
> -
>
> Key: HUDI-1113
> URL: https://issues.apache.org/jira/browse/HUDI-1113
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Zheren Yu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Now the metrics reporter only supports Datadog, JMX, and Graphite; once users want to
> add their own metrics it becomes difficult (our team is using New Relic). Also, not
> everyone wants dependencies they don't need added to the Hudi components. So I suggest
> that a user-defined metrics reporter would make it easier to monitor metrics everywhere.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1113) Support user defined metrics reporter

2020-07-24 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-1113:

Status: Open  (was: New)

> Support user defined metrics reporter
> -
>
> Key: HUDI-1113
> URL: https://issues.apache.org/jira/browse/HUDI-1113
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Zheren Yu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Now the metrics reporter only supports Datadog, JMX, and Graphite; once users want to
> add their own metrics it becomes difficult (our team is using New Relic). Also, not
> everyone wants dependencies they don't need added to the Hudi components. So I suggest
> that a user-defined metrics reporter would make it easier to monitor metrics everywhere.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] xushiyan opened a new pull request #1871: [WIP] [HUDI-781] Introduce HoodieDataPrep for test preparation

2020-07-24 Thread GitBox


xushiyan opened a new pull request #1871:
URL: https://github.com/apache/hudi/pull/1871


   - Consolidate relevant util methods to `HoodieDataPrep`
   - Make `HoodieDataPrep` the sole class for creating hoodie data/metadata 
files for testing
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org