[GitHub] [incubator-hudi] bvaradar commented on issue #927: [HUDI-279] Fix regression in Schema Evolution due to PR-755

2019-09-25 Thread GitBox
bvaradar commented on issue #927: [HUDI-279] Fix regression in Schema Evolution due to PR-755
URL: https://github.com/apache/incubator-hudi/pull/927#issuecomment-535348602

   Added Jira: HUDI-284
   




[GitHub] [incubator-hudi] bvaradar merged pull request #927: [HUDI-279] Fix regression in Schema Evolution due to PR-755

2019-09-25 Thread GitBox
bvaradar merged pull request #927: [HUDI-279] Fix regression in Schema Evolution due to PR-755
URL: https://github.com/apache/incubator-hudi/pull/927
 
 
   




[incubator-hudi] branch master updated: [HUDI-279] Fix regression in Schema Evolution due to PR-755

2019-09-25 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 2ea8b0c  [HUDI-279] Fix regression in Schema Evolution due to PR-755
2ea8b0c is described below

commit 2ea8b0c3f1eeb19f4dc1e9946331c8fd93e6daab
Author: Balaji Varadarajan 
AuthorDate: Wed Sep 25 06:20:56 2019 -0700

[HUDI-279] Fix regression in Schema Evolution due to PR-755
---
 .../apache/hudi/client/embedded/EmbeddedTimelineService.java    |  2 +-
 .../test/java/org/apache/hudi/func/TestUpdateMapFunction.java   |  7 ++++---
 .../java/org/apache/hudi/common/SerializableConfiguration.java  |  8 ++++++--
 .../org/apache/hudi/common/table/HoodieTableMetaClient.java     |  2 +-
 .../apache/hudi/common/table/view/FileSystemViewManager.java    | 10 +++++-----
 .../java/org/apache/hudi/utilities/HoodieSnapshotCopier.java    |  4 ++--
 6 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/hudi-client/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java b/hudi-client/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java
index 4c6089c..46247c1 100644
--- a/hudi-client/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java
+++ b/hudi-client/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java
@@ -68,7 +68,7 @@ public class EmbeddedTimelineService {
   }
 
   public void startServer() throws IOException {
-    server = new TimelineService(0, viewManager, hadoopConf.get());
+    server = new TimelineService(0, viewManager, hadoopConf.newCopy());
     serverPort = server.startService();
     logger.info("Started embedded timeline server at " + hostAddr + ":" + serverPort);
   }
diff --git a/hudi-client/src/test/java/org/apache/hudi/func/TestUpdateMapFunction.java b/hudi-client/src/test/java/org/apache/hudi/func/TestUpdateMapFunction.java
index 74a908d..db986de 100644
--- a/hudi-client/src/test/java/org/apache/hudi/func/TestUpdateMapFunction.java
+++ b/hudi-client/src/test/java/org/apache/hudi/func/TestUpdateMapFunction.java
@@ -29,6 +29,7 @@ import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hudi.HoodieClientTestHarness;
 import org.apache.hudi.WriteStatus;
+import org.apache.hudi.common.SerializableConfiguration;
 import org.apache.hudi.common.TestRawTripPayload;
 import org.apache.hudi.common.model.HoodieKey;
 import org.apache.hudi.common.model.HoodieRecord;
@@ -118,9 +119,9 @@ public class TestUpdateMapFunction extends HoodieClientTestHarness {
 
     try {
       HoodieMergeHandle mergeHandle = new HoodieMergeHandle(config2, "101", table2, updateRecords.iterator(), fileId);
-      Configuration conf = new Configuration();
-      AvroReadSupport.setAvroReadSchema(conf, mergeHandle.getWriterSchema());
-      List<GenericRecord> oldRecords = ParquetUtils.readAvroRecords(conf,
+      SerializableConfiguration conf = new SerializableConfiguration(new Configuration());
+      AvroReadSupport.setAvroReadSchema(conf.get(), mergeHandle.getWriterSchema());
+      List<GenericRecord> oldRecords = ParquetUtils.readAvroRecords(conf.get(),
           new Path(config2.getBasePath() + "/" + insertResult.getStat().getPath()));
       for (GenericRecord rec : oldRecords) {
         mergeHandle.write(rec);
diff --git a/hudi-common/src/main/java/org/apache/hudi/common/SerializableConfiguration.java b/hudi-common/src/main/java/org/apache/hudi/common/SerializableConfiguration.java
index 0d5dc6c..8f6f0ba 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/SerializableConfiguration.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/SerializableConfiguration.java
@@ -33,13 +33,17 @@ public class SerializableConfiguration implements Serializable {
   }
 
   public SerializableConfiguration(SerializableConfiguration configuration) {
-    this.configuration = configuration.get();
+    this.configuration = configuration.newCopy();
   }
 
-  public Configuration get() {
+  public Configuration newCopy() {
     return new Configuration(configuration);
   }
 
+  public Configuration get() {
+    return configuration;
+  }
+
   private void writeObject(ObjectOutputStream out) throws IOException {
     out.defaultWriteObject();
     configuration.write(out);
diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java b/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
index 479db69..e0c30be 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
@@ -202,7 +202,7 @@ public class HoodieTableMetaClient implements Serializable {
    */
   public HoodieWrapperFileSystem getFs() {
     if (fs == null) {
-      FileSystem fileSystem
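The diff above is truncated. A minimal sketch of the semantics this commit restores (illustrative only, not code from the patch; the placeholder schema stands in for mergeHandle.getWriterSchema()): get() now returns the live, shared Configuration so callers' mutations stick, while newCopy() is the explicit way to obtain an independent copy.

import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.SerializableConfiguration;
import org.apache.parquet.avro.AvroReadSupport;

public class SerializableConfigurationSemantics {
  public static void main(String[] args) {
    SerializableConfiguration conf = new SerializableConfiguration(new Configuration());

    // Placeholder read schema, standing in for mergeHandle.getWriterSchema().
    Schema readSchema = Schema.create(Schema.Type.STRING);

    // get() returns the shared Configuration, so this mutation is visible to a
    // later ParquetUtils.readAvroRecords(conf.get(), ...) call. Before this
    // commit, get() returned a fresh copy and the schema was silently lost,
    // which is the schema-evolution regression being fixed.
    AvroReadSupport.setAvroReadSchema(conf.get(), readSchema);

    // newCopy() is the explicit escape hatch for an independent Configuration,
    // as now used when constructing the embedded TimelineService.
    assert conf.newCopy() != conf.get();
  }
}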

[jira] [Created] (HUDI-284) Need Tests for Hudi handling of schema evolution

2019-09-25 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-284:
---

 Summary: Need Tests for Hudi handling of schema evolution
 Key: HUDI-284
 URL: https://issues.apache.org/jira/browse/HUDI-284
 Project: Apache Hudi (incubating)
  Issue Type: Test
  Components: Common Core
Reporter: Balaji Varadarajan


Context in: https://github.com/apache/incubator-hudi/pull/927#pullrequestreview-293449514





Build failed in Jenkins: hudi-snapshot-deployment-0.5 #49

2019-09-25 Thread Apache Jenkins Server
See 


--
Started by timer
[EnvInject] - Loading node environment variables.
Building remotely on H40 (ubuntu xenial) in workspace 

No credentials specified
Wiping out workspace first.
Cloning the remote Git repository
Using shallow clone
Cloning repository https://github.com/apache/incubator-hudi.git
 > git init  # timeout=10
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Could not init 
  at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$5.execute(CliGitAPIImpl.java:813)
  at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$2.execute(CliGitAPIImpl.java:605)
  at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:153)
  at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:146)
  at hudson.remoting.UserRequest.perform(UserRequest.java:212)
  at hudson.remoting.UserRequest.perform(UserRequest.java:54)
  at hudson.remoting.Request$2.run(Request.java:369)
  at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:748)
  Suppressed: hudson.remoting.Channel$CallSiteStackTrace: Remote call to H40
    at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
    at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
    at hudson.remoting.Channel.call(Channel.java:955)
    at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.execute(RemoteGitImpl.java:146)
    at sun.reflect.GeneratedMethodAccessor1084.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.invoke(RemoteGitImpl.java:132)
    at com.sun.proxy.$Proxy135.execute(Unknown Source)
    at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1152)
    at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1192)
    at hudson.scm.SCM.checkout(SCM.java:504)
    at hudson.model.AbstractProject.checkout(AbstractProject.java:1208)
    at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:574)
    at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:499)
    at hudson.model.Run.execute(Run.java:1810)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
    at hudson.model.ResourceController.execute(ResourceController.java:97)
    at hudson.model.Executor.run(Executor.java:429)
Caused by: hudson.plugins.git.GitException: Error performing git command
  at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2051)
  at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2010)
  at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2006)
  at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:1638)
  at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$5.execute(CliGitAPIImpl.java:811)
  ... 11 more
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
  at java.lang.Thread.start0(Native Method)
  at java.lang.Thread.start(Thread.java:717)
  at hudson.Proc$LocalProc.<init>(Proc.java:281)
  at hudson.Proc$LocalProc.<init>(Proc.java:218)
  at hudson.Launcher$LocalLauncher.launch(Launcher.java:936)
  at hudson.Launcher$ProcStarter.start(Launcher.java:455)
  at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2038)
  ... 15 more
ERROR: Error cloning remote repo 'origin'
Retrying after 10 seconds
No credentials specified
Wiping out workspace first.
Cloning the remote Git repository
Using shallow clone
Cloning repository https://github.com/apache/incubator-hudi.git
 > git init  # timeout=10
ERROR: Error cloning remote repo 'or

[GitHub] [incubator-hudi] yanghua commented on issue #923: HUDI-247 Unify the initialization of HoodieTableMetaClient in test for hoodie-client module

2019-09-25 Thread GitBox
yanghua commented on issue #923: HUDI-247 Unify the initialization of HoodieTableMetaClient in test for hoodie-client module
URL: https://github.com/apache/incubator-hudi/pull/923#issuecomment-535315672
 
 
   @vinothchandar I have two motivations for refactoring the `new HoodieTableMetaClient()` style code:
   
   * `HoodieTableMetaClient` uses a Hadoop FileSystem, and the original implementation never explicitly released it (never called `fs.close()`); it just waited for GC to reclaim these objects. That is a bad pattern: we should release expensive resources as soon as possible.
   * Reusing an existing object is the better approach. Our essential purpose is to clear and re-initialize the inner state of a Hudi table meta client. Reusing the client object gives better performance: fewer object allocations and less GC pressure. A `reloadMetaClient` or `reInitMetaClient` method is also more expressive in context than re-initializing an instance of `HoodieTableMetaClient`. In fact, when I first saw this code style it struck me as strange: why do we re-initialize the object so many times? Are those calls redundant or unnecessary? That was my first impression.
   
   About your concern, IMHO we need not worry. If developers remember to re-initialize an object, they should equally remember to invoke the `reloadMetaClient` method for the same purpose. They know their motivation; they only need to know there is another method that replaces the old style and, at the same time, releases the resource promptly and correctly. (A rough sketch of the idea follows below.)
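
   A rough sketch of that idea (the names `MetaClientHolder` and `reloadMetaClient` are hypothetical, not the code in this PR):

import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.table.HoodieTableMetaClient;

public class MetaClientHolder {
  private HoodieTableMetaClient metaClient;

  // One well-named entry point that rebuilds the meta client's state, instead
  // of scattering bare `new HoodieTableMetaClient(...)` calls through tests.
  public HoodieTableMetaClient reloadMetaClient(Configuration hadoopConf, String basePath) {
    // A real implementation would also close the previous client's FileSystem
    // handle here (the fs.close() concern above) before re-reading state, and
    // could reset fields in place rather than allocate a new instance.
    this.metaClient = new HoodieTableMetaClient(hadoopConf, basePath);
    return this.metaClient;
  }
}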




[jira] [Closed] (HUDI-256) Translate Comparison page

2019-09-25 Thread leesf (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

leesf closed HUDI-256.
--
Resolution: Fixed

Fixed via asf-site: cef57691228de09429cd8794117dee6fc8f729d2

> Translate Comparison page
> -
>
> Key: HUDI-256
> URL: https://issues.apache.org/jira/browse/HUDI-256
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: docs-chinese
>Reporter: leesf
>Assignee: leesf
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The online HTML web page: [https://hudi.apache.org/comparison.html]





[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #923: HUDI-247 Unify the initialization of HoodieTableMetaClient in test for hoodie-client module

2019-09-25 Thread GitBox
vinothchandar commented on a change in pull request #923: HUDI-247 Unify the initialization of HoodieTableMetaClient in test for hoodie-client module
URL: https://github.com/apache/incubator-hudi/pull/923#discussion_r328408448
 
 

 ##
 File path: hudi-client/src/test/java/org/apache/hudi/index/bloom/TestHoodieBloomIndex.java
 ##
 @@ -353,8 +352,8 @@ public void testTagLocation() throws Exception {
     HoodieClientTestUtils.writeParquetFile(basePath, "2015/01/31", Arrays.asList(record4), schema, null, true);
 
     // We do the tag again
-    metadata = new HoodieTableMetaClient(jsc.hadoopConfiguration(), basePath);
 
 Review comment:
   I kind of prefer to have it be a new instance, instead of introducing a reload method.. those things tend to get tricky over time (someone forgets to reset one variable, etc.).. wdyt?




[GitHub] [incubator-hudi] vinothchandar merged pull request #924: [minor][docs-chinese] Improve translation

2019-09-25 Thread GitBox
vinothchandar merged pull request #924: [minor][docs-chinese] Improve translation
URL: https://github.com/apache/incubator-hudi/pull/924
 
 
   




[incubator-hudi] branch asf-site updated: [minor][docs-chinese] Improve translation (#924)

2019-09-25 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 1ff036c  [minor][docs-chinese] Improve translation (#924)
1ff036c is described below

commit 1ff036c4264d5a7911419eecce6bc53fc3d13cca
Author: leesf <490081...@qq.com>
AuthorDate: Thu Sep 26 10:01:34 2019 +0800

[minor][docs-chinese] Improve translation (#924)
---
 docs/powered_by.cn.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/powered_by.cn.md b/docs/powered_by.cn.md
index faf5f2f..11930ee 100644
--- a/docs/powered_by.cn.md
+++ b/docs/powered_by.cn.md
@@ -11,12 +11,12 @@ toc: false
  Uber
 
 
Hudi最初由[Uber](https://uber.com)开发,用于实现[低延迟、高效率的数据库摄取](http://www.slideshare.net/vinothchandar/hadoop-strata-talk-uber-your-hadoop-has-arrived/32)。
-Hudi自2016年8月开始在生产环境上线,在Hadoop上驱动约100个非常关键的业务表,支撑约100亿TB的数据规模(前10名包括旅行,乘客,合作伙伴)。
+Hudi自2016年8月开始在生产环境上线,在Hadoop上驱动约100个非常关键的业务表,支撑约几百TB的数据规模(前10名包括行程、乘客、司机)。
 Hudi还支持几个增量的Hive ETL管道,并且目前已集成到Uber的数据分发系统中。
 
  EMIS Health
 
-[EMIS Health](https://www.emishealth.com/)是英国最大的初级保健IT软件提供商,其数据集包括超过500亿的医疗保健记录。HUDI用于管理生产中的分析数据集,并使其与上游源保持同步。Presto用于查询以HUDI格式写入的数据。
+[EMIS Health](https://www.emishealth.com/)是英国最大的初级保健IT软件提供商,其数据集包括超过5000亿的医疗保健记录。HUDI用于管理生产中的分析数据集,并使其与上游源保持同步。Presto用于查询以HUDI格式写入的数据。
 
  Yields.io
 



[incubator-hudi] branch asf-site updated: [HUDI-256] Translate Comparison page (#925)

2019-09-25 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new cef5769  [HUDI-256] Translate Comparison page (#925)
cef5769 is described below

commit cef57691228de09429cd8794117dee6fc8f729d2
Author: leesf <490081...@qq.com>
AuthorDate: Thu Sep 26 10:00:59 2019 +0800

[HUDI-256] Translate Comparison page (#925)

* [HUDI-256] Translate Comparison page

* [hotfix] address comments
---
 docs/comparison.cn.md | 71 ++-
 1 file changed, 31 insertions(+), 40 deletions(-)

diff --git a/docs/comparison.cn.md b/docs/comparison.cn.md
index a606c94..bb971b2 100644
--- a/docs/comparison.cn.md
+++ b/docs/comparison.cn.md
@@ -6,53 +6,44 @@ permalink: comparison.html
 toc: false
 ---
 
-Apache Hudi fills a big void for processing data on top of DFS, and thus mostly co-exists nicely with these technologies. However,
-it would be useful to understand how Hudi fits into the current big data ecosystem, contrasting it with a few related systems
-and bring out the different tradeoffs these systems have accepted in their design.
+Apache Hudi填补了在DFS上处理数据的巨大空白,并可以和这些技术很好地共存。然而,
+通过将Hudi与一些相关系统进行对比,来了解Hudi如何适应当前的大数据生态系统,并知晓这些系统在设计中做的不同权衡仍将非常有用。
 
 ## Kudu
 
-[Apache Kudu](https://kudu.apache.org) is a storage system that has similar goals as Hudi, which is to bring real-time analytics on petabytes of data via first
-class support for `upserts`. A key differentiator is that Kudu also attempts to serve as a datastore for OLTP workloads, something that Hudi does not aspire to be.
-Consequently, Kudu does not support incremental pulling (as of early 2017), something Hudi does to enable incremental processing use cases.
+[Apache Kudu](https://kudu.apache.org)是一个与Hudi具有相似目标的存储系统,该系统通过对`upserts`支持来对PB级数据进行实时分析。
+一个关键的区别是Kudu还试图充当OLTP工作负载的数据存储,而Hudi并不希望这样做。
+因此,Kudu不支持增量拉取(截至2017年初),而Hudi支持以便进行增量处理。
 
+Kudu与分布式文件系统抽象和HDFS完全不同,它自己的一组存储服务器通过RAFT相互通信。
+与之不同的是,Hudi旨在与底层Hadoop兼容的文件系统(HDFS,S3或Ceph)一起使用,并且没有自己的存储服务器群,而是依靠Apache Spark来完成繁重的工作。
+因此,Hudi可以像其他Spark作业一样轻松扩展,而Kudu则需要硬件和运营支持,特别是HBase或Vertica等数据存储系统。
+到目前为止,我们还没有做任何直接的基准测试来比较Kudu和Hudi(鉴于RTTable正在进行中)。
+但是,如果我们要使用[CERN](https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines),
+我们预期Hudi在摄取parquet上有更卓越的性能。
 
-Kudu diverges from a distributed file system abstraction and HDFS altogether, with its own set of storage servers talking to each  other via RAFT.
-Hudi, on the other hand, is designed to work with an underlying Hadoop compatible filesystem (HDFS,S3 or Ceph) and does not have its own fleet of storage servers,
-instead relying on Apache Spark to do the heavy-lifting. Thu, Hudi can be scaled easily, just like other Spark jobs, while Kudu would require hardware
-& operational support, typical to datastores like HBase or Vertica. We have not at this point, done any head to head benchmarks against Kudu (given RTTable is WIP).
-But, if we were to go with results shared by [CERN](https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines) ,
-we expect Hudi to positioned at something that ingests parquet with superior performance.
+## Hive事务
 
+[Hive事务/ACID](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions)是另一项类似的工作,它试图实现在ORC文件格式之上的存储`读取时合并`。
+可以理解,此功能与Hive以及[LLAP](https://cwiki.apache.org/confluence/display/Hive/LLAP)之类的其他工作紧密相关。
+Hive事务不提供Hudi提供的读取优化存储选项或增量拉取。
+在实现选择方面,Hudi充分利用了类似Spark的处理框架的功能,而Hive事务特性则在用户或Hive Metastore启动的Hive任务/查询的下实现。
+根据我们的生产经验,与其他方法相比,将Hudi作为库嵌入到现有的Spark管道中要容易得多,并且操作不会太繁琐。
+Hudi还设计用于与Presto/Spark等非Hive引擎合作,并计划引入除parquet以外的文件格式。
 
-## Hive Transactions
+## HBase
 
-[Hive Transactions/ACID](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions) is another similar effort, which tries to implement storage like
-`merge-on-read`, on top of ORC file format. Understandably, this feature is heavily tied to Hive and other efforts like [LLAP](https://cwiki.apache.org/confluence/display/Hive/LLAP).
-Hive transactions does not offer the read-optimized storage option or the incremental pulling, that Hudi does. In terms of implementation choices, Hudi leverages
-the full power of a processing framework like Spark, while Hive transactions feature is implemented underneath by Hive tasks/queries kicked off by user or the Hive metastore.
-Based on our production experience, embedding Hudi as a library into existing Spark pipelines was much easier and less operationally heavy, compared with the other approach.
-Hudi is also designed to work with non-hive enginers like Presto/Spark and will incorporate file formats other than parquet over time.
+尽管[HBase](https://hbase.apache.org)最终是OLTP工作负载的键值存储层,但由于与Hadoop的相似性,用户通常倾向于将HBase与分析相

[GitHub] [incubator-hudi] vinothchandar merged pull request #925: [HUDI-256] Translate Comparison page

2019-09-25 Thread GitBox
vinothchandar merged pull request #925: [HUDI-256] Translate Comparison page
URL: https://github.com/apache/incubator-hudi/pull/925
 
 
   




[GitHub] [incubator-hudi] leesf commented on issue #857: http://hudi.apache.org/comparison.html# should mention Iceberg and DeltaLake

2019-09-25 Thread GitBox
leesf commented on issue #857: http://hudi.apache.org/comparison.html# should mention Iceberg and DeltaLake
URL: https://github.com/apache/incubator-hudi/issues/857#issuecomment-535286289
 
 
   Paste a pic ![pic](http://cdn.qubole.com/wp-content/uploads/2019/09/Hive-ACID-selection-table.png) from the [post](https://www.qubole.com/blog/qubole-open-sources-multi-engine-support-for-updates-and-deletes-in-data-lakes/)




[jira] [Created] (HUDI-283) Look at spark-shell and ensure that auto-tune for memory for spillable map has sane defaults

2019-09-25 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-283:


 Summary: Look at spark-shell and ensure that auto-tune for memory for spillable map has sane defaults
 Key: HUDI-283
 URL: https://issues.apache.org/jira/browse/HUDI-283
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Common Core
Reporter: Nishith Agarwal
Assignee: Nishith Agarwal








[jira] [Commented] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs

2019-09-25 Thread Xing Pan (Jira)


[ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938163#comment-16938163 ]

Xing Pan commented on HUDI-269:
---

I tried to run the same hudi app via hudi spark datasource writer:

 
{code:java}
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", KAFKA_SERVER)
  .option("subscribe", DEMO_11_TOPIC)
  .load()
  .select(from_confluent_avro(col("value"), SCHEMA_REGISTRY_CONF) as 'data).select("data.*")
  .writeStream.format("org.apache.hudi")
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, tableType)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "dateStr")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts")
  .option(HoodieWriteConfig.TABLE_NAME, DEMO_11_TABLE_NAME)
  .option("checkpointLocation", checkpointPath)

  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, DEMO_11_TABLE_NAME)
  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "default")
  .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, HIVE_URL)

  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "dateStr")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, classOf[SlashEncodedDayPartitionValueExtractor].getCanonicalName)
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")

  .outputMode(OutputMode.Append)
  .trigger(Trigger.ProcessingTime(5000))
  .start(outputPath)
  .awaitTermination()
{code}
 

 
{code:java}
spark-submit --class xxx.HudiSpark \
--jars \
xxx/hudi-spark-bundle-0.5.1-SNAPSHOT.jar,\
xxx/abris_2.11-3.0.1.jar,\
xxx/common-utils-5.3.0.jar,xxx/kafka-schema-registry-client-5.3.0.jar,xxx/kafka-avro-serializer-5.3.0.jar,xxx/common-config-5.3.0.jar \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3,org.apache.spark:spark-avro_2.11:2.4.3 \
--conf spark.hadoop.fs.s3a.endpoint=s3-ap-east-1.amazonaws.com \
--conf spark.dynamicAllocation.executorIdleTimeout=10s \
--conf hoodie.embed.timeline.server=true \
--conf hoodie.filesystem.view.incr.timeline.sync.enable=true \
--conf hoodie.upsert.shuffle.parallelism=2 \
--executor-memory 1g \
my_test.jar
{code}
 

and pushed 300 records every second; the S3 request count is fairly low:

!image-2019-09-26-09-02-24-761.png!

I am not quite sure about the difference between the datasource writer and the delta streamer. As far as I can tell, when no data is coming in, the request count is about the same; but when I push some records every second, the *datasource writer* incurs roughly 10 times fewer requests than the delta streamer.
[~vinoth]

 

> Provide ability to throttle DeltaStreamer sync runs
> ---
>
> Key: HUDI-269
> URL: https://issues.apache.org/jira/browse/HUDI-269
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Balaji Varadarajan
>Assignee: Xing Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
> Attachments: hudi_request_test.tar.gz, 
> image-2019-09-25-08-51-19-686.png, image-2019-09-26-09-02-24-761.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Copied from [https://github.com/apache/incubator-hudi/issues/922]
> In some scenarios in our cluster, we may want the delta streamer to slow down a bit,
> so it would be nice to have a parameter to control the minimum sync interval of each sync in continuous mode.
> This param defaults to 0, so it does not affect current logic.
> Minor PR: [#921|https://github.com/apache/incubator-hudi/pull/921]
> The main reason we want to slow it down is that AWS S3 charges by S3 GET/PUT/LIST requests; we don't want to pay for too many requests for a really slow-changing table.
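
A minimal sketch of the throttle this ticket describes (illustrative only; the names below are placeholders, not the actual DeltaStreamer fields): each continuous-mode round sleeps off whatever remains of the minimum interval, so the default of 0 leaves current behavior untouched.

{code:java}
public class SyncThrottleSketch {
  static final long MIN_SYNC_INTERVAL_MS = 0; // default 0 = no throttling (old behavior)

  public static void main(String[] args) throws InterruptedException {
    while (true) {
      long start = System.currentTimeMillis();
      runOneSync(); // one ingest round
      long elapsed = System.currentTimeMillis() - start;
      if (MIN_SYNC_INTERVAL_MS > elapsed) {
        // Fewer rounds per hour means fewer S3 GET/PUT/LIST requests to pay for.
        Thread.sleep(MIN_SYNC_INTERVAL_MS - elapsed);
      }
    }
  }

  static void runOneSync() { /* ingest one batch; stubbed for the sketch */ }
}
{code}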





[jira] [Updated] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs

2019-09-25 Thread Xing Pan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xing Pan updated HUDI-269:
--
Attachment: image-2019-09-26-09-02-24-761.png

> Provide ability to throttle DeltaStreamer sync runs
> ---
>
> Key: HUDI-269
> URL: https://issues.apache.org/jira/browse/HUDI-269
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Balaji Varadarajan
>Assignee: Xing Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
> Attachments: hudi_request_test.tar.gz, 
> image-2019-09-25-08-51-19-686.png, image-2019-09-26-09-02-24-761.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Copied from [https://github.com/apache/incubator-hudi/issues/922]
> In some scenarios in our cluster, we may want the delta streamer to slow down a bit,
> so it would be nice to have a parameter to control the minimum sync interval of each sync in continuous mode.
> This param defaults to 0, so it does not affect current logic.
> Minor PR: [#921|https://github.com/apache/incubator-hudi/pull/921]
> The main reason we want to slow it down is that AWS S3 charges by S3 GET/PUT/LIST requests; we don't want to pay for too many requests for a really slow-changing table.





[jira] [Updated] (HUDI-281) HiveSync failure through Spark when useJdbc is set to false

2019-09-25 Thread Udit Mehrotra (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Udit Mehrotra updated HUDI-281:
---
Description: 
Table creation with Hive sync through Spark fails when I set *useJdbc* to *false*. Currently I had to modify the code to set *useJdbc* to *false*, as there is no *DataSourceOption* through which I can specify this field when running Hudi code.

Here is the failure:
{noformat}
java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.session.SessionState.start(Lorg/apache/hudi/org/apache/hadoop_hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/session/SessionState;
  at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:527)
  at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:517)
  at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:507)
  at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:272)
  at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:132)
  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:96)
  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68)
  at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
  at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229){noformat}
I was expecting this to fail through Spark, because *hive-exec* is not shaded inside *hudi-spark-bundle*, while *HiveConf* is shaded and relocated. This *SessionState* comes from the spark-hive jar, and it obviously does not accept the relocated *HiveConf*.

We in *EMR* are running into the same problem when trying to integrate with the Glue Catalog. For this we have to create the Hive metastore client through *Hive.get(conf).getMsc()* instead of how it is being done now, so that alternate implementations of the metastore can be created. However, because hive-exec is not shaded but HiveConf is relocated, we run into the same issues there.

It would not be recommended to shade *hive-exec* either, because it is itself an uber jar that shades a lot of things, and all of them would end up in the *hudi-spark-bundle* jar. We would not want to go down that route. That is why we would suggest considering removing any shading of Hive libraries.

We can add a *Maven Profile* to shade, but that means it has to be activated by default; otherwise the default will fail when *useJdbc* is set to false, and again later when we commit the *Glue Catalog* changes.
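
A small diagnostic sketch (mine, not Hudi code) that makes the clash concrete: spark-hive's *SessionState.start* is compiled against *org.apache.hadoop.hive.conf.HiveConf*, while the bundle supplies the relocated *org.apache.hudi.org.apache.hadoop_hive.conf.HiveConf* seen in the stack trace, so checking which of the two is on the classpath pins down the mismatch.

{code:java}
public class HiveConfRelocationCheck {
  public static void main(String[] args) {
    String[] candidates = {
        "org.apache.hadoop.hive.conf.HiveConf",                 // what spark-hive expects
        "org.apache.hudi.org.apache.hadoop_hive.conf.HiveConf"  // relocated copy in the bundle
    };
    for (String name : candidates) {
      try {
        Class.forName(name);
        System.out.println("present: " + name);
      } catch (ClassNotFoundException e) {
        System.out.println("absent:  " + name);
      }
    }
  }
}
{code}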

  was:
Table creation with Hive sync through Spark fails when I set *useJdbc* to *false*. Currently I had to modify the code to set *useJdbc* to *false*, as there is no *DataSourceOption* through which I can specify this field when running Hudi code.

Here is the failure:
{noformat}
java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.session.SessionState.start(Lorg/apache/hudi/org/apache/hado

[jira] [Created] (HUDI-282) Update documentation to reflect additional option of HiveSync via metastore

2019-09-25 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-282:


 Summary: Update documentation to reflect additional option of HiveSync via metastore
 Key: HUDI-282
 URL: https://issues.apache.org/jira/browse/HUDI-282
 Project: Apache Hudi (incubating)
  Issue Type: Task
  Components: Hive Integration
Reporter: Nishith Agarwal
Assignee: Nishith Agarwal








[jira] [Comment Edited] (HUDI-180) Move HiveSyncTool registration from hive server (jdbc) to metastore (thrift)

2019-09-25 Thread Nishith Agarwal (Jira)


[ https://issues.apache.org/jira/browse/HUDI-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938144#comment-16938144 ]

Nishith Agarwal edited comment on HUDI-180 at 9/26/19 12:10 AM:


As of now, we did not shed any dependencies, but provided support for folks to use the metastore API in case a secure (kerberos) cluster setup is used.

In the long term, we need to make all the dependencies non-shaded so they can be pulled in at runtime.

The ticket has been merged; closing it now.


was (Author: nishith29):
As of now, we did not shed any dependencies but provided support for folks to 
use metastore API in case a secure (kerberos) type of cluster setup is used.

The ticket has been merged, closing it now.

> Move HiveSyncTool registration from hive server (jdbc) to metastore (thrift)
> 
>
> Key: HUDI-180
> URL: https://issues.apache.org/jira/browse/HUDI-180
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Hive Integration
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, the HiveSyncTool takes in the JDBC URL of the hive server as an 
> argument to be able to register hoodie datasets with the hive metastore. 
> One of the problems faced was when using HiveSyncTool in a secure HDFS 
> cluster environment using kerberos. The current implementation of JDBC does 
> not allow for registration in such an environment. The implementation can be 
> changed to support that but the consensus internally in our company has been 
> to move to metastore.
> This ticket is to propose this change. Let's discuss on this ticket and I can 
> follow this up with a PR.
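
A hedged sketch of the thrift-based registration this ticket proposes (illustrative only; the class and table names below are hypothetical, and the actual HiveSyncTool change may differ): instead of opening a JDBC connection to HiveServer2, talk to the metastore directly with Hive's client API, which also works on kerberos-secured clusters.

{code:java}
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;

public class MetastoreSyncSketch {
  public static void main(String[] args) throws Exception {
    HiveConf hiveConf = new HiveConf(); // picks up hive-site.xml, incl. kerberos settings
    IMetaStoreClient client = new HiveMetaStoreClient(hiveConf);
    try {
      // Registration checks go through the metastore's thrift API; no JDBC URL needed.
      boolean exists = client.tableExists("default", "hoodie_table"); // hypothetical names
      System.out.println("table exists: " + exists);
    } finally {
      client.close();
    }
  }
}
{code}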





[jira] [Updated] (HUDI-180) Move HiveSyncTool registration from hive server (jdbc) to metastore (thrift)

2019-09-25 Thread Nishith Agarwal (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nishith Agarwal updated HUDI-180:
-
Status: Closed  (was: Patch Available)

> Move HiveSyncTool registration from hive server (jdbc) to metastore (thrift)
> 
>
> Key: HUDI-180
> URL: https://issues.apache.org/jira/browse/HUDI-180
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Hive Integration
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, the HiveSyncTool takes in the JDBC URL of the hive server as an 
> argument to be able to register hoodie datasets with the hive metastore. 
> One of the problems faced was when using HiveSyncTool in a secure HDFS 
> cluster environment using kerberos. The current implementation of JDBC does 
> not allow for registration in such an environment. The implementation can be 
> changed to support that but the consensus internally in our company has been 
> to move to metastore.
> This ticket is to propose this change. Let's discuss on this ticket and I can 
> follow this up with a PR.





[jira] [Commented] (HUDI-180) Move HiveSyncTool registration from hive server (jdbc) to metastore (thrift)

2019-09-25 Thread Nishith Agarwal (Jira)


[ https://issues.apache.org/jira/browse/HUDI-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938144#comment-16938144 ]

Nishith Agarwal commented on HUDI-180:
--

As of now, we did not shed any dependencies, but provided support for folks to use the metastore API in case a secure (kerberos) cluster setup is used.

The ticket has been merged; closing it now.

> Move HiveSyncTool registration from hive server (jdbc) to metastore (thrift)
> 
>
> Key: HUDI-180
> URL: https://issues.apache.org/jira/browse/HUDI-180
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Hive Integration
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, the HiveSyncTool takes in the JDBC URL of the hive server as an 
> argument to be able to register hoodie datasets with the hive metastore. 
> One of the problems faced was when using HiveSyncTool in a secure HDFS 
> cluster environment using kerberos. The current implementation of JDBC does 
> not allow for registration in such an environment. The implementation can be 
> changed to support that but the consensus internally in our company has been 
> to move to metastore.
> This ticket is to propose this change. Let's discuss on this ticket and I can 
> follow this up with a PR.





[jira] [Created] (HUDI-281) HiveSync failure through Spark when useJdbc is set to false

2019-09-25 Thread Udit Mehrotra (Jira)
Udit Mehrotra created HUDI-281:
--

 Summary: HiveSync failure through Spark when useJdbc is set to false
 Key: HUDI-281
 URL: https://issues.apache.org/jira/browse/HUDI-281
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Hive Integration, Spark datasource
Reporter: Udit Mehrotra


Table creation with Hive sync through Spark fails when I set *useJdbc* to *false*. Currently I had to modify the code to set *useJdbc* to *false*, as there is no *DataSourceOption* through which I can specify this field when running Hudi code.

Here is the failure:
{noformat}
java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.session.SessionState.start(Lorg/apache/hudi/org/apache/hadoop_hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/session/SessionState;
  at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:527)
  at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:517)
  at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:507)
  at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:272)
  at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:132)
  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:96)
  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68)
  at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
  at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229){noformat}
I was expecting this to fail through Spark, because *hive-exec* is not shaded inside *hudi-spark-bundle*, while *HiveConf* is shaded and relocated. This *SessionState* comes from the spark-hive jar, and it obviously does not accept the relocated *HiveConf*.

We in *EMR* are running into the same problem when trying to integrate with the Glue Catalog. For this we have to create the Hive metastore client through *Hive.get(conf).getMsc()* instead of how it is being done now, so that alternate implementations of the metastore can be created. However, because hive-exec is not shaded but HiveConf is relocated, we run into the same issues there.

It would not be recommended to shade *hive-exec* either, because it is itself an uber jar that shades a lot of things, and all of them would end up in the *hudi-spark-bundle* jar. We would not want to go down that route. That is why we would suggest considering removing any shading of Hive libraries.

We can add a *Maven Profile* to shade, but that means it has to be activated by default; otherwise the default will fail when *useJdbc* is set to false, and again later when we commit the *Glue Catalog* changes.





[GitHub] [incubator-hudi] bvaradar commented on issue #927: [HUDI-279] Fix regression in Schema Evolution due to PR-755

2019-09-25 Thread GitBox
bvaradar commented on issue #927: [HUDI-279] Fix regression in Schema Evolution due to PR-755
URL: https://github.com/apache/incubator-hudi/pull/927#issuecomment-535269126
 
 
   @n3nash @vinothchandar : Please review. I ran a long-running job with deltastreamer and did not see any race conditions.




[GitHub] [incubator-hudi] leesf commented on issue #925: [HUDI-256] Translate Comparison page

2019-09-25 Thread GitBox
leesf commented on issue #925: [HUDI-256] Translate Comparison page
URL: https://github.com/apache/incubator-hudi/pull/925#issuecomment-535263384
 
 
   @yihua Thanks for your careful review. Updated the PR to address your comments.




[GitHub] [incubator-hudi] leesf commented on issue #924: [minor][docs-chinese] Improve translation

2019-09-25 Thread GitBox
leesf commented on issue #924: [minor][docs-chinese] Improve translation
URL: https://github.com/apache/incubator-hudi/pull/924#issuecomment-535259250
 
 
   @vinothchandar Could you please merge this PR when you are free? Thanks.




[GitHub] [incubator-hudi] umehrot2 commented on issue #915: [HUDI-268] Shade and relocate Avro dependency in hadoop-mr-bundle

2019-09-25 Thread GitBox
umehrot2 commented on issue #915: [HUDI-268] Shade and relocate Avro dependency in hadoop-mr-bundle
URL: https://github.com/apache/incubator-hudi/pull/915#issuecomment-535251457
 
 
   @vinothchandar @bvaradar updated the PR




[GitHub] [incubator-hudi] yihua commented on a change in pull request #925: [HUDI-256] Translate Comparison page

2019-09-25 Thread GitBox
yihua commented on a change in pull request #925: [HUDI-256] Translate Comparison page
URL: https://github.com/apache/incubator-hudi/pull/925#discussion_r328337037
 
 

 ##
 File path: docs/comparison.cn.md
 ##
 @@ -6,53 +6,45 @@ permalink: comparison.html
 toc: false
 ---
 
+Kudu与分布式文件系统抽象和HDFS完全不同,它自己的一组存储服务器通过RAFT相互通信。
+另一方面,Hudi旨在与底层Hadoop兼容文件系统(HDFS,S3或Ceph)一起使用,并且没有自己的存储服务器群,而是依靠Apache Spark来完成繁重的工作。

 Review comment:
   “另一方面” => “与之不同的是”
   “与底层Hadoop兼容文件系统” => “与底层Hadoop兼容的文件系统”




[GitHub] [incubator-hudi] yihua commented on a change in pull request #925: [HUDI-256] Translate Comparison page

2019-09-25 Thread GitBox
yihua commented on a change in pull request #925: [HUDI-256] Translate Comparison page
URL: https://github.com/apache/incubator-hudi/pull/925#discussion_r328363112
 
 

 ##
 File path: docs/comparison.cn.md
 ##
 @@ -6,53 +6,45 @@ permalink: comparison.html
 toc: false
 ---
 
+尽管[HBase](https://hbase.apache.org)最终是OLTP工作负载的键值存储层,但由于与Hadoop的相似性,用户通常倾向于将HBase与分析相关联。
+鉴于HBase经过严格的写优化,它支持开箱即用的亚秒级更新,Hive-on-HBase允许用户查询该数据。 但是,就分析工作负载的实际性能而言,Parquet/ORC之类的混合列式存储格式可以轻松击败HBase,因为这些工作负载主要是读取繁重的工作。
+Hudi弥补了更快的数据与分析存储格式之间的差距。从操作的角度来看,与管理分析使用的HBase region服务器集群相比,为用户提供可提供更快数据的库更具可扩展性。
 
 Review comment:
   ”操作的角度“ => “运营的角度”
   “为用户提供可提供更快数据的库” => “为用户提供可更快给出数据的库”




[GitHub] [incubator-hudi] yihua commented on a change in pull request #925: [HUDI-256] Translate Comparison page

2019-09-25 Thread GitBox
yihua commented on a change in pull request #925: [HUDI-256] Translate Comparison page
URL: https://github.com/apache/incubator-hudi/pull/925#discussion_r328342066
 
 

 ##
 File path: docs/comparison.cn.md
 ##
 @@ -6,53 +6,45 @@ permalink: comparison.html
 toc: false
 ---
 
+到目前为止,我们还没有针对Kudu做任何正面的基准测试(鉴于RTTable正在进行中)。
+但是,如果我们要使用[CERN](https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and -存储引擎),
+我们希望Hudi定位于能吸纳parquet的卓越性能。
 
 Review comment:
  “我们希望Hudi定位于能吸纳parquet的卓越性能” => “我们预期Hudi在摄取parquet上有更卓越的性能” (i.e., "we expect Hudi to ingest parquet with superior performance", not "absorb parquet's superior performance")




[GitHub] [incubator-hudi] yihua commented on a change in pull request #925: [HUDI-256] Translate Comparison page

2019-09-25 Thread GitBox
yihua commented on a change in pull request #925: [HUDI-256] Translate 
Comparison page
URL: https://github.com/apache/incubator-hudi/pull/925#discussion_r328363986
 
 

 ##
 File path: docs/comparison.cn.md
 ##
 @@ -6,53 +6,45 @@ permalink: comparison.html
 toc: false
 ---
 
 
-Kudu diverges from a distributed file system abstraction and HDFS altogether, with its own set of storage servers talking to each other via RAFT.
-Hudi, on the other hand, is designed to work with an underlying Hadoop compatible filesystem (HDFS, S3 or Ceph) and does not have its own fleet of storage servers,
-instead relying on Apache Spark to do the heavy-lifting. Thus, Hudi can be scaled easily, just like other Spark jobs, while Kudu would require hardware
-& operational support, typical to datastores like HBase or Vertica. We have not, at this point, done any head-to-head benchmarks against Kudu (given RTTable is WIP).
-But, if we were to go with results shared by [CERN](https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines),
-we expect Hudi to be positioned as something that ingests parquet with superior performance.
+## Hive事务
 
+[Hive事务/ACID](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions)是另一项类似的工作,它试图实现像在ORC文件格式之上的`读取时合并`。
+可以理解,此功能与Hive以及[LLAP](https://cwiki.apache.org/confluence/display/Hive/LLAP)之类的其他工作紧密相关。
+Hive事务不提供Hudi提供的读取优化存储选项或增量拉取。
+在实现选择方面,Hudi充分利用了类似Spark的处理框架的功能,而Hive事务特性则在用户或Hive Metastore启动的Hive任务/查询的下实现。
+根据我们的生产经验,与其他方法相比,将Hudi作为库嵌入到现有的Spark管道中要容易得多,并且操作不会太繁琐。
+Hudi还设计用于与Presto/Spark等非Hive引擎合作,并将随着时间的推移合并除parquet以外的文件格式。
 
-## Hive Transactions
+## HBase
 
-[Hive Transactions/ACID](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions) is another similar effort, which tries to implement storage like
-`merge-on-read`, on top of ORC file format. Understandably, this feature is heavily tied to Hive and other efforts like [LLAP](https://cwiki.apache.org/confluence/display/Hive/LLAP).
-Hive transactions do not offer the read-optimized storage option or the incremental pulling that Hudi does. In terms of implementation choices, Hudi leverages
-the full power of a processing framework like Spark, while the Hive transactions feature is implemented underneath by Hive tasks/queries kicked off by the user or the Hive metastore.
-Based on our production experience, embedding Hudi as a library into existing Spark pipelines was much easier and less operationally heavy, compared with the other approach.
-Hudi is also designed to work with non-Hive engines like Presto/Spark and will incorporate file formats other than parquet over time.
+尽管[HBase](https://hbase.apache.org)最终是OLTP工作负载的键值存储层,但由于与Hadoop的相似性,用户通常倾向于将HBase与分析相关联。
+鉴于HBase经过严格的写优化,它支持开箱即用的亚秒级更新,Hive-on-HBase允许用户查询该数据。但是,就分析工作负载的实际性能而言,Parquet/ORC之类的混合列式存储格式可以轻松击败HBase,因为这些工作负载主要是读取繁重的工作。
+Hudi弥补了更快的数据与分析存储格式之间的差距。从操作的角度来看,与管理分析使用的HBase region服务器集群相比,为用户提供可提供更快数据的库更具可扩展性。
+最终,HBase不像Hudi这样支持把`提交时间`、`增量拉取`之类的增量处理原语作为头等公民。
 
 Review comment:
  “支持把`提交时间`、`增量拉取`之类的增量处理原语作为头等公民” => “重点支持`提交时间`、`增量拉取`之类的增量处理原语” (i.e., drop the awkward literal "as first-class citizens" in favor of "focuses on supporting incremental processing primitives such as `commit time` and `incremental pull`")



[GitHub] [incubator-hudi] yihua commented on a change in pull request #925: [HUDI-256] Translate Comparison page

2019-09-25 Thread GitBox
yihua commented on a change in pull request #925: [HUDI-256] Translate 
Comparison page
URL: https://github.com/apache/incubator-hudi/pull/925#discussion_r328334755
 
 

 ##
 File path: docs/comparison.cn.md
 ##
 @@ -6,53 +6,45 @@ permalink: comparison.html
 toc: false
 ---
 
+Apache Hudi填补了在DFS上处理数据的巨大空白,并可以和这些技术很好地共存。然而,
+了解Hudi如何适应当前的大数据生态系统,并将其与一些相关系统进行对比,了解这些系统在设计中做的不同权衡将非常有用。
 
 Review comment:
  nit: "了解Hudi如何适应当前的大数据生态系统,并将其与一些相关系统进行对比,了解这些系统在设计中做的不同权衡将非常有用。"
  =>
  "通过将Hudi与一些相关系统进行对比,来了解Hudi如何适应当前的大数据生态系统,并知晓这些系统在设计中做的不同权衡仍将非常有用。"
  (restructures the sentence so that the comparison with related systems becomes the means of understanding where Hudi fits)




[GitHub] [incubator-hudi] yihua commented on a change in pull request #925: [HUDI-256] Translate Comparison page

2019-09-25 Thread GitBox
yihua commented on a change in pull request #925: [HUDI-256] Translate 
Comparison page
URL: https://github.com/apache/incubator-hudi/pull/925#discussion_r328343851
 
 

 ##
 File path: docs/comparison.cn.md
 ##
 @@ -6,53 +6,45 @@ permalink: comparison.html
 toc: false
 ---
 
+## Hive事务
 
+[Hive事务/ACID](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions)是另一项类似的工作,它试图实现像在ORC文件格式之上的`读取时合并`。
 
 Review comment:
  “像在ORC文件格式之上的`读取时合并`” => “在ORC文件格式之上的存储`读取时合并`” (i.e., "storage like `merge-on-read`, on top of the ORC file format")




[GitHub] [incubator-hudi] yihua commented on a change in pull request #925: [HUDI-256] Translate Comparison page

2019-09-25 Thread GitBox
yihua commented on a change in pull request #925: [HUDI-256] Translate 
Comparison page
URL: https://github.com/apache/incubator-hudi/pull/925#discussion_r328345241
 
 

 ##
 File path: docs/comparison.cn.md
 ##
 @@ -6,53 +6,45 @@ permalink: comparison.html
 toc: false
 ---
 
+## Hive事务
 
+[Hive事务/ACID](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions)是另一项类似的工作,它试图实现像在ORC文件格式之上的`读取时合并`。
+可以理解,此功能与Hive以及[LLAP](https://cwiki.apache.org/confluence/display/Hive/LLAP)之类的其他工作紧密相关。
+Hive事务不提供Hudi提供的读取优化存储选项或增量拉取。
+在实现选择方面,Hudi充分利用了类似Spark的处理框架的功能,而Hive事务特性则在用户或Hive Metastore启动的Hive任务/查询的下实现。
+根据我们的生产经验,与其他方法相比,将Hudi作为库嵌入到现有的Spark管道中要容易得多,并且操作不会太繁琐。
+Hudi还设计用于与Presto/Spark等非Hive引擎合作,并将随着时间的推移合并除parquet以外的文件格式。
 
 Review comment:
  “并将随着时间的推移合并” => “并计划集成” / “并计划引入” (i.e., "and plans to incorporate", rather than the literal "will merge over time")




[GitHub] [incubator-hudi] yihua commented on a change in pull request #925: [HUDI-256] Translate Comparison page

2019-09-25 Thread GitBox
yihua commented on a change in pull request #925: [HUDI-256] Translate 
Comparison page
URL: https://github.com/apache/incubator-hudi/pull/925#discussion_r328340017
 
 

 ##
 File path: docs/comparison.cn.md
 ##
 @@ -6,53 +6,45 @@ permalink: comparison.html
 toc: false
 ---
 
+Kudu与分布式文件系统抽象和HDFS完全不同,它自己的一组存储服务器通过RAFT相互通信。
+另一方面,Hudi旨在与底层Hadoop兼容文件系统(HDFS,S3或Ceph)一起使用,并且没有自己的存储服务器群,而是依靠Apache Spark来完成繁重的工作。
+因此,Hudi可以像其他Spark作业一样轻松扩展,而Kudu则需要硬件和运营支持,特别是HBase或Vertica等数据存储系统。
+到目前为止,我们还没有针对Kudu做任何正面的基准测试(鉴于RTTable正在进行中)。
 
 Review comment:
  “我们还没有针对Kudu做任何正面的基准测试” => “我们还没有做任何直接的基准测试来比较Kudu和Hudi” (i.e., "we have not done any head-to-head benchmarks comparing Kudu and Hudi")




[GitHub] [incubator-hudi] yihua commented on a change in pull request #925: [HUDI-256] Translate Comparison page

2019-09-25 Thread GitBox
yihua commented on a change in pull request #925: [HUDI-256] Translate 
Comparison page
URL: https://github.com/apache/incubator-hudi/pull/925#discussion_r328366345
 
 

 ##
 File path: docs/comparison.cn.md
 ##
 @@ -6,53 +6,45 @@ permalink: comparison.html
 toc: false
 ---
 
-## HBase
+## 流式处理
+
+一个普遍的问题:"Hudi与流处理系统有何关系?",我们将在这里尝试回答。简而言之,Hudi可以与当今的批处理(`写时复制存储`)和流处理(`读时合并存储`)作业集成,以将计算结果存储在Hadoop中。
+对于Spark应用程序,这可以通过将Hudi库与Spark/Spark流式DAG直接集成来实现。在非Spark处理系统(例如Flink、Hive)情况下,可以在相应的系统中进行处理,然后通过Kafka主题/DFS中间文件将其发送到Hudi表中。从概念上讲,数据处理管道仅由三个部分组成:`输入`,`处理`,`输出`,用户最终针对输出运行查询以便使用管道的结果。Hudi可以充当将数据存储在DFS上的输入或输出。Hudi在给定流处理管道上的适用性最终归结为适用于Presto/SparkSQL/Hive的查询。
 
 Review comment:
 

[GitHub] [incubator-hudi] yihua commented on a change in pull request #925: [HUDI-256] Translate Comparison page

2019-09-25 Thread GitBox
yihua commented on a change in pull request #925: [HUDI-256] Translate 
Comparison page
URL: https://github.com/apache/incubator-hudi/pull/925#discussion_r328340357
 
 

 ##
 File path: docs/comparison.cn.md
 ##
 @@ -6,53 +6,45 @@ permalink: comparison.html
 toc: false
 ---
 
+但是,如果我们要使用[CERN](https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-存储引擎),
 
 Review comment:
   the hyperlink here shouldn't be translated...




[jira] [Created] (HUDI-280) Integrate Hudi to bigtop

2019-09-25 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-280:
---

 Summary: Integrate Hudi to bigtop
 Key: HUDI-280
 URL: https://issues.apache.org/jira/browse/HUDI-280
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Usability
Reporter: Vinoth Chandar








[jira] [Commented] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs

2019-09-25 Thread Xing Pan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937825#comment-16937825
 ] 

Xing Pan commented on HUDI-269:
---

[~vinoth], I'm planning to use Hudi in our data lake project and am happy to contribute.

Since the naive throttle feature in this ticket will not actually solve the request issue completely, I will do some deeper investigation on this.

> Provide ability to throttle DeltaStreamer sync runs
> ---
>
> Key: HUDI-269
> URL: https://issues.apache.org/jira/browse/HUDI-269
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: BALAJI VARADARAJAN
>Assignee: Xing Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
> Attachments: hudi_request_test.tar.gz, 
> image-2019-09-25-08-51-19-686.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Copied from [https://github.com/apache/incubator-hudi/issues/922]
> In some scenarios in our cluster, we may want the delta streamer to slow down a
> bit, so it would be nice to have a parameter to control the min sync interval of
> each sync in continuous mode.
> This param defaults to 0, so it does not affect the current logic.
> minor pr: [#921|https://github.com/apache/incubator-hudi/pull/921]
> The main reason we want to slow it down is that AWS S3 charges for S3
> get/put/list requests; we don't want to pay for too many requests on a
> really slow-changing table.
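
A minimal sketch of the proposed throttle, assuming a hypothetical minSyncIntervalSeconds knob (default 0, i.e. no throttling) wrapped around each sync run in continuous mode; the class and field names below are illustrative, not the actual DeltaStreamer code:
{code:java}
import java.util.concurrent.TimeUnit;

// Illustrative continuous-mode loop showing the min-sync-interval idea.
public class ThrottledSyncLoop {
  private final long minSyncIntervalSeconds; // proposed knob, defaults to 0
  private volatile boolean shutdownRequested = false; // set elsewhere on shutdown

  public ThrottledSyncLoop(long minSyncIntervalSeconds) {
    this.minSyncIntervalSeconds = minSyncIntervalSeconds;
  }

  public void run() throws InterruptedException {
    long minIntervalMs = TimeUnit.SECONDS.toMillis(minSyncIntervalSeconds);
    while (!shutdownRequested) {
      long start = System.currentTimeMillis();
      syncOnce(); // one sync run (stubbed here)
      long elapsed = System.currentTimeMillis() - start;
      if (elapsed < minIntervalMs) {
        // Sleep off the remainder so a slowly changing table does not
        // issue S3 list/get/head requests in a tight loop.
        Thread.sleep(minIntervalMs - elapsed);
      }
    }
  }

  private void syncOnce() {
    // placeholder for the actual ingest-and-commit work
  }
}
{code}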





[jira] [Comment Edited] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs

2019-09-25 Thread Xing Pan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937358#comment-16937358
 ] 

Xing Pan edited comment on HUDI-269 at 9/25/19 2:56 PM:


 

[~vbalaji]

Yeah, these strange 5K requests are mainly HEAD requests, and they cause a lot of S3 4xx errors, which are classified as "client side errors".

I only have one partition, "1100/01/01"; please find attached *hudi_request_test.tar.gz*.
{code:java}
aws s3 ls s3://xxx/output/1100/01/01/
2019-09-25 01:56:57 93 .hoodie_partition_metadata
2019-09-25 02:12:18 535993 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-111-99_20190925021213.parquet
2019-09-25 02:50:30 679546 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-126-108_20190925025025.parquet
2019-09-25 02:32:27 597943 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-135-117_20190925023222.parquet
2019-09-25 02:38:03 623372 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-135-117_20190925023758.parquet
2019-09-25 02:12:48 537971 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-149-130_20190925021243.parquet
2019-09-25 02:50:39 680323 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-161-136_20190925025033.parquet
2019-09-25 02:32:57 599788 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-173-148_20190925023252.parquet
2019-09-25 02:38:33 625295 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-173-148_20190925023828.parquet
2019-09-25 02:13:18 540308 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-187-161_20190925021313.parquet
2019-09-25 02:50:47 681076 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-196-164_20190925025042.parquet
2019-09-25 02:31:07 591207 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-21-24_20190925023057.parquet
2019-09-25 02:36:48 615894 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-21-24_20190925023637.parquet
2019-09-25 02:50:01 675036 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-21-24_20190925024946.parquet
2019-09-25 02:33:27 602011 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-211-179_20190925023322.parquet
2019-09-25 02:39:03 627524 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-211-179_20190925023858.parquet
2019-09-25 02:13:48 542690 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-225-192_20190925021343.parquet
2019-09-25 02:50:55 681495 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-231-192_20190925025049.parquet
2019-09-25 02:33:57 604273 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-249-210_20190925023352.parquet
2019-09-25 02:39:33 629743 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-249-210_20190925023928.parquet
2019-09-25 02:14:18 545021 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-263-223_20190925021413.parquet
2019-09-25 02:51:03 682267 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-266-220_20190925025058.parquet
2019-09-25 02:34:27 606495 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-287-241_20190925023422.parquet
2019-09-25 02:40:03 632018 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-287-241_20190925023958.parquet
2019-09-25 02:51:11 682667 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-301-248_20190925025106.parquet
2019-09-25 02:14:48 547294 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-301-254_20190925021443.parquet
2019-09-25 02:34:57 608770 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-325-272_20190925023452.parquet
2019-09-25 02:40:33 634280 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-325-272_20190925024028.parquet
2019-09-25 02:51:18 683418 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-336-276_20190925025113.parquet
2019-09-25 02:15:18 549588 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-339-285_20190925021513.parquet
2019-09-25 01:56:59 533148 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-35-37_20190925015651.parquet
2019-09-25 02:35:27 610998 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-363-303_20190925023522.parquet
2019-09-25 02:41:04 636524 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-363-303_20190925024058.parquet
2019-09-25 02:51:26 683833 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-371-304_20190925025121.parquet
2019-09-25 02:15:48 551902 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-377-316_20190925021543.parquet
2019-09-25 02:35:57 613259 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-401-334_20190925023552.parquet
2019-09-25 02:41:33 638757 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-401-334_20190925024128.parquet
2019-09-25 02:51:34 684572 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-406-332_20190925025130.parquet
2019-09-25 02:16:18 553820 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-415-347_20190925021613.parquet
2019-09-25 02:42:03 641007 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-439-365_20190925024158.parquet
2019-09-25 02:51:42 684965 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-441-360_20190925025137.parquet
2019-09-25 02:16:48 556070 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-453-378_20190925021643.parquet
2019-09-25 02:51:49 685729 68d656cc-65a5-47f7-bf28-961315e718bc-0_0-476-388_20190925025144.parque

[jira] [Commented] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs

2019-09-25 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937760#comment-16937760
 ] 

Vinoth Chandar commented on HUDI-269:
-

This is super useful [~XingXPan]. I am currently looking into performance more holistically and will add this to the list of items to consider.






[jira] [Updated] (HUDI-279) Regression in Schema Evolution due to PR-755

2019-09-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-279:

Labels: pull-request-available  (was: )

> Regression in Schema Evolution due to PR-755
> 
>
> Key: HUDI-279
> URL: https://issues.apache.org/jira/browse/HUDI-279
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Common Core
>Reporter: BALAJI VARADARAJAN
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>
> Reported by Alex:
> [https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L200]
> This sets an Avro schema on the config,
> but I see that AvroReadSupport.init is getting a different config instance,
> with the Avro schema set to null, so it falls back to what is in the parquet
> file, which breaks the old/new data merge. I'm pretty sure it worked before, as we
> had successful schema evolutions. Any idea why it might be happening?
>  
> Caused by changes in :
> [https://github.com/apache/incubator-hudi/pull/755]
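
A minimal sketch of the failure mode described above, assuming a reader that looks up the Avro read schema on whatever Configuration instance it is handed; the class and variable names are illustrative, not Hudi's actual code:
{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroReadSupport;

public class SharedConfigPitfall {
  public static void main(String[] args) {
    // Hypothetical evolved schema: one extra nullable column.
    Schema evolved = SchemaBuilder.record("Row").fields()
        .requiredString("key")
        .optionalString("newField")
        .endRecord();

    Configuration writerConf = new Configuration();
    AvroReadSupport.setAvroReadSchema(writerConf, evolved);

    // The instance the schema was set on carries it.
    System.out.println(writerConf.get(AvroReadSupport.AVRO_READ_SCHEMA) != null); // true

    // A different instance (e.g. one built or copied elsewhere) has no Avro
    // read schema, so the parquet reader silently falls back to the file's
    // embedded schema and the old/new record merge breaks on the new column.
    Configuration otherConf = new Configuration();
    System.out.println(otherConf.get(AvroReadSupport.AVRO_READ_SCHEMA)); // null
  }
}
{code}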





[GitHub] [incubator-hudi] bvaradar opened a new pull request #927: [WIP] [HUDI-279] Fix regression in Schema Evolution due to PR-755

2019-09-25 Thread GitBox
bvaradar opened a new pull request #927: [WIP] [HUDI-279] Fix regression in 
Schema Evolution due to PR-755
URL: https://github.com/apache/incubator-hudi/pull/927
 
 
   Jira: https://jira.apache.org/jira/browse/HUDI-279
   




[jira] [Created] (HUDI-279) Regression in Schema Evolution due to PR-755

2019-09-25 Thread BALAJI VARADARAJAN (Jira)
BALAJI VARADARAJAN created HUDI-279:
---

 Summary: Regression in Schema Evolution due to PR-755
 Key: HUDI-279
 URL: https://issues.apache.org/jira/browse/HUDI-279
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Common Core
Reporter: BALAJI VARADARAJAN
 Fix For: 0.5.0







[jira] [Updated] (HUDI-278) Translate Administering page

2019-09-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-278:

Labels: pull-request-available  (was: )

> Translate Administering page
> 
>
> Key: HUDI-278
> URL: https://issues.apache.org/jira/browse/HUDI-278
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: docs-chinese
>Reporter: leesf
>Assignee: leesf
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>
> The online HTML web page: [http://hudi.apache.org/admin_guide.html]





[GitHub] [incubator-hudi] leesf opened a new pull request #926: [HUDI-278] Translate Administering page

2019-09-25 Thread GitBox
leesf opened a new pull request #926: [HUDI-278] Translate Administering page
URL: https://github.com/apache/incubator-hudi/pull/926
 
 
   see [jira-278](https://jira.apache.org/jira/browse/HUDI-278)
   
   cc @yihua @yanghua Could you PTAL when you are free?




[jira] [Updated] (HUDI-256) Translate Comparison page

2019-09-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-256:

Labels: pull-request-available  (was: )

> Translate Comparison page
> -
>
> Key: HUDI-256
> URL: https://issues.apache.org/jira/browse/HUDI-256
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: docs-chinese
>Reporter: leesf
>Assignee: leesf
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>
> The online HTML web page: [https://hudi.apache.org/comparison.html]





[GitHub] [incubator-hudi] leesf opened a new pull request #925: [HUDI-256] Translate Comparison page

2019-09-25 Thread GitBox
leesf opened a new pull request #925: [HUDI-256] Translate Comparison page
URL: https://github.com/apache/incubator-hudi/pull/925
 
 
   see [jira-256](https://jira.apache.org/jira/browse/HUDI-256)
   
   cc @yihua @yanghua Could you PTAL when you are free?




[jira] [Commented] (HUDI-232) Implement sealing/unsealing for HoodieRecord class

2019-09-25 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937477#comment-16937477
 ] 

leesf commented on HUDI-232:


How about adding seal and unseal methods to HoodieRecord? An error will be thrown if HoodieRecord is modified after being sealed, and modification is allowed after it is unsealed. cc [~vinoth]

> Implement sealing/unsealing for HoodieRecord class
> --
>
> Key: HUDI-232
> URL: https://issues.apache.org/jira/browse/HUDI-232
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Write Client
>Affects Versions: 0.5.0
>Reporter: Vinoth Chandar
>Priority: Major
>
> The HoodieRecord class is sometimes modified to set the record location. We can
> get into issues like HUDI-170 if the modification is misplaced. We need a
> mechanism to seal the class and explicitly unseal it for modification. Trying to
> modify it in the sealed state should throw an error.
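
A hedged sketch of that seal/unseal proposal (the names are illustrative, not the actual HoodieRecord API): mutators consult a sealed flag and throw until the caller explicitly unseals.
{code:java}
// Illustrative only: a record whose mutators are guarded by a seal bit.
public class SealableRecord {
  private boolean sealed = false;
  private String currentLocation;

  public void seal() { this.sealed = true; }

  public void unseal() { this.sealed = false; }

  public void setCurrentLocation(String location) {
    checkState();
    this.currentLocation = location;
  }

  // Surfaces misplaced modifications (like HUDI-170) at the point of the
  // bug instead of corrupting state silently.
  private void checkState() {
    if (sealed) {
      throw new UnsupportedOperationException(
          "Not allowed to modify a sealed record; call unseal() first");
    }
  }
}
{code}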





[GitHub] [incubator-hudi] leesf commented on issue #901: [HUDI-255] Translate Talks & Powered By page

2019-09-25 Thread GitBox
leesf commented on issue #901: [HUDI-255] Translate Talks & Powered By page
URL: https://github.com/apache/incubator-hudi/pull/901#issuecomment-534885490
 
 
@yihua Thanks for the review. Opened another [PR#924](https://github.com/apache/incubator-hudi/pull/924) to address your comments.




[GitHub] [incubator-hudi] leesf opened a new pull request #924: [minor][docs-chinese] Improve translation

2019-09-25 Thread GitBox
leesf opened a new pull request #924: [minor][docs-chinese] Improve translation
URL: https://github.com/apache/incubator-hudi/pull/924
 
 
   This PR is opened to address the comments on PR [901](https://github.com/apache/incubator-hudi/pull/901)
   
   cc @yihua @vinothchandar 

