This is an automated email from the ASF dual-hosted git repository. vinoth pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git
The following commit(s) were added to refs/heads/asf-site by this push: new cef5769 [HUDI-256] Translate Comparison page (#925) cef5769 is described below commit cef57691228de09429cd8794117dee6fc8f729d2 Author: leesf <490081...@qq.com> AuthorDate: Thu Sep 26 10:00:59 2019 +0800 [HUDI-256] Translate Comparison page (#925) * [HUDI-256] Translate Comparison page * [hotfix] address comments --- docs/comparison.cn.md | 71 ++++++++++++++++++++++----------------------------- 1 file changed, 31 insertions(+), 40 deletions(-) diff --git a/docs/comparison.cn.md b/docs/comparison.cn.md index a606c94..bb971b2 100644 --- a/docs/comparison.cn.md +++ b/docs/comparison.cn.md @@ -6,53 +6,44 @@ permalink: comparison.html toc: false --- -Apache Hudi fills a big void for processing data on top of DFS, and thus mostly co-exists nicely with these technologies. However, -it would be useful to understand how Hudi fits into the current big data ecosystem, contrasting it with a few related systems -and bring out the different tradeoffs these systems have accepted in their design. +Apache Hudi填补了在DFS上处理数据的巨大空白,并可以和这些技术很好地共存。然而, +通过将Hudi与一些相关系统进行对比,来了解Hudi如何适应当前的大数据生态系统,并知晓这些系统在设计中做的不同权衡仍将非常有用。 ## Kudu -[Apache Kudu](https://kudu.apache.org) is a storage system that has similar goals as Hudi, which is to bring real-time analytics on petabytes of data via first -class support for `upserts`. A key differentiator is that Kudu also attempts to serve as a datastore for OLTP workloads, something that Hudi does not aspire to be. -Consequently, Kudu does not support incremental pulling (as of early 2017), something Hudi does to enable incremental processing use cases. +[Apache Kudu](https://kudu.apache.org)是一个与Hudi具有相似目标的存储系统,该系统通过对`upserts`支持来对PB级数据进行实时分析。 +一个关键的区别是Kudu还试图充当OLTP工作负载的数据存储,而Hudi并不希望这样做。 +因此,Kudu不支持增量拉取(截至2017年初),而Hudi支持以便进行增量处理。 +Kudu与分布式文件系统抽象和HDFS完全不同,它自己的一组存储服务器通过RAFT相互通信。 +与之不同的是,Hudi旨在与底层Hadoop兼容的文件系统(HDFS,S3或Ceph)一起使用,并且没有自己的存储服务器群,而是依靠Apache Spark来完成繁重的工作。 +因此,Hudi可以像其他Spark作业一样轻松扩展,而Kudu则需要硬件和运营支持,特别是HBase或Vertica等数据存储系统。 +到目前为止,我们还没有做任何直接的基准测试来比较Kudu和Hudi(鉴于RTTable正在进行中)。 +但是,如果我们要使用[CERN](https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines), +我们预期Hudi在摄取parquet上有更卓越的性能。 -Kudu diverges from a distributed file system abstraction and HDFS altogether, with its own set of storage servers talking to each other via RAFT. -Hudi, on the other hand, is designed to work with an underlying Hadoop compatible filesystem (HDFS,S3 or Ceph) and does not have its own fleet of storage servers, -instead relying on Apache Spark to do the heavy-lifting. Thu, Hudi can be scaled easily, just like other Spark jobs, while Kudu would require hardware -& operational support, typical to datastores like HBase or Vertica. We have not at this point, done any head to head benchmarks against Kudu (given RTTable is WIP). -But, if we were to go with results shared by [CERN](https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines) , -we expect Hudi to positioned at something that ingests parquet with superior performance. +## Hive事务 +[Hive事务/ACID](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions)是另一项类似的工作,它试图实现在ORC文件格式之上的存储`读取时合并`。 +可以理解,此功能与Hive以及[LLAP](https://cwiki.apache.org/confluence/display/Hive/LLAP)之类的其他工作紧密相关。 +Hive事务不提供Hudi提供的读取优化存储选项或增量拉取。 +在实现选择方面,Hudi充分利用了类似Spark的处理框架的功能,而Hive事务特性则在用户或Hive Metastore启动的Hive任务/查询的下实现。 +根据我们的生产经验,与其他方法相比,将Hudi作为库嵌入到现有的Spark管道中要容易得多,并且操作不会太繁琐。 +Hudi还设计用于与Presto/Spark等非Hive引擎合作,并计划引入除parquet以外的文件格式。 -## Hive Transactions +## HBase -[Hive Transactions/ACID](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions) is another similar effort, which tries to implement storage like -`merge-on-read`, on top of ORC file format. Understandably, this feature is heavily tied to Hive and other efforts like [LLAP](https://cwiki.apache.org/confluence/display/Hive/LLAP). -Hive transactions does not offer the read-optimized storage option or the incremental pulling, that Hudi does. In terms of implementation choices, Hudi leverages -the full power of a processing framework like Spark, while Hive transactions feature is implemented underneath by Hive tasks/queries kicked off by user or the Hive metastore. -Based on our production experience, embedding Hudi as a library into existing Spark pipelines was much easier and less operationally heavy, compared with the other approach. -Hudi is also designed to work with non-hive enginers like Presto/Spark and will incorporate file formats other than parquet over time. +尽管[HBase](https://hbase.apache.org)最终是OLTP工作负载的键值存储层,但由于与Hadoop的相似性,用户通常倾向于将HBase与分析相关联。 +鉴于HBase经过严格的写优化,它支持开箱即用的亚秒级更新,Hive-on-HBase允许用户查询该数据。 但是,就分析工作负载的实际性能而言,Parquet/ORC之类的混合列式存储格式可以轻松击败HBase,因为这些工作负载主要是读取繁重的工作。 +Hudi弥补了更快的数据与分析存储格式之间的差距。从运营的角度来看,与管理分析使用的HBase region服务器集群相比,为用户提供可更快给出数据的库更具可扩展性。 +最终,HBase不像Hudi这样重点支持`提交时间`、`增量拉取`之类的增量处理原语。 -## HBase +## 流式处理 + +一个普遍的问题:"Hudi与流处理系统有何关系?",我们将在这里尝试回答。简而言之,Hudi可以与当今的批处理(`写时复制存储`)和流处理(`读时合并存储`)作业集成,以将计算结果存储在Hadoop中。 +对于Spark应用程序,这可以通过将Hudi库与Spark/Spark流式DAG直接集成来实现。在非Spark处理系统(例如Flink、Hive)情况下,可以在相应的系统中进行处理,然后通过Kafka主题/DFS中间文件将其发送到Hudi表中。从概念上讲,数据处理 +管道仅由三个部分组成:`输入`,`处理`,`输出`,用户最终针对输出运行查询以便使用管道的结果。Hudi可以充当将数据存储在DFS上的输入或输出。Hudi在给定流处理管道上的适用性最终归结为你的查询在Presto/SparkSQL/Hive的适用性。 -Even though [HBase](https://hbase.apache.org) is ultimately a key-value store for OLTP workloads, users often tend to associate HBase with analytics given the proximity to Hadoop. -Given HBase is heavily write-optimized, it supports sub-second upserts out-of-box and Hive-on-HBase lets users query that data. However, in terms of actual performance for analytical workloads, -hybrid columnar storage formats like Parquet/ORC handily beat HBase, since these workloads are predominantly read-heavy. Hudi bridges this gap between faster data and having -analytical storage formats. From an operational perspective, arming users with a library that provides faster data, is more scalable, than managing a big farm of HBase region servers, -just for analytics. Finally, HBase does not support incremental processing primitives like `commit times`, `incremental pull` as first class citizens like Hudi. - -## Stream Processing - -A popular question, we get is : "How does Hudi relate to stream processing systems?", which we will try to answer here. Simply put, Hudi can integrate with -batch (`copy-on-write storage`) and streaming (`merge-on-read storage`) jobs of today, to store the computed results in Hadoop. For Spark apps, this can happen via direct -integration of Hudi library with Spark/Spark streaming DAGs. In case of Non-Spark processing systems (eg: Flink, Hive), the processing can be done in the respective systems -and later sent into a Hudi table via a Kafka topic/DFS intermediate file. In more conceptual level, data processing -pipelines just consist of three components : `source`, `processing`, `sink`, with users ultimately running queries against the sink to use the results of the pipeline. -Hudi can act as either a source or sink, that stores data on DFS. Applicability of Hudi to a given stream processing pipeline ultimately boils down to suitability -of Presto/SparkSQL/Hive for your queries. - -More advanced use cases revolve around the concepts of [incremental processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop), which effectively -uses Hudi even inside the `processing` engine to speed up typical batch pipelines. For e.g: Hudi can be used as a state store inside a processing DAG (similar -to how [rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend) is used by Flink). This is an item on the roadmap -and will eventually happen as a [Beam Runner](https://issues.apache.org/jira/browse/HUDI-60) +更高级的用例围绕[增量处理](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)的概念展开, +甚至在`处理`引擎内部也使用Hudi来加速典型的批处理管道。例如:Hudi可用作DAG内的状态存储(类似Flink使用的[rocksDB(https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend))。 +这是路线图上的一个项目并将最终以[Beam Runner](https://issues.apache.org/jira/browse/HUDI-60)的形式呈现。 \ No newline at end of file