This is an automated email from the ASF dual-hosted git repository. mergebot-role pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/beam-site.git
commit 7bb066d2985e4267552c0463109c06c9a7d1ab2e Author: Mergebot <merge...@apache.org> AuthorDate: Mon Aug 20 21:42:55 2018 +0000 Prepare repository for deployment. --- .../08/20/review-input-streaming-connectors.html | 436 +++++++++++++++++++++ content/blog/index.html | 17 + content/feed.xml | 300 ++++++++++---- content/index.html | 10 +- 4 files changed, 686 insertions(+), 77 deletions(-) diff --git a/content/blog/2018/08/20/review-input-streaming-connectors.html b/content/blog/2018/08/20/review-input-streaming-connectors.html new file mode 100644 index 0000000..1bdde6d --- /dev/null +++ b/content/blog/2018/08/20/review-input-streaming-connectors.html @@ -0,0 +1,436 @@ +<!DOCTYPE html> +<!-- + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. See accompanying LICENSE file. +--> + +<html lang="en"> + <!-- + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. See accompanying LICENSE file. +--> + +<head> + <meta charset="utf-8"> + <meta http-equiv="X-UA-Compatible" content="IE=edge"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <title>A review of input streaming connectors</title> + <meta name="description" content="In this post, you’ll learn about the current state of support for input streaming connectors in Apache Beam. For more context, you’ll also learn about the co..."> + <link href="https://fonts.googleapis.com/css?family=Roboto:100,300,400" rel="stylesheet"> + <link rel="stylesheet" href="/css/site.css"> + <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.0/jquery.min.js"></script> + <script src="/js/bootstrap.min.js"></script> + <script src="/js/language-switch.js"></script> + <script src="/js/fix-menu.js"></script> + <script src="/js/section-nav.js"></script> + <script src="/js/page-nav.js"></script> + <link rel="canonical" href="https://beam.apache.org/blog/2018/08/20/review-input-streaming-connectors.html" data-proofer-ignore> + <link rel="shortcut icon" type="image/x-icon" href="/images/favicon.ico"> + <link rel="alternate" type="application/rss+xml" title="Apache Beam" href="https://beam.apache.org/feed.xml"> + <script> + (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ + (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), + m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) + })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); + ga('create', 'UA-73650088-1', 'auto'); + ga('send', 'pageview'); + </script> +</head> + + <body class="body "> + <!-- + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. See accompanying LICENSE file. +--> + +<nav class="header navbar navbar-fixed-top"> + <div class="navbar-header"> + <button type="button" class="navbar-toggle" aria-expanded="false" aria-controls="navbar"> + <span class="sr-only">Toggle navigation</span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + + <a href="/" class="navbar-brand" > + <img alt="Brand" style="height: 25px" src="/images/beam_logo_navbar.png"> + </a> + </div> + + <div class="navbar-mask closed"></div> + + <div id="navbar" class="navbar-container closed"> + <ul class="nav navbar-nav"> + <li> + <a href="/get-started/beam-overview/">Get Started</a> + </li> + <li> + <a href="/documentation/">Documentation</a> + </li> + <li> + <a href="/documentation/sdks/java/">SDKS</a> + </li> + <li> + <a href="/documentation/runners/capability-matrix/">RUNNERS</a> + </li> + <li> + <a href="/contribute/">Contribute</a> + </li> + <li> + <a href="/community/contact-us/">Community</a> + </li> + <li><a href="/blog">Blog</a></li> + </ul> + <ul class="nav navbar-nav navbar-right"> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false"><img src="https://www.apache.org/foundation/press/kit/feather_small.png" alt="Apache Logo" style="height:20px;"><span class="caret"></span></a> + <ul class="dropdown-menu dropdown-menu-right"> + <li><a href="http://www.apache.org/">ASF Homepage</a></li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> + <li><a href="https://www.apache.org/foundation/policies/conduct">Code of Conduct</a></li> + </ul> + </li> + </ul> + </div> +</nav> + + <div class="body__contained"> + <!-- + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. See accompanying LICENSE file. +--> + + + +<article class="post" itemscope itemtype="http://schema.org/BlogPosting"> + + <header class="post-header"> + <h1 class="post-title" itemprop="name headline">A review of input streaming connectors</h1> + <p class="post-meta"><time datetime="2018-08-20T01:00:01-07:00" itemprop="datePublished">Aug 20, 2018</time> • + Leonid Kuligin [<a href="https://twitter.com/lkulighin">@lkulighin</a>] & Julien Phalip [<a href="https://twitter.com/julienphalip">@julienphalip</a>] + + </p> + </header> + + <div class="post-content" itemprop="articleBody"> + <p>In this post, you’ll learn about the current state of support for input streaming connectors in <a href="/">Apache Beam</a>. For more context, you’ll also learn about the corresponding state of support in <a href="https://spark.apache.org/">Apache Spark</a>.<!--more--></p> + +<p>With batch processing, you might load data from any source, including a database system. Even if there are no specific SDKs available for those database systems, you can often resort to using a <a href="https://en.wikipedia.org/wiki/Java_Database_Connectivity">JDBC</a> driver. With streaming, implementing a proper data pipeline is arguably more challenging as generally fewer source types are available. For that reason, this article particularly focuses on the streaming use case.</p> + +<h2 id="connectors-for-java">Connectors for Java</h2> + +<p>Beam has an official <a href="/documentation/sdks/java/">Java SDK</a> and has several execution engines, called <a href="/documentation/runners/capability-matrix/">runners</a>. In most cases it is fairly easy to transfer existing Beam pipelines written in Java or Scala to a Spark environment by using the <a href="/documentation/runners/spark/">Spark Runner</a>.</p> + +<p>Spark is written in Scala and has a <a href="https://spark.apache.org/docs/latest/api/java/">Java API</a>. Spark’s source code compiles to <a href="https://en.wikipedia.org/wiki/Java_(programming_language)#Java_JVM_and_Bytecode">Java bytecode</a> and the binaries are run by a <a href="https://en.wikipedia.org/wiki/Java_virtual_machine">Java Virtual Machine</a>. Scala code is interoperable with Java and therefore has native compatibility with Java libraries (and vice versa).</p> + +<p>Spark offers two approaches to streaming: <a href="https://spark.apache.org/docs/latest/streaming-programming-guide.html">Discretized Streaming</a> (or DStreams) and <a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html">Structured Streaming</a>. DStreams are a basic abstraction that represents a continuous series of <a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html">Resilient Distributed Datasets</a> (or RDDs). Structured Str [...] + +<p>Spark Structured Streaming supports <a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/streaming/DataStreamReader.html">file sources</a> (local filesystems and HDFS-compatible systems like Cloud Storage or S3) and <a href="https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html">Kafka</a> as streaming <a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources">inputs</a>. Spark maintains bui [...] + +<p>Below are the main streaming input connectors for available for Beam and Spark DStreams in Java:</p> + +<table class="table table-bordered"> + <tr> + <td> + </td> + <td> + </td> + <td><strong>Apache Beam</strong> + </td> + <td><strong>Apache Spark DStreams</strong> + </td> + </tr> + <tr> + <td rowspan="2">File Systems + </td> + <td>Local<br />(Using the <code>file://</code> URI) + </td> + <td><a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/TextIO.html">TextIO</a> + </td> + <td><a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/StreamingContext.html#textFileStream-java.lang.String-">textFileStream</a><br />(Spark treats most Unix systems as HDFS-compatible, but the location should be accessible from all nodes) + </td> + </tr> + <tr> + <td>HDFS<br />(Using the <code>hdfs://</code> URI) + </td> + <td><a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/FileIO.html">FileIO</a> + <a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.html">HadoopFileSystemOptions</a> + </td> + <td><a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/util/HdfsUtils.html">HdfsUtils</a> + </td> + </tr> + <tr> + <td rowspan="2">Object Stores + </td> + <td>Cloud Storage<br />(Using the <code>gs://</code> URI) + </td> + <td><a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/FileIO.html">FileIO</a> + <a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/extensions/gcp/options/GcsOptions.html">GcsOptions</a> + </td> + <td rowspan="2"><a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopConfiguration--">hadoopConfiguration</a> +and <a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/StreamingContext.html#textFileStream-java.lang.String-">textFileStream</a> + </td> + </tr> + <tr> + <td>S3<br />(Using the <code>s3://</code> URI) + </td> + <td><a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/FileIO.html">FileIO</a> + <a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/aws/options/S3Options.html">S3Options</a> + </td> + </tr> + <tr> + <td rowspan="3">Messaging Queues + </td> + <td>Kafka + </td> + <td><a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/kafka/KafkaIO.html">KafkaIO</a> + </td> + <td><a href="https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html">spark-streaming-kafka</a> + </td> + </tr> + <tr> + <td>Kinesis + </td> + <td><a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/kinesis/KinesisIO.html">KinesisIO</a> + </td> + <td><a href="https://spark.apache.org/docs/latest/streaming-kinesis-integration.html">spark-streaming-kinesis</a> + </td> + </tr> + <tr> + <td>Cloud Pub/Sub + </td> + <td><a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.html">PubsubIO</a> + </td> + <td><a href="https://github.com/apache/bahir/tree/master/streaming-pubsub">spark-streaming-pubsub</a> from <a href="http://bahir.apache.org">Apache Bahir</a> + </td> + </tr> + <tr> + <td>Other + </td> + <td>Custom receivers + </td> + <td><a href="/documentation/io/authoring-overview/#read-transforms">Read Transforms</a> + </td> + <td><a href="https://spark.apache.org/docs/latest/streaming-custom-receivers.html">receiverStream</a> + </td> + </tr> +</table> + +<h2 id="connectors-for-python">Connectors for Python</h2> + +<p>Beam has an official <a href="/documentation/sdks/python/">Python SDK</a> that currently supports a subset of the streaming features available in the Java SDK. Active development is underway to bridge the gap between the featuresets in the two SDKs. Currently for Python, the <a href="/documentation/runners/direct/">Direct Runner</a> and <a href="/documentation/runners/dataflow/">Dataflow Runner</a> are supported, and <a href="/documentation/sdks/python-streaming/">several streaming op [...] + +<p>Spark also has a Python SDK called <a href="http://spark.apache.org/docs/latest/api/python/pyspark.html">PySpark</a>. As mentioned earlier, Scala code compiles to a bytecode that is executed by the JVM. PySpark uses <a href="https://www.py4j.org/">Py4J</a>, a library that enables Python programs to interact with the JVM and therefore access Java libraries, interact with Java objects, and register callbacks from Java. This allows PySpark to access native Spark objects like RDDs. Spark [...] + +<p>Below are the main streaming input connectors for available for Beam and Spark DStreams in Python:</p> + +<table class="table table-bordered"> + <tr> + <td> + </td> + <td> + </td> + <td><strong>Apache Beam</strong> + </td> + <td><strong>Apache Spark DStreams</strong> + </td> + </tr> + <tr> + <td rowspan="2">File Systems + </td> + <td>Local + </td> + <td><a href="/documentation/sdks/pydoc/2.6.0/apache_beam.io.textio.html">io.textio</a> + </td> + <td><a href="http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext.textFileStream">textFileStream</a> + </td> + </tr> + <tr> + <td>HDFS + </td> + <td><a href="/documentation/sdks/pydoc/2.6.0/apache_beam.io.hadoopfilesystem.html">io.hadoopfilesystem</a> + </td> + <td><a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopConfiguration--">hadoopConfiguration</a> (Access through <code>sc._jsc</code> with Py4J) +and <a href="http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext.textFileStream">textFileStream</a> + </td> + </tr> + <tr> + <td rowspan="2">Object stores + </td> + <td>Google Cloud Storage + </td> + <td><a href="/documentation/sdks/pydoc/2.6.0/apache_beam.io.gcp.gcsio.html">io.gcp.gcsio</a> + </td> + <td rowspan="2"><a href="http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext.textFileStream">textFileStream</a> + </td> + </tr> + <tr> + <td>S3 + </td> + <td>N/A + </td> + </tr> + <tr> + <td rowspan="3">Messaging Queues + </td> + <td>Kafka + </td> + <td>N/A + </td> + <td><a href="http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.kafka.KafkaUtils">KafkaUtils</a> + </td> + </tr> + <tr> + <td>Kinesis + </td> + <td>N/A + </td> + <td><a href="http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#module-pyspark.streaming.kinesis">KinesisUtils</a> + </td> + </tr> + <tr> + <td>Cloud Pub/Sub + </td> + <td><a href="/documentation/sdks/pydoc/2.6.0/apache_beam.io.gcp.pubsub.html">io.gcp.pubsub</a> + </td> + <td>N/A + </td> + </tr> + <tr> + <td>Other + </td> + <td>Custom receivers + </td> + <td><a href="/documentation/sdks/python-custom-io/">BoundedSource and RangeTracker</a> + </td> + <td>N/A + </td> + </tr> +</table> + +<h2 id="connectors-for-other-languages">Connectors for other languages</h2> + +<h3 id="scala">Scala</h3> + +<p>Since Scala code is interoperable with Java and therefore has native compatibility with Java libraries (and vice versa), you can use the same Java connectors described above in your Scala programs. Apache Beam also has a <a href="https://github.com/spotify/scio">Scala API</a> open-sourced <a href="https://labs.spotify.com/2017/10/16/big-data-processing-at-spotify-the-road-to-scio-part-1/">by Spotify</a>.</p> + +<h3 id="go">Go</h3> + +<p>A <a href="/documentation/sdks/go/">Go SDK</a> for Apache Beam is under active development. It is currently experimental and is not recommended for production. Spark does not have an official Go SDK.</p> + +<h3 id="r">R</h3> + +<p>Apache Beam does not have an official R SDK. Spark Structured Streaming is supported by an <a href="https://spark.apache.org/docs/latest/sparkr.html#structured-streaming">R SDK</a>, but only for <a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources">file sources</a> as a streaming input.</p> + +<h2 id="next-steps">Next steps</h2> + +<p>We hope this article inspired you to try new and interesting ways of connecting streaming sources to your Beam pipelines!</p> + +<p>Check out the following links for further information:</p> + +<ul> + <li>See a full list of all built-in and in-progress <a href="/documentation/io/built-in/">I/O Transforms</a> for Apache Beam.</li> + <li>Learn about some Apache Beam mobile gaming pipeline <a href="/get-started/mobile-gaming-example/">examples</a>.</li> +</ul> + + </div> + +</article> + + </div> + <!-- + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. See accompanying LICENSE file. +--> + +<footer class="footer"> + <div class="footer__contained"> + <div class="footer__cols"> + <div class="footer__cols__col"> + <div class="footer__cols__col__logo"> + <img src="/images/beam_logo_circle.svg" class="footer__logo" alt="Beam logo"> + </div> + <div class="footer__cols__col__logo"> + <img src="/images/apache_logo_circle.svg" class="footer__logo" alt="Apache logo"> + </div> + </div> + <div class="footer__cols__col footer__cols__col--md"> + <div class="footer__cols__col__title">Start</div> + <div class="footer__cols__col__link"><a href="/get-started/beam-overview/">Overview</a></div> + <div class="footer__cols__col__link"><a href="/get-started/quickstart-java/">Quickstart (Java)</a></div> + <div class="footer__cols__col__link"><a href="/get-started/quickstart-py/">Quickstart (Python)</a></div> + <div class="footer__cols__col__link"><a href="/get-started/quickstart-go/">Quickstart (Go)</a></div> + <div class="footer__cols__col__link"><a href="/get-started/downloads/">Downloads</a></div> + </div> + <div class="footer__cols__col footer__cols__col--md"> + <div class="footer__cols__col__title">Docs</div> + <div class="footer__cols__col__link"><a href="/documentation/programming-guide/">Concepts</a></div> + <div class="footer__cols__col__link"><a href="/documentation/pipelines/design-your-pipeline/">Pipelines</a></div> + <div class="footer__cols__col__link"><a href="/documentation/runners/capability-matrix/">Runners</a></div> + </div> + <div class="footer__cols__col footer__cols__col--md"> + <div class="footer__cols__col__title">Community</div> + <div class="footer__cols__col__link"><a href="/contribute/">Contribute</a></div> + <div class="footer__cols__col__link"><a href="https://projects.apache.org/committee.html?beam" target="_blank">Team<img src="/images/external-link-icon.png" + width="14" height="14" + alt="External link."></a></div> + <div class="footer__cols__col__link"><a href="/contribute/presentation-materials/">Media</a></div> + </div> + <div class="footer__cols__col footer__cols__col--md"> + <div class="footer__cols__col__title">Resources</div> + <div class="footer__cols__col__link"><a href="/blog/">Blog</a></div> + <div class="footer__cols__col__link"><a href="/get-started/support/">Support</a></div> + <div class="footer__cols__col__link"><a href="https://github.com/apache/beam">GitHub</a></div> + </div> + </div> + </div> + <div class="footer__bottom"> + © + <a href="http://www.apache.org">The Apache Software Foundation</a> + | <a href="/privacy_policy">Privacy Policy</a> + | <a href="/feed.xml">RSS Feed</a> + <br><br> + Apache Beam, Apache, Beam, the Beam logo, and the Apache feather logo are + either registered trademarks or trademarks of The Apache Software + Foundation. All other products or name brands are trademarks of their + respective holders, including The Apache Software Foundation. + </div> +</footer> + + </body> +</html> diff --git a/content/blog/index.html b/content/blog/index.html index 8040636..08b2478 100644 --- a/content/blog/index.html +++ b/content/blog/index.html @@ -139,6 +139,23 @@ limitations under the License. <p>This is the blog for the Apache Beam project. This blog contains news and updates for the project.</p> +<h3 id="a-review-of-input-streaming-connectors"><a class="post-link" href="/blog/2018/08/20/review-input-streaming-connectors.html">A review of input streaming connectors</a></h3> +<p><i>Aug 20, 2018 • + Leonid Kuligin [<a href="https://twitter.com/lkulighin">@lkulighin</a>] & Julien Phalip [<a href="https://twitter.com/julienphalip">@julienphalip</a>] +</i></p> + +<p>In this post, you’ll learn about the current state of support for input streaming connectors in <a href="/">Apache Beam</a>. For more context, you’ll also learn about the corresponding state of support in <a href="https://spark.apache.org/">Apache Spark</a>.</p> + +<!-- Render a "read more" button if the post is longer than the excerpt --> + +<p> +<a class="btn btn-default btn-sm" href="/blog/2018/08/20/review-input-streaming-connectors.html" role="button"> +Read more <span class="glyphicon glyphicon-menu-right" aria-hidden="true"></span> +</a> +</p> + +<hr /> + <h3 id="apache-beam-260"><a class="post-link" href="/blog/2018/08/10/beam-2.6.0.html">Apache Beam 2.6.0</a></h3> <p><i>Aug 10, 2018 • Pablo Estrada [<a href="https://twitter.com/polecitoem">@polecitoem</a>] & Rafael Fernández diff --git a/content/feed.xml b/content/feed.xml index a77d2a6..4fa53fd 100644 --- a/content/feed.xml +++ b/content/feed.xml @@ -20,6 +20,234 @@ <generator>Jekyll v3.2.0</generator> <item> + <title>A review of input streaming connectors</title> + <description><p>In this post, you’ll learn about the current state of support for input streaming connectors in <a href="/">Apache Beam</a>. For more context, you’ll also learn about the corresponding state of support in <a href="https://spark.apache.org/">Apache Spark</a>.<!--more--></p> + +<p>With batch processing, you might load data from any source, including a database system. Even if there are no specific SDKs available for those database systems, you can often resort to using a <a href="https://en.wikipedia.org/wiki/Java_Database_Connectivity">JDBC</a> driver. With streaming, implementing a proper data pipeline is arguably more challenging as generally fewer source types are available. For that reason, this article particularly focuses on t [...] + +<h2 id="connectors-for-java">Connectors for Java</h2> + +<p>Beam has an official <a href="/documentation/sdks/java/">Java SDK</a> and has several execution engines, called <a href="/documentation/runners/capability-matrix/">runners</a>. In most cases it is fairly easy to transfer existing Beam pipelines written in Java or Scala to a Spark environment by using the <a href="/documentation/runners/spark/">Spark Runner</a>.</p> + +<p>Spark is written in Scala and has a <a href="https://spark.apache.org/docs/latest/api/java/">Java API</a>. Spark’s source code compiles to <a href="https://en.wikipedia.org/wiki/Java_(programming_language)#Java_JVM_and_Bytecode">Java bytecode</a> and the binaries are run by a <a href="https://en.wikipedia.org/wiki/Java_virtual_machine">Java Virtual Machine</a>. Scala code is interoperable with Java and therefore h [...] + +<p>Spark offers two approaches to streaming: <a href="https://spark.apache.org/docs/latest/streaming-programming-guide.html">Discretized Streaming</a> (or DStreams) and <a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html">Structured Streaming</a>. DStreams are a basic abstraction that represents a continuous series of <a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html&quo [...] + +<p>Spark Structured Streaming supports <a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/streaming/DataStreamReader.html">file sources</a> (local filesystems and HDFS-compatible systems like Cloud Storage or S3) and <a href="https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html">Kafka</a> as streaming <a href="https://spark.apache.org/docs/latest/structured-streaming-programming [...] + +<p>Below are the main streaming input connectors for available for Beam and Spark DStreams in Java:</p> + +<table class="table table-bordered"> + <tr> + <td> + </td> + <td> + </td> + <td><strong>Apache Beam</strong> + </td> + <td><strong>Apache Spark DStreams</strong> + </td> + </tr> + <tr> + <td rowspan="2">File Systems + </td> + <td>Local<br />(Using the <code>file://</code> URI) + </td> + <td><a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/TextIO.html">TextIO</a> + </td> + <td><a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/StreamingContext.html#textFileStream-java.lang.String-">textFileStream</a><br />(Spark treats most Unix systems as HDFS-compatible, but the location should be accessible from all nodes) + </td> + </tr> + <tr> + <td>HDFS<br />(Using the <code>hdfs://</code> URI) + </td> + <td><a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/FileIO.html">FileIO</a> + <a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.html">HadoopFileSystemOptions</a> + </td> + <td><a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/util/HdfsUtils.html">HdfsUtils</a> + </td> + </tr> + <tr> + <td rowspan="2">Object Stores + </td> + <td>Cloud Storage<br />(Using the <code>gs://</code> URI) + </td> + <td><a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/FileIO.html">FileIO</a> + <a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/extensions/gcp/options/GcsOptions.html">GcsOptions</a> + </td> + <td rowspan="2"><a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopConfiguration--">hadoopConfiguration</a> +and <a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/StreamingContext.html#textFileStream-java.lang.String-">textFileStream</a> + </td> + </tr> + <tr> + <td>S3<br />(Using the <code>s3://</code> URI) + </td> + <td><a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/FileIO.html">FileIO</a> + <a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/aws/options/S3Options.html">S3Options</a> + </td> + </tr> + <tr> + <td rowspan="3">Messaging Queues + </td> + <td>Kafka + </td> + <td><a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/kafka/KafkaIO.html">KafkaIO</a> + </td> + <td><a href="https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html">spark-streaming-kafka</a> + </td> + </tr> + <tr> + <td>Kinesis + </td> + <td><a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/kinesis/KinesisIO.html">KinesisIO</a> + </td> + <td><a href="https://spark.apache.org/docs/latest/streaming-kinesis-integration.html">spark-streaming-kinesis</a> + </td> + </tr> + <tr> + <td>Cloud Pub/Sub + </td> + <td><a href="/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.html">PubsubIO</a> + </td> + <td><a href="https://github.com/apache/bahir/tree/master/streaming-pubsub">spark-streaming-pubsub</a> from <a href="http://bahir.apache.org">Apache Bahir</a> + </td> + </tr> + <tr> + <td>Other + </td> + <td>Custom receivers + </td> + <td><a href="/documentation/io/authoring-overview/#read-transforms">Read Transforms</a> + </td> + <td><a href="https://spark.apache.org/docs/latest/streaming-custom-receivers.html">receiverStream</a> + </td> + </tr> +</table> + +<h2 id="connectors-for-python">Connectors for Python</h2> + +<p>Beam has an official <a href="/documentation/sdks/python/">Python SDK</a> that currently supports a subset of the streaming features available in the Java SDK. Active development is underway to bridge the gap between the featuresets in the two SDKs. Currently for Python, the <a href="/documentation/runners/direct/">Direct Runner</a> and <a href="/documentation/runners/dataflow/">Dataflow Runner</a> are supported, [...] + +<p>Spark also has a Python SDK called <a href="http://spark.apache.org/docs/latest/api/python/pyspark.html">PySpark</a>. As mentioned earlier, Scala code compiles to a bytecode that is executed by the JVM. PySpark uses <a href="https://www.py4j.org/">Py4J</a>, a library that enables Python programs to interact with the JVM and therefore access Java libraries, interact with Java objects, and register callbacks from Java. This allows PySpar [...] + +<p>Below are the main streaming input connectors for available for Beam and Spark DStreams in Python:</p> + +<table class="table table-bordered"> + <tr> + <td> + </td> + <td> + </td> + <td><strong>Apache Beam</strong> + </td> + <td><strong>Apache Spark DStreams</strong> + </td> + </tr> + <tr> + <td rowspan="2">File Systems + </td> + <td>Local + </td> + <td><a href="/documentation/sdks/pydoc/2.6.0/apache_beam.io.textio.html">io.textio</a> + </td> + <td><a href="http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext.textFileStream">textFileStream</a> + </td> + </tr> + <tr> + <td>HDFS + </td> + <td><a href="/documentation/sdks/pydoc/2.6.0/apache_beam.io.hadoopfilesystem.html">io.hadoopfilesystem</a> + </td> + <td><a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopConfiguration--">hadoopConfiguration</a> (Access through <code>sc._jsc</code> with Py4J) +and <a href="http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext.textFileStream">textFileStream</a> + </td> + </tr> + <tr> + <td rowspan="2">Object stores + </td> + <td>Google Cloud Storage + </td> + <td><a href="/documentation/sdks/pydoc/2.6.0/apache_beam.io.gcp.gcsio.html">io.gcp.gcsio</a> + </td> + <td rowspan="2"><a href="http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext.textFileStream">textFileStream</a> + </td> + </tr> + <tr> + <td>S3 + </td> + <td>N/A + </td> + </tr> + <tr> + <td rowspan="3">Messaging Queues + </td> + <td>Kafka + </td> + <td>N/A + </td> + <td><a href="http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.kafka.KafkaUtils">KafkaUtils</a> + </td> + </tr> + <tr> + <td>Kinesis + </td> + <td>N/A + </td> + <td><a href="http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#module-pyspark.streaming.kinesis">KinesisUtils</a> + </td> + </tr> + <tr> + <td>Cloud Pub/Sub + </td> + <td><a href="/documentation/sdks/pydoc/2.6.0/apache_beam.io.gcp.pubsub.html">io.gcp.pubsub</a> + </td> + <td>N/A + </td> + </tr> + <tr> + <td>Other + </td> + <td>Custom receivers + </td> + <td><a href="/documentation/sdks/python-custom-io/">BoundedSource and RangeTracker</a> + </td> + <td>N/A + </td> + </tr> +</table> + +<h2 id="connectors-for-other-languages">Connectors for other languages</h2> + +<h3 id="scala">Scala</h3> + +<p>Since Scala code is interoperable with Java and therefore has native compatibility with Java libraries (and vice versa), you can use the same Java connectors described above in your Scala programs. Apache Beam also has a <a href="https://github.com/spotify/scio">Scala API</a> open-sourced <a href="https://labs.spotify.com/2017/10/16/big-data-processing-at-spotify-the-road-to-scio-part-1/">by Spotify</a>.</p> + +<h3 id="go">Go</h3> + +<p>A <a href="/documentation/sdks/go/">Go SDK</a> for Apache Beam is under active development. It is currently experimental and is not recommended for production. Spark does not have an official Go SDK.</p> + +<h3 id="r">R</h3> + +<p>Apache Beam does not have an official R SDK. Spark Structured Streaming is supported by an <a href="https://spark.apache.org/docs/latest/sparkr.html#structured-streaming">R SDK</a>, but only for <a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources">file sources</a> as a streaming input.</p> + +<h2 id="next-steps">Next steps</h2> + +<p>We hope this article inspired you to try new and interesting ways of connecting streaming sources to your Beam pipelines!</p> + +<p>Check out the following links for further information:</p> + +<ul> + <li>See a full list of all built-in and in-progress <a href="/documentation/io/built-in/">I/O Transforms</a> for Apache Beam.</li> + <li>Learn about some Apache Beam mobile gaming pipeline <a href="/get-started/mobile-gaming-example/">examples</a>.</li> +</ul> +</description> + <pubDate>Mon, 20 Aug 2018 01:00:01 -0700</pubDate> + <link>https://beam.apache.org/blog/2018/08/20/review-input-streaming-connectors.html</link> + <guid isPermaLink="true">https://beam.apache.org/blog/2018/08/20/review-input-streaming-connectors.html</guid> + + + <category>blog</category> + + </item> + + <item> <title>Apache Beam 2.6.0</title> <description><p>We are glad to present the new 2.6.0 release of Beam. This release includes multiple fixes and new functionality, such as new features in SQL and portability.<!--more--> @@ -2380,77 +2608,5 @@ hear from you.</p> </item> - <item> - <title>Media recap of the Apache Beam graduation</title> - <description><!-- -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. ---> - -<p>One year ago today Apache Beam was accepted into incubation at the Apache -Software Foundation. The community’s work over the past year culminated, just -over three weeks ago, with an <a href="/blog/2017/01/10/beam-graduates.html">announcement</a> -that Apache Beam has successfully graduated as a new Top-Level Project at the -foundation. Graduation sparked an additional interest in the project, from -corporate endorsements, news articles, interviews, to the volume of traffic to -our website and mailing lists.</p> - -<!--more--> - -<p>Corporate endorsements include Google, PayPal, Talend, data Artisans, and -others. You can read more in the following blog posts:</p> -<ul> - <li>Google: “<a href="https://opensource.googleblog.com/2017/01/apache-beam-graduates.html">Apache Beam graduates to a top-level project</a>” by Tyler Akidau.</li> - <li>Talend: “<a href="https://www.talend.com/blog/2017/01/13/future-apache-beam-now-top-level-apache-software-foundation-project/">The Future of Apache Beam, Now a Top-Level Apache Software Foundation Project</a>” by Jean-Baptiste Onofré.</li> - <li>Talend: “<a href="https://www.talend.com/blog/2017/01/23/apache-beam-way-greater-data-agility/?utm_medium=socialpost&amp;utm_source=twitter&amp;utm_campaign=blog">Apache Beam Your Way to Greater Data Agility</a>” by Shane Kent.</li> - <li>Google: “<a href="https://cloud.google.com/blog/big-data/2017/01/apache-beam-graduates-from-incubation-try-it-today-on-google-cloud-dataflow">Apache Beam graduates from incubation: Try it today on Google Cloud Dataflow</a>” by Frances Perry.</li> -</ul> - -<p>News coverage started with the Apache Software Foundation’s press release in -<a href="https://globenewswire.com/news-release/2017/01/10/904692/0/en/The-Apache-Software-Foundation-Announces-Apache-Beam-as-a-Top-Level-Project.html">Nasdaq GlobeNewswire</a>, -and followed by coverage in many independent outlets. Some of those in English -include:</p> -<ul> - <li>ZDNet: “<a href="http://www.zdnet.com/article/apache-beam-and-spark-new-coopetition-for-squashing-the-lambda-architecture/">Apache Beam and Spark: New coopetition for squashing the Lambda Architecture?</a>” by Tony Baer.</li> - <li>Datanami: “<a href="https://www.datanami.com/2017/01/10/google-lauds-outside-influence-apache-beam/">Google Lauds Outside Influence on Apache Beam</a>” by Alex Woodie.</li> - <li>InfoWorld / JavaWorld: “<a href="http://www.infoworld.com/article/3156598/big-data/apache-beam-unifies-batch-and-streaming-for-big-data.html">Apache Beam unifies batch and streaming for big data</a>” by Serdar Yegulalp, and republished in <a href="http://www.javaworld.com/article/3156598/big-data/apache-beam-unifies-batch-and-streaming-for-big-data.html">JavaWorld</a>.</li> - <li>JAXenter: “<a href="https://jaxenter.com/apache-beam-interview-131314.html">In a way, Apache Beam is the glue that connects many big data systems together</a>” by Kypriani Sinaris.</li> - <li>OStatic: “Apache Beam Unifies Batch and Streaming Data Processing” by Sam Dean. <!-- http://ostatic.com/blog/apache-beam-unifies-batch-and-streaming-data-processing --></li> - <li>Enterprise Apps Today: “<a href="http://www.enterpriseappstoday.com/business-intelligence/data-analytics/apache-beam-graduates-to-help-define-streaming-data-processing.html">Apache Beam Graduates to Help Define Streaming Data Processing</a>” by Sean Michael Kerner.</li> - <li>The Register: “<a href="http://www.theregister.co.uk/2017/01/10/google_must_be_ibeamiing_as_apache_announces_its_new_top_level_projects/">Google must be Beaming as Apache announces its new top-level projects</a>” by Alexander J. Martin.</li> - <li>SiliconANGLE: “<a href="http://siliconangle.com/blog/2017/01/11/apache-software-foundation-announces-2-top-level-projects/">Apache Software Foundation announces two more top-level open source projects</a>” by Mike Wheatley.</li> - <li>SD Times: “<a href="http://sdtimes.com/apache-beam-goes-top-level/">Apache Beam goes top level</a>” by Alex Handy.</li> -</ul> - -<p>Graduation and media coverage helped push Beam website traffic to record levels. -The website traffic, measured in unique sessions per hour, peaked at more than -15 times above the previous week’s numbers. In a steady state, the traffic is -several times larger than before graduation.</p> - -<p>Hopefully these perspectives entice you to join us on this exciting ride, either -as a user or a contributor, as we work towards our first release with API -stability. If you’d like to try out Apache Beam today, check out the latest -<a href="/get-started/downloads/">0.4.0 release</a>. We welcome -contribution and participation from anyone through our mailing lists, issue -tracker, pull requests, and events.</p> -</description> - <pubDate>Wed, 01 Feb 2017 00:00:01 -0800</pubDate> - <link>https://beam.apache.org/blog/2017/02/01/graduation-media-recap.html</link> - <guid isPermaLink="true">https://beam.apache.org/blog/2017/02/01/graduation-media-recap.html</guid> - - - <category>blog</category> - - </item> - </channel> </rss> diff --git a/content/index.html b/content/index.html index 7728b76..33e68cf 100644 --- a/content/index.html +++ b/content/index.html @@ -162,6 +162,11 @@ limitations under the License. </div> <div class="hero__blog__cards"> + <a class="hero__blog__cards__card" href="/blog/2018/08/20/review-input-streaming-connectors.html"> + <div class="hero__blog__cards__card__title">A review of input streaming connectors</div> + <div class="hero__blog__cards__card__date">Aug 20, 2018</div> + </a> + <a class="hero__blog__cards__card" href="/blog/2018/08/10/beam-2.6.0.html"> <div class="hero__blog__cards__card__title">Apache Beam 2.6.0</div> <div class="hero__blog__cards__card__date">Aug 10, 2018</div> @@ -172,11 +177,6 @@ limitations under the License. <div class="hero__blog__cards__card__date">Jun 26, 2018</div> </a> - <a class="hero__blog__cards__card" href="/blog/2018/02/19/beam-2.3.0.html"> - <div class="hero__blog__cards__card__title">Apache Beam 2.3.0</div> - <div class="hero__blog__cards__card__date">Feb 19, 2018</div> - </a> - </div> </div> </div>