[ 
https://issues.apache.org/jira/browse/BEAM-6347?focusedWorklogId=182141&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-182141
 ]

ASF GitHub Bot logged work on BEAM-6347:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 08/Jan/19 02:43
            Start Date: 08/Jan/19 02:43
    Worklog Time Spent: 10m 
      Work Description: chamikaramj commented on pull request #7397: 
[BEAM-6347] Add website page for developing I/O connectors for Java
URL: https://github.com/apache/beam/pull/7397#discussion_r245861647
 
 

 ##########
 File path: website/src/documentation/io/developing-io-java.md
 ##########
 @@ -0,0 +1,372 @@
+---
+layout: section
+title: "Apache Beam: Developing I/O connectors for Java"
+section_menu: section-menu/documentation.html
+permalink: /documentation/io/developing-io-java/
+redirect_from: /documentation/io/authoring-java/
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# Developing I/O connectors for Java
+
+To connect to a data store that isn’t supported by Beam’s existing I/O
+connectors, you must create a custom I/O connector that usually consists of a
+source and a sink. All Beam sources and sinks are composite transforms; however,
+the implementation of your custom I/O depends on your use case. Before you
+start, read the
+[new I/O connector overview]({{ site.baseurl }}/documentation/io/developing-io-overview/)
+for an overview of developing a new I/O connector, the available implementation
+options, and how to choose the right option for your use case.
+
+This guide covers using the `Source` and `FileBasedSink` interfaces in Java.
+The Python SDK offers the same functionality, but uses a slightly different API.
+See [Developing I/O connectors for Python]({{ site.baseurl }}/documentation/io/developing-io-python/)
+for information specific to the Python SDK.
+
+## Basic code requirements {#basic-code-reqs}
+
+Beam runners use the classes you provide to read and/or write data using
+multiple worker instances in parallel. As such, the code you provide for
+`Source` and `FileBasedSink` subclasses must meet some basic requirements:
+
+  1. **Serializability:** Your `Source` or `FileBasedSink` subclass, whether
+     bounded or unbounded, must be Serializable. A runner might create multiple
+     instances of your `Source` or `FileBasedSink` subclass to be sent to
+     multiple remote workers to facilitate reading or writing in parallel.  
+
+  1. **Immutability:**
+     Your `Source` or `FileBasedSink` subclass must be effectively immutable.
+     All private fields must be declared final, and all private variables of
+     collection type must be effectively immutable. If your class has setter
+     methods, those methods must return an independent copy of the object with
+     the relevant field modified.  
+
+     You should only use mutable state in your `Source` or `FileBasedSink`
+     subclass if you are using lazy evaluation of expensive computations that
+     you need to implement the source or sink; in that case, you must declare
+     all mutable instance variables transient.  
+
+  1. **Thread-Safety:** Your code must be thread-safe. If you build your source
+     to work with dynamic work rebalancing, it is critical that you make your
+     code thread-safe. The Beam SDK provides a helper class to make this easier.
+     See [Using Your BoundedSource with dynamic work rebalancing](#bounded-dynamic)
+     for more details.  
+
+  1. **Testability:** It is critical to exhaustively unit test all of your
+     `Source` and `FileBasedSink` subclasses, especially if you build your
+     classes to work with advanced features such as dynamic work rebalancing. A
+     minor implementation error can lead to data corruption or data loss (such
+     as skipping or duplicating records) that can be hard to detect.  
+
+     To assist in testing `BoundedSource` implementations, you can use the
+     `SourceTestUtils` class. `SourceTestUtils` contains utilities for
+     automatically verifying some of the properties of your `BoundedSource`
+     implementation. You can use `SourceTestUtils` to increase your
+     implementation's test coverage using a wide range of inputs with
+     relatively few lines of code. For examples that use `SourceTestUtils`, see the
+     [AvroSourceTest](https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/io/AvroSourceTest.java) and
+     [TextIOReadTest](https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/io/TextIOReadTest.java)
+     source code.
+
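The immutability requirement above can be sketched in a few lines. This is a minimal standalone illustration, not the Beam API: the class name and its fields (modeled loosely on the host/query arguments mentioned later for `DatastoreIO`) are hypothetical. All fields are final, and the "setter" returns an independent copy:

```java
import java.io.Serializable;

// Hypothetical source configuration: effectively immutable and Serializable.
// A "setter" returns a new, independent instance instead of mutating this one.
final class SourceConfig implements Serializable {
  private final String host;
  private final String query;

  SourceConfig(String host, String query) {
    this.host = host;
    this.query = query;
  }

  // Returns a copy with only the query changed; the original is untouched.
  SourceConfig withQuery(String query) {
    return new SourceConfig(this.host, query);
  }

  String getHost() { return host; }
  String getQuery() { return query; }
}
```

Because every field is final and modification produces a fresh instance, a runner can safely serialize and ship copies of the object to many workers.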
+In addition, see the [PTransform style guide]({{ site.baseurl }}/contribute/ptransform-style-guide/)
+for Beam's transform style guidance.
+
+## Implementing the Source interface
+
+To create a data source for your pipeline, you must provide the format-specific
+logic that tells a runner how to read data from your input source, and how to
+split your data source into multiple parts so that multiple worker instances can
+read your data in parallel. If you're creating a data source that reads
+unbounded data, you must provide additional logic for managing your source's
+watermark and optional checkpointing.
+
+Supply the logic for your source by creating the following classes:
+
+  * A subclass of `BoundedSource` if you want to read a finite (batch) data set,
+    or a subclass of `UnboundedSource` if you want to read an infinite (streaming)
+    data set. These subclasses describe the data you want to read, including the
+    data's location and parameters (such as how much data to read).  
+
+  * A subclass of `Source.Reader`. Each `Source` must have an associated `Reader`
+    that captures all the state involved in reading from that `Source`. This can
+    include things like file handles, RPC connections, and other parameters that
+    depend on the specific requirements of the data format you want to read.  
+
+  * The `Reader` class hierarchy mirrors the `Source` hierarchy. If you're
+    extending `BoundedSource`, you'll need to provide an associated
+    `BoundedReader`. If you're extending `UnboundedSource`, you'll need to
+    provide an associated `UnboundedReader`.
+
+  * One or more user-facing wrapper composite transforms (`PTransform`) that
+    wrap read operations. [PTransform wrappers](#ptransform-wrappers) discusses
+    why you should avoid exposing your sources.
+
+
+### Implementing the Source subclass
+
+You must create a subclass of either `BoundedSource` or `UnboundedSource`,
+depending on whether your data is a finite batch or an infinite stream. In
+either case, your `Source` subclass must override the abstract methods in the
+superclass. A runner might call these methods when using your data source. For
+example, when reading from a bounded source, a runner uses these methods to
+estimate the size of your data set and to split it up for parallel reading.
+
+Your `Source` subclass should also manage basic information about your data
+source, such as the location. For example, the example `Source` implementation
+in Beam’s [DatastoreIO](https://beam.apache.org/releases/javadoc/current/index.html?org/apache/beam/sdk/io/gcp/datastore/DatastoreIO.html)
+class takes host, datasetID, and query as arguments. The connector uses these
+values to obtain data from Cloud Datastore.
+
+#### BoundedSource
+
+`BoundedSource` represents a finite data set from which a Beam runner may read,
+possibly in parallel. `BoundedSource` contains a set of abstract methods that
+the runner uses to split the data set for reading by multiple workers.
+
+To implement a `BoundedSource`, your subclass must override the following
+abstract methods:
+
+  * `splitIntoBundles`: The runner uses this method to split your finite data
+    into bundles of a given size.  
+
+  * `getEstimatedSizeBytes`: The runner uses this method to estimate the total
+    size of your data, in bytes.  
+
+  * `producesSortedKeys`: A method to tell the runner whether your source
+    produces key/value pairs in sorted order. If your source doesn't produce
+    key/value pairs, your implementation of this method must return false.  
+
+  * `createReader`: Creates the associated `BoundedReader` for this
+    `BoundedSource`.
+
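In the spirit of `splitIntoBundles`, the core splitting arithmetic can be sketched with plain byte offsets. This is a standalone illustration under assumed names (`BundleSplitter`, `split`), not the Beam API: divide `[0, totalSizeBytes)` into contiguous ranges of at most the desired bundle size:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of offset-based splitting: each long[] is a
// {startOffset, endOffset} pair, with endOffset exclusive.
final class BundleSplitter {
  static List<long[]> split(long totalSizeBytes, long desiredBundleSizeBytes) {
    List<long[]> bundles = new ArrayList<>();
    for (long start = 0; start < totalSizeBytes; start += desiredBundleSizeBytes) {
      // The final bundle may be smaller than the desired size.
      long end = Math.min(start + desiredBundleSizeBytes, totalSizeBytes);
      bundles.add(new long[] {start, end});
    }
    return bundles;
  }
}
```

For example, splitting 100 bytes into bundles of 30 yields four ranges, the last covering offsets 90 through 100. A real source would pair each range with a copy of itself configured to read only that range.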
+You can see a model of how to implement `BoundedSource` and the required
+abstract methods in Beam’s implementations for Cloud Bigtable
+([BigtableIO.java](https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.java))
+and BigQuery ([BigQuerySourceBase.java](https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQuerySourceBase.java)).
+
+#### UnboundedSource
+
+`UnboundedSource` represents an infinite data stream from which the runner may
+read, possibly in parallel. `UnboundedSource` contains a set of abstract methods
+that the runner uses to support streaming reads in parallel; these include
+*checkpointing* for failure recovery, *record IDs* to prevent data duplication,
+and *watermarking* for estimating data completeness in downstream parts of your
+pipeline.
+
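As a rough, standalone illustration of the watermarking idea (the class and method names here are hypothetical, not the Beam API): an unbounded reader can report, as its watermark, the latest event time it has observed minus a bound on how far out of order records may arrive:

```java
// Standalone sketch: a watermark estimator that assumes records arrive
// at most maxDelayMillis out of event-time order.
final class SimpleWatermarkEstimator {
  private final long maxDelayMillis;
  private long maxEventTimeMillis = Long.MIN_VALUE;

  SimpleWatermarkEstimator(long maxDelayMillis) {
    this.maxDelayMillis = maxDelayMillis;
  }

  // Called for each record read from the stream.
  void observe(long eventTimeMillis) {
    maxEventTimeMillis = Math.max(maxEventTimeMillis, eventTimeMillis);
  }

  // No record with an earlier event time is expected to appear after this.
  long currentWatermarkMillis() {
    return maxEventTimeMillis == Long.MIN_VALUE
        ? Long.MIN_VALUE  // nothing observed yet; watermark is at the minimum
        : maxEventTimeMillis - maxDelayMillis;
  }
}
```

Downstream windowing logic can then treat any window entirely below the watermark as complete; choosing the delay bound trades latency against the risk of dropping late data.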
+To implement an `UnboundedSource`, your subclass must override the following
+abstract methods:
+
+  * `generateInitialSplits`: The runner uses this method to generate a list of
 
 Review comment:
   This was also renamed to split(): 
https://github.com/apache/beam/blob/release-2.9.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/UnboundedSource.java#L69
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 182141)

> Add page for developing I/O connectors for Java
> -----------------------------------------------
>
>                 Key: BEAM-6347
>                 URL: https://issues.apache.org/jira/browse/BEAM-6347
>             Project: Beam
>          Issue Type: Bug
>          Components: website
>            Reporter: Melissa Pashniak
>            Assignee: Melissa Pashniak
>            Priority: Minor
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
