Rewrites the section on Coders to not talk about them as a parsing mechanism
Project: http://git-wip-us.apache.org/repos/asf/beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam-site/commit/0d0da026
Tree: http://git-wip-us.apache.org/repos/asf/beam-site/tree/0d0da026
Diff: http://git-wip-us.apache.org/repos/asf/beam-site/diff/0d0da026

Branch: refs/heads/asf-site
Commit: 0d0da0265d8a3ee07493feec835e56efd6acfd85
Parents: 9cc5b22
Author: Eugene Kirpichov <kirpic...@google.com>
Authored: Fri May 12 16:06:09 2017 -0700
Committer: Eugene Kirpichov <kirpic...@google.com>
Committed: Mon May 15 11:28:52 2017 -0700

----------------------------------------------------------------------
 src/documentation/programming-guide.md | 38 ++++++-----------------------
 1 file changed, 8 insertions(+), 30 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/beam-site/blob/0d0da026/src/documentation/programming-guide.md
----------------------------------------------------------------------
diff --git a/src/documentation/programming-guide.md b/src/documentation/programming-guide.md
index 11ec86d..f70e255 100644
--- a/src/documentation/programming-guide.md
+++ b/src/documentation/programming-guide.md
@@ -1175,11 +1175,9 @@ See the [Beam-provided I/O Transforms]({{site.baseurl }}/documentation/io/built
 
 ## <a name="coders"></a>Data encoding and type safety
 
-When you create or output pipeline data, you'll need to specify how the elements in your `PCollection`s are encoded and decoded to and from byte strings. Byte strings are used for intermediate storage as well reading from sources and writing to sinks. The Beam SDKs use objects called coders to describe how the elements of a given `PCollection` should be encoded and decoded.
+When Beam runners execute your pipeline, they often need to materialize the intermediate data in your `PCollection`s, which requires converting elements to and from byte strings. The Beam SDKs use objects called `Coder`s to describe how the elements of a given `PCollection` may be encoded and decoded.
 
-### Using coders
-
-You typically need to specify a coder when reading data into your pipeline from an external source (or creating pipeline data from local data), and also when you output pipeline data to an external sink.
+> Note that coders are unrelated to parsing or formatting data when interacting with external data sources or sinks. Such parsing or formatting should typically be done explicitly, using transforms such as `ParDo` or `MapElements`.
 
 {:.language-java}
 In the Beam SDK for Java, the type `Coder` provides the methods required for encoding and decoding data. The SDK for Java provides a number of Coder subclasses that work with a variety of standard Java types, such as Integer, Long, Double, StringUtf8 and more. You can find all of the available Coder subclasses in the [Coder package](https://github.com/apache/beam/tree/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders).
@@ -1187,38 +1185,18 @@ In the Beam SDK for Java, the type `Coder` provides the methods required for enc
 {:.language-py}
 In the Beam SDK for Python, the type `Coder` provides the methods required for encoding and decoding data. The SDK for Python provides a number of Coder subclasses that work with a variety of standard Python types, such as primitive types, Tuple, Iterable, StringUtf8 and more. You can find all of the available Coder subclasses in the [apache_beam.coders](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/coders) package.
 
-When you read data into a pipeline, the coder indicates how to interpret the input data into a language-specific type, such as integer or string. Likewise, the coder indicates how the language-specific types in your pipeline should be written into byte strings for an output data sink, or to materialize intermediate data in your pipeline.
-
-The Beam SDKs set a coder for every `PCollection` in a pipeline, including those generated as output from a transform. Most of the time, the Beam SDKs can automatically infer the correct coder for an output `PCollection`.
-
 > Note that coders do not necessarily have a 1:1 relationship with types. For
 > example, the Integer type can have multiple valid coders, and input and
 > output data can use different Integer coders. A transform might have
 > Integer-typed input data that uses BigEndianIntegerCoder, and Integer-typed
 > output data that uses VarIntCoder.
 
-You can explicitly set a `Coder` when inputting or outputting a `PCollection`. You set the `Coder` by <span class="language-java">calling the method `.withCoder`</span> <span class="language-py">setting the `coder` argument</span> when you apply your pipeline's read or write transform.
-
-Typically, you set the `Coder` when the coder for a `PCollection` cannot be automatically inferred, or when you want to use a different coder than your pipeline's default. The following example code reads a set of numbers from a text file, and sets a `Coder` of type <span class="language-java">`TextualIntegerCoder`</span> <span class="language-py">`VarIntCoder`</span> for the resulting `PCollection`:
-
-```java
-PCollection<Integer> numbers =
-  p.begin()
-  .apply(TextIO.Read.named("ReadNumbers")
-      .from("gs://my_bucket/path/to/numbers-*.txt")
-      .withCoder(TextualIntegerCoder.of()));
-```
-
-```py
-p = beam.Pipeline()
-numbers = ReadFromText("gs://my_bucket/path/to/numbers-*.txt", coder=VarIntCoder())
-```
+### Specifying coders
+
+The Beam SDKs require a coder for every `PCollection` in your pipeline. In most cases, the Beam SDK is able to automatically infer a `Coder` for a `PCollection` based on its element type or the transform that produces it, however, in some cases the pipeline author will need to specify a `Coder` explicitly, or develop a `Coder` for their custom type.
 
 {:.language-java}
-You can set the coder for an existing `PCollection` by using the method `PCollection.setCoder`. Note that you cannot call `setCoder` on a `PCollection` that has been finalized (e.g. by calling `.apply` on it).
+You can explicitly set the coder for an existing `PCollection` by using the method `PCollection.setCoder`. Note that you cannot call `setCoder` on a `PCollection` that has been finalized (e.g. by calling `.apply` on it).
 
 {:.language-java}
-You can get the coder for an existing `PCollection` by using the method `getCoder`. This method will fail with `anIllegalStateException` if a coder has not been set and cannot be inferred for the given `PCollection`.
-
-### Coder inference and default coders
+You can get the coder for an existing `PCollection` by using the method `getCoder`. This method will fail with an `IllegalStateException` if a coder has not been set and cannot be inferred for the given `PCollection`.
 
-The Beam SDKs require a coder for every `PCollection` in your pipeline. Most of the time, however, you do not need to explicitly specify a coder, such as for an intermediate `PCollection` produced by a transform in the middle of your pipeline. In such cases, the Beam SDKs can infer an appropriate coder from the inputs and outputs of the transform used to produce the PCollection.
+Beam SDKs use a variety of mechanisms when attempting to automatically infer the `Coder` for a `PCollection`.
 
 {:.language-java}
 Each pipeline object has a `CoderRegistry`. The `CoderRegistry` represents a mapping of Java types to the default coders that the pipeline should use for `PCollection`s of each type.
@@ -1227,7 +1205,7 @@ Each pipeline object has a `CoderRegistry`. The `CoderRegistry` represents a map
 {:.language-py}
 The Beam SDK for Python has a `CoderRegistry` that represents a mapping of Python types to the default coder that should be used for `PCollection`s of each type.
 
 {:.language-java}
-By default, the Beam SDK for Java automatically infers the `Coder` for the elements of an output `PCollection` using the type parameter from the transform's function object, such as `DoFn`. In the case of `ParDo`, for example, a `DoFn<Integer, String>function` object accepts an input element of type `Integer` and produces an output element of type `String`. In such a case, the SDK for Java will automatically infer the default `Coder` for the output `PCollection<String>` (in the default pipeline `CoderRegistry`, this is `StringUtf8Coder`).
+By default, the Beam SDK for Java automatically infers the `Coder` for the elements of a `PCollection` produced by a `PTransform` using the type parameter from the transform's function object, such as `DoFn`. In the case of `ParDo`, for example, a `DoFn<Integer, String>` function object accepts an input element of type `Integer` and produces an output element of type `String`. In such a case, the SDK for Java will automatically infer the default `Coder` for the output `PCollection<String>` (in the default pipeline `CoderRegistry`, this is `StringUtf8Coder`).
 
 {:.language-py}
 By default, the Beam SDK for Python automatically infers the `Coder` for the elements of an output `PCollection` using the typehints from the transform's function object, such as `DoFn`. In the case of `ParDo`, for example a `DoFn` with the typehints `@beam.typehints.with_input_types(int)` and `@beam.typehints.with_output_types(str)` accepts an input element of type int and produces an output element of type str. In such a case, the Beam SDK for Python will automatically infer the default `Coder` for the output `PCollection` (in the default pipeline `CoderRegistry`, this is `BytesCoder`).
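The patched text frames a `Coder` strictly as an encode/decode contract between elements and byte strings (as opposed to a parsing mechanism). As a minimal standalone sketch of that contract, here is a plain-Python illustration that does not use the Beam SDK; the class name merely mirrors Beam's `VarIntCoder`, and the encoding shown (little-endian base-128 varint) is one common choice for compactly encoding integers:

```python
class VarIntCoder:
    """Sketch of a coder's encode/decode contract: convert a non-negative
    int to and from a byte string (little-endian base-128 varint).
    Not the Beam SDK's implementation; an illustration of the idea only."""

    def encode(self, value: int) -> bytes:
        out = bytearray()
        while True:
            bits = value & 0x7F       # low 7 bits of the remaining value
            value >>= 7
            # set the high bit on every byte except the last one
            out.append(bits | (0x80 if value else 0))
            if not value:
                return bytes(out)

    def decode(self, encoded: bytes) -> int:
        result = shift = 0
        for byte in encoded:
            result |= (byte & 0x7F) << shift
            shift += 7
        return result


if __name__ == "__main__":
    coder = VarIntCoder()
    # Small values encode to fewer bytes than a fixed-width representation.
    print(coder.encode(300))                        # b'\xac\x02'
    print(coder.decode(coder.encode(300)))          # 300
```

The round trip (`decode(encode(x)) == x`) is the essential property a runner relies on when it materializes intermediate `PCollection` data; any parsing of external text or file formats, by contrast, belongs in an explicit transform such as `ParDo` or `MapElements`, as the patch's note states.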