Hello Everyone,
*For those new to Beam, even if this is your first day, consider yourself
a welcome contributor to this conversation. I remember what it was like
to first learn Beam on my own, and I am passionate about everyone's learning
experience. Below are definitions/references and a suggested learning path
to help you understand this email.*
*Short Version (assumes Beam Java SDK knowledge)*: A recent pull request [1]
completes write-side CSV support using both FileIO.Write [2] and Apache
Commons CSV [3], and adds a new PayloadSerializerProvider [4]. This work
is a prerequisite for the File Write Schema Transform [5].
*Long Version (for those first learning Beam)*:
The referenced pull request [1] enables us to write the data elements in our
pipeline to file and object systems in CSV format. In Beam, we use a
Schema [6] to describe the properties and data types of the elements.
Typically, those Schema-described data elements are Rows [7], but they don't
have to be. The Java SDK provides ways for us to generate a Schema from
our own custom classes. The pull request supports both Rows and such
user-defined Java classes.
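
To make the idea of "a Schema generated from a custom class" concrete, here is a minimal sketch in plain Java. It is not how Beam infers schemas (the Java SDK has its own schema providers and annotations); it only illustrates the concept of deriving field names and types from a class. The class SchemaSketch, the method describeFields, and the fields of MyInterestingClass are all hypothetical names made up for this illustration.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SchemaSketch {

  /** Hypothetical user-defined class; its fields are invented for this example. */
  public static class MyInterestingClass {
    public String name;
    public int count;
    public double score;
  }

  /**
   * Lists "fieldName:typeName" for each instance field of the class.
   * This is a stand-in for what a Schema describes, not Beam's inference logic.
   */
  public static List<String> describeFields(Class<?> clazz) {
    List<String> described = new ArrayList<>();
    for (Field field : clazz.getDeclaredFields()) {
      // Skip static and compiler-generated fields; they are not element properties.
      if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
        continue;
      }
      described.add(field.getName() + ":" + field.getType().getSimpleName());
    }
    // Reflection does not guarantee field order, so sort for a stable result.
    Collections.sort(described);
    return described;
  }

  public static void main(String[] args) {
    System.out.println(describeFields(MyInterestingClass.class));
  }
}
```

In Beam itself you would not write this by hand; the point is only that a class already carries enough information (names and types) for a Schema to be derived from it.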
When the pull request merges, we will be able to write Rows:
PCollection<Row> data = ...
data.apply(CsvIO.writeRows().to("..."));
Or, for my own custom Java class:
PCollection<MyInterestingClass> data = ...
data.apply(CsvIO.<MyInterestingClass>write().to("..."));
Under the hood, the transform auto-generates the header from the Schema and
uses the Apache Commons CSV library [3] to convert the data elements into CSV
lines in the files. Beam file writing typically produces multiple files,
called shards, instead of a single file. Therefore, the header is written at
the beginning of each shard.
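
The "header at the beginning of each shard" behavior can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the transform's implementation: the class ShardedCsvSketch and the method writeShards are hypothetical, shards are modeled as in-memory strings rather than files, rows are assigned round-robin (in Beam the runner decides how elements are sharded), and values are joined with commas without the quoting and escaping that Apache Commons CSV would handle.

```java
import java.util.ArrayList;
import java.util.List;

public class ShardedCsvSketch {

  /**
   * Returns the text content of each shard. Every shard starts with the same
   * header line, followed by the rows assigned to that shard.
   */
  public static List<String> writeShards(
      List<String> header, List<List<String>> rows, int numShards) {
    String headerLine = String.join(",", header);

    // Start every shard with its own copy of the header.
    List<StringBuilder> shards = new ArrayList<>();
    for (int i = 0; i < numShards; i++) {
      shards.add(new StringBuilder(headerLine).append('\n'));
    }

    // Assumption for this sketch: distribute rows round-robin across shards.
    for (int i = 0; i < rows.size(); i++) {
      String line = String.join(",", rows.get(i)); // no quoting/escaping here
      shards.get(i % numShards).append(line).append('\n');
    }

    List<String> contents = new ArrayList<>();
    for (StringBuilder shard : shards) {
      contents.add(shard.toString());
    }
    return contents;
  }

  public static void main(String[] args) {
    List<String> header = List.of("id", "name");
    List<List<String>> rows =
        List.of(List.of("1", "a"), List.of("2", "b"), List.of("3", "c"));
    List<String> shards = writeShards(header, rows, 2);
    for (int i = 0; i < shards.size(); i++) {
      System.out.println("--- shard " + i + " ---");
      System.out.print(shards.get(i));
    }
  }
}
```

A practical consequence: anyone who concatenates the shard files afterwards needs to drop the repeated header lines.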
*Definitions/References*:
1. https://github.com/apache/beam/pull/24630
2. *FileIO.Write* - A PTransform for writing to file and object systems.
See
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/FileIO.Write.html
3. *Apache Commons CSV* - a library that reads and writes files in variations
of the Comma Separated Value (CSV) format.
See https://commons.apache.org/proper/commons-csv/
4. *PayloadSerializerProvider* - a provider for a PayloadSerializer based
on a Schema and optional parameters.
See
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/io/payloads/PayloadSerializerProvider.html
5. *File Write Schema Transform* - a Schema Transform implementation for
writing to file and object systems. The link below contains a deeper
explanation and suggested learning guide.
See
https://docs.google.com/document/d/1IOZrQ4qQrUS2WwQhadN35vX4AzhG4dyXMk1J-R1qJ9c/edit?usp=sharing
6. *Schema* - the way we describe data elements' properties and their data
types in Beam.
See
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/Schema.html
7. *Row* - the data element described by a Schema [6].
See
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/values/Row.html
*Suggested Learning Path To Understand This Email*:
1. https://beam.apache.org/documentation/programming-guide/#overview
2.
https://beam.apache.org/documentation/programming-guide/#pcollection-characteristics
3. https://beam.apache.org/documentation/programming-guide/#transforms (up
to section 4.1)
4.
https://beam.apache.org/documentation/programming-guide/#file-based-writing-multiple-files
5. https://beam.apache.org/documentation/programming-guide/#schemas