[GitHub] [beam] alxp1982 commented on a diff in pull request #24488: add schema-based transforms

GitBox Tue, 20 Dec 2022 23:52:03 -0800


alxp1982 commented on code in PR #24488:
URL: https://github.com/apache/beam/pull/24488#discussion_r1053877896



##########
learning/tour-of-beam/learning-content/java/schema-based-transforms/schema-concept/creating-schema/description.md:
##########
@@ -0,0 +1,153 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Overview
+
+Most structured records share some common characteristics:
+
+→  They can be subdivided into separate named fields. Fields usually have 
string names, but sometimes - as in the case of indexed tuples - have numerical 
indices instead.

Review Comment:
   List isn't rendered correctly



##########
learning/tour-of-beam/learning-content/java/schema-based-transforms/schema-concept/creating-schema/description.md:
##########
@@ -0,0 +1,153 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Overview
+
+Most structured records share some common characteristics:
+
+→  They can be subdivided into separate named fields. Fields usually have 
string names, but sometimes - as in the case of indexed tuples - have numerical 
indices instead.
+
+→  There is a confined list of primitive types that a field can have. These 
often match primitive types in most programming languages: int, long, string, 
etc.
+
+→  Often a field type can be marked as optional (sometimes referred to as 
nullable) or required.
+
+Often records have a nested structure. A nested structure occurs when a field 
itself has subfields so the type of the field itself has a schema. Fields that 
are array or map types is also a common feature of these structured records.
+
+For example, consider the following schema, representing actions in a 
fictitious e-commerce company:
+
+**Purchase**
+
+```
+Field Name              Field Type
+userId                  STRING
+itemId                  INT64
+shippingAddress         ROW(ShippingAddress)
+cost                    INT64
+transactions            ARRAY[ROW(Transaction)]
+```
+
+**ShippingAddress**
+
+```
+Field Name              Field Type
+streetAddress           STRING
+city                    STRING
+state                   nullable STRING
+country                 STRING
+postCode                STRING
+```
+
+**Transaction**
+
+```
+Field Name              Field Type
+bank                    STRING
+purchaseAmount          DOUBLE
+```
+
+Schemas provide us a type-system for Beam records that is independent of any 
specific programming-language type. There might be multiple Java classes that 
all have the same schema (for example a Protocol-Buffer class or a POJO class), 
and Beam will allow us to seamlessly convert between these types. Schemas also 
provide a simple way to reason about types across different 
programming-language APIs.

Review Comment:
   Schemas provide us with a type system for Beam records that is independent 
of any specific programming-language type. There might be multiple types of 
Java objects that all have the same schema. For example, you can implement the 
same schema as a Protocol-Buffer class or as a POJO class. 
   
   Schemas also provide a simple way to reason about types across different 
programming-language APIs.



##########
learning/tour-of-beam/learning-content/java/schema-based-transforms/schema-concept/creating-schema/description.md:
##########
@@ -0,0 +1,153 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Overview
+
+Most structured records share some common characteristics:
+
+→  They can be subdivided into separate named fields. Fields usually have 
string names, but sometimes - as in the case of indexed tuples - have numerical 
indices instead.
+
+→  There is a confined list of primitive types that a field can have. These 
often match primitive types in most programming languages: int, long, string, 
etc.
+
+→  Often a field type can be marked as optional (sometimes referred to as 
nullable) or required.
+
+Often records have a nested structure. A nested structure occurs when a field 
itself has subfields so the type of the field itself has a schema. Fields that 
are array or map types is also a common feature of these structured records.
+
+For example, consider the following schema, representing actions in a 
fictitious e-commerce company:
+
+**Purchase**
+
+```
+Field Name              Field Type
+userId                  STRING
+itemId                  INT64
+shippingAddress         ROW(ShippingAddress)
+cost                    INT64
+transactions            ARRAY[ROW(Transaction)]
+```
+
+**ShippingAddress**
+
+```
+Field Name              Field Type
+streetAddress           STRING
+city                    STRING
+state                   nullable STRING
+country                 STRING
+postCode                STRING
+```
+
+**Transaction**
+
+```
+Field Name              Field Type
+bank                    STRING
+purchaseAmount          DOUBLE
+```
+

Review Comment:
   In this example, the `shippingAddress` and `transactions` fields of the 
`Purchase` type are nested types: `ShippingAddress` and `Transaction`, 
respectively. 
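
   To make the nesting concrete, the same shape can be sketched in plain Java. 
This is only an illustration, not text proposed for the lesson: the class and 
field names mirror the schema tables above, the Beam annotations are left out 
so the snippet compiles without the Beam SDK, and `NestedPurchaseDemo` and the 
sample values are made up.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the Purchase schema; it has no Beam dependency.

// Corresponds to the nested row type ROW(ShippingAddress).
final class ShippingAddress {
  final String streetAddress;
  final String city;
  final String state; // the only nullable field in the schema table
  final String country;
  final String postCode;

  ShippingAddress(String streetAddress, String city, String state,
      String country, String postCode) {
    this.streetAddress = streetAddress;
    this.city = city;
    this.state = state;
    this.country = country;
    this.postCode = postCode;
  }
}

// Corresponds to the element type of ARRAY[ROW(Transaction)].
final class Transaction {
  final String bank;
  final double purchaseAmount;

  Transaction(String bank, double purchaseAmount) {
    this.bank = bank;
    this.purchaseAmount = purchaseAmount;
  }
}

// Top-level record: two of its fields are themselves structured types.
final class Purchase {
  final String userId;
  final long itemId;
  final ShippingAddress shippingAddress; // nested row
  final long cost;
  final List<Transaction> transactions;  // array of nested rows

  Purchase(String userId, long itemId, ShippingAddress shippingAddress,
      long cost, List<Transaction> transactions) {
    this.userId = userId;
    this.itemId = itemId;
    this.shippingAddress = shippingAddress;
    this.cost = cost;
    this.transactions = transactions;
  }
}

// Hypothetical helper that builds one sample value (all data is invented).
class NestedPurchaseDemo {
  static Purchase samplePurchase() {
    ShippingAddress address =
        new ShippingAddress("1 Main St", "Springfield", null, "US", "00000");
    List<Transaction> transactions =
        Arrays.asList(new Transaction("Bank A", 12.5), new Transaction("Bank B", 7.5));
    return new Purchase("user-1", 42L, address, 2000L, transactions);
  }

  public static void main(String[] args) {
    Purchase p = samplePurchase();
    // Nested fields are reached by walking through the outer record.
    System.out.println(p.shippingAddress.city + ", " + p.transactions.size() + " transactions");
  }
}
```

   Reading `p.shippingAddress.city` walks from the outer record into the 
nested one, which is the relationship the suggested sentence describes.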



##########
learning/tour-of-beam/learning-content/java/schema-based-transforms/schema-concept/creating-schema/description.md:
##########
@@ -0,0 +1,153 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Overview
+
+Most structured records share some common characteristics:
+
+→  They can be subdivided into separate named fields. Fields usually have 
string names, but sometimes - as in the case of indexed tuples - have numerical 
indices instead.
+
+→  There is a confined list of primitive types that a field can have. These 
often match primitive types in most programming languages: int, long, string, 
etc.
+
+→  Often a field type can be marked as optional (sometimes referred to as 
nullable) or required.
+
+Often records have a nested structure. A nested structure occurs when a field 
itself has subfields so the type of the field itself has a schema. Fields that 
are array or map types is also a common feature of these structured records.
+
+For example, consider the following schema, representing actions in a 
fictitious e-commerce company:
+
+**Purchase**
+
+```
+Field Name              Field Type
+userId                  STRING
+itemId                  INT64
+shippingAddress         ROW(ShippingAddress)
+cost                    INT64
+transactions            ARRAY[ROW(Transaction)]
+```
+
+**ShippingAddress**
+
+```
+Field Name              Field Type
+streetAddress           STRING
+city                    STRING
+state                   nullable STRING
+country                 STRING
+postCode                STRING
+```
+
+**Transaction**
+
+```
+Field Name              Field Type
+bank                    STRING
+purchaseAmount          DOUBLE
+```
+
+Schemas provide us a type-system for Beam records that is independent of any 
specific programming-language type. There might be multiple Java classes that 
all have the same schema (for example a Protocol-Buffer class or a POJO class), 
and Beam will allow us to seamlessly convert between these types. Schemas also 
provide a simple way to reason about types across different 
programming-language APIs.
+
+A `PCollection` with a schema does not need to have a `Coder` specified, as 
Beam knows how to encode and decode Schema rows; Beam uses a special coder to 
encode schema types.
+
+### Creating Schemas
+
+While schemas themselves are language independent, they are designed to embed 
naturally into the programming languages of the Beam SDK being used. This 
allows Beam users to continue using native types while reaping the advantage of 
having Beam understand their element schemas.
+
+In Java you could use the following set of classes to represent the purchase 
schema. Beam will automatically infer the correct schema based on the members 
of the class.
+
+#### Java POJOs
+
+A `POJO` (Plain Old Java Object) is a Java object that is not bound by any 
restriction other than the Java Language Specification. A `POJO` can contain 
member variables that are primitives, that are other POJOs, or are collections 
maps or arrays thereof. `POJO`s do not have to extend prespecified classes or 
extend any specific interfaces.
+
+If a `POJO` class is annotated with `@DefaultSchema(JavaFieldSchema.class)`, 
Beam will automatically infer a schema for this class. Nested classes are 
supported as are classes with List, array, and Map fields.
+
+For example, annotating the following class tells Beam to infer a schema from 
this `POJO` class and apply it to any `PCollection<TransactionPojo>`.
+
+```
+@DefaultSchema(JavaFieldSchema.class)
+public class TransactionPojo {
+  public final String bank;
+  public final double purchaseAmount;
+  @SchemaCreate
+  public TransactionPojo(String bank, double purchaseAmount) {
+    this.bank = bank;
+    this.purchaseAmount = purchaseAmount;
+  }
+}
+// Beam will automatically infer the correct schema for this PCollection. No 
coder is needed as a result.
+PCollection<TransactionPojo> pojos = readPojos();
+```
+The `@SchemaCreate` annotation tells Beam that this constructor can be used to 
create instances of `TransactionPojo`, assuming that constructor parameters 
have the same names as the field names. `@SchemaCreate` can also be used to 
annotate static factory methods on the class, allowing the constructor to 
remain private. If there is no @SchemaCreate annotation then all the fields 
must be non-final and the class must have a zero-argument constructor.
+
+There are a couple of other useful annotations that affect how Beam infers 
schemas. By default the schema field names inferred will match that of the 
class field names. However `@SchemaFieldName` can be used to specify a 
different name to be used for the schema field. @SchemaIgnore can be used to 
mark specific class fields as excluded from the inferred schema. For example, 
it’s common to have ephemeral fields in a class that should not be included in 
a schema (e.g. caching the hash value to prevent expensive recomputation of the 
hash), and @SchemaIgnore can be used to exclude these fields. Note that ignored 
fields will not be included in the encoding of these records.
+
+In some cases it is not convenient to annotate the POJO class, for example if 
the POJO is in a different package that is not owned by the Beam pipeline 
author. In these cases the schema inference can be triggered programmatically 
in pipeline’s main function as follows:
+
+```
+pipeline.getSchemaRegistry().registerPOJO(TransactionPOJO.class);
+```
+
+#### Java Beans
+
+Java Beans are a de-facto standard for creating reusable property classes in 
Java. While the full standard has many characteristics, the key ones are that 
all properties are accessed via getter and setter classes, and the name format 
for these getters and setters is standardized. A Java Bean class can be 
annotated with `@DefaultSchema(JavaBeanSchema.class)` and Beam will 
automatically infer a schema for this class.
+
+The `@SchemaCreate` annotation can be used to specify a constructor or a 
static factory method, in which case the setters and zero-argument constructor 
can be omitted.
+
+```
+@DefaultSchema(JavaBeanSchema.class)
+public class Purchase {
+  public String getUserId();  // Returns the id of the user who made the 
purchase.
+  public long getItemId();  // Returns the identifier of the item that was 
purchased.
+  public ShippingAddress getShippingAddress();  // Returns the shipping 
address, a nested type.
+  public long getCostCents();  // Returns the cost of the item.
+  public List<Transaction> getTransactions();  // Returns the transactions 
that paid for this purchase (returns a list, since the purchase might be spread 
out over multiple credit cards).
+
+  @SchemaCreate
+  public Purchase(String userId, long itemId, ShippingAddress shippingAddress, 
long costCents, List<Transaction> transactions) {
+      ...
+  }
+}
+```
+
+`@SchemaFieldName` and `@SchemaIgnore` can be used to alter the schema 
inferred, just like with `POJO` classes.
+
+#### AutoValue
+
+Java value classes are notoriously difficult to generate correctly. There is a 
lot of boilerplate you must create in order to properly implement a value 
class. `AutoValue` is a popular library for easily generating such classes by 
implementing a simple abstract base class.
+
+Beam can infer a schema from an `AutoValue` class. For example:
+
+```
+@DefaultSchema(AutoValueSchema.class)
+@AutoValue
+public abstract class ShippingAddress {
+  public abstract String streetAddress();
+  public abstract String city();
+  public abstract String state();
+  public abstract String country();
+  public abstract String postCode();
+}
+```
+
+This is all that’s needed to generate a simple `AutoValue` class, and the 
above `@DefaultSchema` annotation tells Beam to infer a schema from it. This 
also allows AutoValue elements to be used inside of `PCollections`.
+
+`@SchemaFieldName` and `@SchemaIgnore` can be used to alter the schema 
inferred.
+
+### Playground exercise
+
+You can find the complete code of this example in the playground window you 
can run and experiment with.
+
+One of the differences you will notice is that it also contains the part to 
output `PCollection` elements to the console.

Review Comment:
   The example is empty. Where is the challenge? 



##########
learning/tour-of-beam/learning-content/java/schema-based-transforms/schema-concept/creating-schema/description.md:
##########
@@ -0,0 +1,153 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Overview
+
+Most structured records share some common characteristics:
+
+→  They can be subdivided into separate named fields. Fields usually have 
string names, but sometimes - as in the case of indexed tuples - have numerical 
indices instead.
+
+→  There is a confined list of primitive types that a field can have. These 
often match primitive types in most programming languages: int, long, string, 
etc.
+
+→  Often a field type can be marked as optional (sometimes referred to as 
nullable) or required.
+
+Often records have a nested structure. A nested structure occurs when a field 
itself has subfields so the type of the field itself has a schema. Fields that 
are array or map types is also a common feature of these structured records.
+
+For example, consider the following schema, representing actions in a 
fictitious e-commerce company:
+
+**Purchase**
+
+```
+Field Name              Field Type
+userId                  STRING
+itemId                  INT64
+shippingAddress         ROW(ShippingAddress)
+cost                    INT64
+transactions            ARRAY[ROW(Transaction)]
+```
+
+**ShippingAddress**
+
+```
+Field Name              Field Type
+streetAddress           STRING
+city                    STRING
+state                   nullable STRING
+country                 STRING
+postCode                STRING
+```
+
+**Transaction**
+
+```
+Field Name              Field Type
+bank                    STRING
+purchaseAmount          DOUBLE
+```
+
+Schemas provide us a type-system for Beam records that is independent of any 
specific programming-language type. There might be multiple Java classes that 
all have the same schema (for example a Protocol-Buffer class or a POJO class), 
and Beam will allow us to seamlessly convert between these types. Schemas also 
provide a simple way to reason about types across different 
programming-language APIs.
+
+A `PCollection` with a schema does not need to have a `Coder` specified, as 
Beam knows how to encode and decode Schema rows; Beam uses a special coder to 
encode schema types.
+
+### Creating Schemas
+
+While schemas themselves are language independent, they are designed to embed 
naturally into the programming languages of the Beam SDK being used. This 
allows Beam users to continue using native types while reaping the advantage of 
having Beam understand their element schemas.
+
+In Java you could use the following set of classes to represent the purchase 
schema. Beam will automatically infer the correct schema based on the members 
of the class.
+
+#### Java POJOs
+
+A `POJO` (Plain Old Java Object) is a Java object that is not bound by any 
restriction other than the Java Language Specification. A `POJO` can contain 
member variables that are primitives, that are other POJOs, or are collections 
maps or arrays thereof. `POJO`s do not have to extend prespecified classes or 
extend any specific interfaces.
+
+If a `POJO` class is annotated with `@DefaultSchema(JavaFieldSchema.class)`, 
Beam will automatically infer a schema for this class. Nested classes are 
supported as are classes with List, array, and Map fields.
+
+For example, annotating the following class tells Beam to infer a schema from 
this `POJO` class and apply it to any `PCollection<TransactionPojo>`.
+
+```
+@DefaultSchema(JavaFieldSchema.class)
+public class TransactionPojo {
+  public final String bank;
+  public final double purchaseAmount;
+  @SchemaCreate
+  public TransactionPojo(String bank, double purchaseAmount) {
+    this.bank = bank;
+    this.purchaseAmount = purchaseAmount;
+  }
+}
+// Beam will automatically infer the correct schema for this PCollection. No 
coder is needed as a result.
+PCollection<TransactionPojo> pojos = readPojos();
+```
+The `@SchemaCreate` annotation tells Beam that this constructor can be used to 
create instances of `TransactionPojo`, assuming that constructor parameters 
have the same names as the field names. `@SchemaCreate` can also be used to 
annotate static factory methods on the class, allowing the constructor to 
remain private. If there is no @SchemaCreate annotation then all the fields 
must be non-final and the class must have a zero-argument constructor.
+
+There are a couple of other useful annotations that affect how Beam infers 
schemas. By default the schema field names inferred will match that of the 
class field names. However `@SchemaFieldName` can be used to specify a 
different name to be used for the schema field. @SchemaIgnore can be used to 
mark specific class fields as excluded from the inferred schema. For example, 
it’s common to have ephemeral fields in a class that should not be included in 
a schema (e.g. caching the hash value to prevent expensive recomputation of the 
hash), and @SchemaIgnore can be used to exclude these fields. Note that ignored 
fields will not be included in the encoding of these records.
+
+In some cases it is not convenient to annotate the POJO class, for example if 
the POJO is in a different package that is not owned by the Beam pipeline 
author. In these cases the schema inference can be triggered programmatically 
in pipeline’s main function as follows:
+
+```
+pipeline.getSchemaRegistry().registerPOJO(TransactionPOJO.class);
+```
+
+#### Java Beans
+
+Java Beans are a de-facto standard for creating reusable property classes in 
Java. While the full standard has many characteristics, the key ones are that 
all properties are accessed via getter and setter classes, and the name format 
for these getters and setters is standardized. A Java Bean class can be 
annotated with `@DefaultSchema(JavaBeanSchema.class)` and Beam will 
automatically infer a schema for this class.
+
+The `@SchemaCreate` annotation can be used to specify a constructor or a 
static factory method, in which case the setters and zero-argument constructor 
can be omitted.
+
+```
+@DefaultSchema(JavaBeanSchema.class)
+public class Purchase {
+  public String getUserId();  // Returns the id of the user who made the 
purchase.
+  public long getItemId();  // Returns the identifier of the item that was 
purchased.
+  public ShippingAddress getShippingAddress();  // Returns the shipping 
address, a nested type.
+  public long getCostCents();  // Returns the cost of the item.
+  public List<Transaction> getTransactions();  // Returns the transactions 
that paid for this purchase (returns a list, since the purchase might be spread 
out over multiple credit cards).
+
+  @SchemaCreate
+  public Purchase(String userId, long itemId, ShippingAddress shippingAddress, 
long costCents, List<Transaction> transactions) {
+      ...
+  }
+}
+```
+
+`@SchemaFieldName` and `@SchemaIgnore` can be used to alter the schema 
inferred, just like with `POJO` classes.
+
+#### AutoValue
+
+Java value classes are notoriously difficult to generate correctly. There is a 
lot of boilerplate you must create in order to properly implement a value 
class. `AutoValue` is a popular library for easily generating such classes by 
implementing a simple abstract base class.
+
+Beam can infer a schema from an `AutoValue` class. For example:
+
+```
+@DefaultSchema(AutoValueSchema.class)
+@AutoValue
+public abstract class ShippingAddress {
+  public abstract String streetAddress();
+  public abstract String city();
+  public abstract String state();
+  public abstract String country();
+  public abstract String postCode();
+}
+```
+
+This is all that’s needed to generate a simple `AutoValue` class, and the 
above `@DefaultSchema` annotation tells Beam to infer a schema from it. This 
also allows AutoValue elements to be used inside of `PCollections`.
+
+`@SchemaFieldName` and `@SchemaIgnore` can be used to alter the schema 
inferred.

Review Comment:
   This is all that’s needed to generate a simple `AutoValue` class, and the 
above `@DefaultSchema` annotation tells Beam to infer a schema from it. This 
also allows AutoValue elements to be used inside of `PCollections`.
                
   You can also use `@SchemaFieldName` and `@SchemaIgnore` annotations to 
specify different schema field names or ignore fields, respectively. 



##########
learning/tour-of-beam/learning-content/java/schema-based-transforms/schema-concept/creating-schema/description.md:
##########
@@ -0,0 +1,153 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Overview
+
+Most structured records share some common characteristics:
+
+→  They can be subdivided into separate named fields. Fields usually have 
string names, but sometimes - as in the case of indexed tuples - have numerical 
indices instead.
+
+→  There is a confined list of primitive types that a field can have. These 
often match primitive types in most programming languages: int, long, string, 
etc.
+
+→  Often a field type can be marked as optional (sometimes referred to as 
nullable) or required.
+
+Often records have a nested structure. A nested structure occurs when a field 
itself has subfields so the type of the field itself has a schema. Fields that 
are array or map types is also a common feature of these structured records.
+
+For example, consider the following schema, representing actions in a 
fictitious e-commerce company:
+
+**Purchase**
+
+```
+Field Name              Field Type
+userId                  STRING
+itemId                  INT64
+shippingAddress         ROW(ShippingAddress)
+cost                    INT64
+transactions            ARRAY[ROW(Transaction)]
+```
+
+**ShippingAddress**
+
+```
+Field Name              Field Type
+streetAddress           STRING
+city                    STRING
+state                   nullable STRING
+country                 STRING
+postCode                STRING
+```
+
+**Transaction**
+
+```
+Field Name              Field Type
+bank                    STRING
+purchaseAmount          DOUBLE
+```
+
+Schemas provide us a type-system for Beam records that is independent of any 
specific programming-language type. There might be multiple Java classes that 
all have the same schema (for example a Protocol-Buffer class or a POJO class), 
and Beam will allow us to seamlessly convert between these types. Schemas also 
provide a simple way to reason about types across different 
programming-language APIs.
+
+A `PCollection` with a schema does not need to have a `Coder` specified, as 
Beam knows how to encode and decode Schema rows; Beam uses a special coder to 
encode schema types.
+
+### Creating Schemas
+
+While schemas themselves are language independent, they are designed to embed 
naturally into the programming languages of the Beam SDK being used. This 
allows Beam users to continue using native types while reaping the advantage of 
having Beam understand their element schemas.
+
+In Java you could use the following set of classes to represent the purchase 
schema. Beam will automatically infer the correct schema based on the members 
of the class.
+
+#### Java POJOs
+
+A `POJO` (Plain Old Java Object) is a Java object that is not bound by any 
restriction other than the Java Language Specification. A `POJO` can contain 
member variables that are primitives, that are other POJOs, or are collections 
maps or arrays thereof. `POJO`s do not have to extend prespecified classes or 
extend any specific interfaces.
+
+If a `POJO` class is annotated with `@DefaultSchema(JavaFieldSchema.class)`, 
Beam will automatically infer a schema for this class. Nested classes are 
supported as are classes with List, array, and Map fields.
+
+For example, annotating the following class tells Beam to infer a schema from 
this `POJO` class and apply it to any `PCollection<TransactionPojo>`.
+
+```
+@DefaultSchema(JavaFieldSchema.class)
+public class TransactionPojo {
+  public final String bank;
+  public final double purchaseAmount;
+  @SchemaCreate
+  public TransactionPojo(String bank, double purchaseAmount) {
+    this.bank = bank;
+    this.purchaseAmount = purchaseAmount;
+  }
+}
+// Beam will automatically infer the correct schema for this PCollection. No 
coder is needed as a result.
+PCollection<TransactionPojo> pojos = readPojos();
+```
+The `@SchemaCreate` annotation tells Beam that this constructor can be used to 
create instances of `TransactionPojo`, assuming that constructor parameters 
have the same names as the field names. `@SchemaCreate` can also be used to 
annotate static factory methods on the class, allowing the constructor to 
remain private. If there is no @SchemaCreate annotation then all the fields 
must be non-final and the class must have a zero-argument constructor.
+
+There are a couple of other useful annotations that affect how Beam infers 
schemas. By default the schema field names inferred will match that of the 
class field names. However `@SchemaFieldName` can be used to specify a 
different name to be used for the schema field. @SchemaIgnore can be used to 
mark specific class fields as excluded from the inferred schema. For example, 
it’s common to have ephemeral fields in a class that should not be included in 
a schema (e.g. caching the hash value to prevent expensive recomputation of the 
hash), and @SchemaIgnore can be used to exclude these fields. Note that ignored 
fields will not be included in the encoding of these records.
+
+In some cases it is not convenient to annotate the POJO class, for example if 
the POJO is in a different package that is not owned by the Beam pipeline 
author. In these cases the schema inference can be triggered programmatically 
in pipeline’s main function as follows:
+
+```
+pipeline.getSchemaRegistry().registerPOJO(TransactionPOJO.class);
+```
+
+#### Java Beans
+
+Java Beans are a de-facto standard for creating reusable property classes in 
Java. While the full standard has many characteristics, the key ones are that 
all properties are accessed via getter and setter classes, and the name format 
for these getters and setters is standardized. A Java Bean class can be 
annotated with `@DefaultSchema(JavaBeanSchema.class)` and Beam will 
automatically infer a schema for this class.

Review Comment:
   Java Beans are a de-facto standard for creating reusable property classes in 
Java. While the full standard has many characteristics, the key ones are that 
all fields must be accessed using getters and setters, and the name format for 
these getters and setters is standardized. A Java Bean class can be annotated 
with `@DefaultSchema(JavaBeanSchema.class)`, and Beam will automatically infer 
a schema for this class.
   
   As with POJO classes, you can use the `@SchemaCreate` annotation to specify 
a constructor or a static factory method. Otherwise, Beam will use the 
zero-argument constructor and the setters to instantiate the class. 
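
   For the setter-based path, a bean could look roughly like the sketch below. 
It is an assumption-laden illustration rather than lesson text: 
`TransactionBean` and `TransactionBeanDemo` are hypothetical names, the Beam 
annotation appears only in a comment so the snippet compiles without the SDK, 
and `build()` imitates by hand what a framework would do; it is not Beam's 
actual instantiation code.

```java
// Bean-style sketch with no Beam dependency.
// In the lesson this class would carry @DefaultSchema(JavaBeanSchema.class).
class TransactionBean {
  private String bank;
  private double purchaseAmount;

  // Zero-argument constructor: needed when no creator is designated.
  public TransactionBean() {}

  // Getter/setter pairs follow the standard getX/setX naming.
  public String getBank() {
    return bank;
  }

  public void setBank(String bank) {
    this.bank = bank;
  }

  public double getPurchaseAmount() {
    return purchaseAmount;
  }

  public void setPurchaseAmount(double purchaseAmount) {
    this.purchaseAmount = purchaseAmount;
  }
}

// Hypothetical demo: create the bean empty, then populate it through setters.
class TransactionBeanDemo {
  static TransactionBean build() {
    TransactionBean t = new TransactionBean();
    t.setBank("Bank A");        // invented sample value
    t.setPurchaseAmount(12.5);  // invented sample value
    return t;
  }

  public static void main(String[] args) {
    TransactionBean t = build();
    System.out.println(t.getBank() + " " + t.getPurchaseAmount());
  }
}
```

   With a `@SchemaCreate` constructor or factory method, the setters and the 
zero-argument constructor could be dropped and the fields made final.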



##########
learning/tour-of-beam/learning-content/java/schema-based-transforms/schema-concept/creating-schema/description.md:
##########
@@ -0,0 +1,153 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Overview
+
+Most structured records share some common characteristics:
+
+→  They can be subdivided into separate named fields. Fields usually have 
string names, but sometimes - as in the case of indexed tuples - have numerical 
indices instead.
+
+→  There is a confined list of primitive types that a field can have. These 
often match primitive types in most programming languages: int, long, string, 
etc.
+
+→  Often a field type can be marked as optional (sometimes referred to as 
nullable) or required.
+
+Often records have a nested structure. A nested structure occurs when a field 
itself has subfields so the type of the field itself has a schema. Fields that 
are array or map types is also a common feature of these structured records.

Review Comment:
   Often records have a nested structure. A nested structure occurs when a 
field itself has subfields, so the type of the field itself has a schema. 
Fields that are array or map types are also a common feature of these 
structured records.



##########
learning/tour-of-beam/learning-content/java/schema-based-transforms/schema-concept/creating-schema/description.md:
##########
@@ -0,0 +1,153 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Overview
+
+Most structured records share some common characteristics:
+
+→  They can be subdivided into separate named fields. Fields usually have 
string names, but sometimes - as in the case of indexed tuples - have numerical 
indices instead.
+
+→  There is a confined list of primitive types that a field can have. These 
often match primitive types in most programming languages: int, long, string, 
etc.
+
+→  Often a field type can be marked as optional (sometimes referred to as 
nullable) or required.
+
+Often records have a nested structure. A nested structure occurs when a field 
itself has subfields so the type of the field itself has a schema. Fields that 
are array or map types is also a common feature of these structured records.
+
+For example, consider the following schema, representing actions in a 
fictitious e-commerce company:
+
+**Purchase**
+
+```
+Field Name              Field Type
+userId                  STRING
+itemId                  INT64
+shippingAddress         ROW(ShippingAddress)
+cost                    INT64
+transactions            ARRAY[ROW(Transaction)]
+```
+
+**ShippingAddress**
+
+```
+Field Name              Field Type
+streetAddress           STRING
+city                    STRING
+state                   nullable STRING
+country                 STRING
+postCode                STRING
+```
+
+**Transaction**
+
+```
+Field Name              Field Type
+bank                    STRING
+purchaseAmount          DOUBLE
+```
+
+Schemas provide us a type-system for Beam records that is independent of any 
specific programming-language type. There might be multiple Java classes that 
all have the same schema (for example a Protocol-Buffer class or a POJO class), 
and Beam will allow us to seamlessly convert between these types. Schemas also 
provide a simple way to reason about types across different 
programming-language APIs.
+
+A `PCollection` with a schema does not need to have a `Coder` specified, as 
Beam knows how to encode and decode Schema rows; Beam uses a special coder to 
encode schema types.
+
+### Creating Schemas
+
+While schemas themselves are language independent, they are designed to embed 
naturally into the programming languages of the Beam SDK being used. This 
allows Beam users to continue using native types while reaping the advantage of 
having Beam understand their element schemas.
+
+In Java you could use the following set of classes to represent the purchase 
schema. Beam will automatically infer the correct schema based on the members 
of the class.
+
+#### Java POJOs
+
+A `POJO` (Plain Old Java Object) is a Java object that is not bound by any 
restriction other than the Java Language Specification. A `POJO` can contain 
member variables that are primitives, that are other POJOs, or are collections 
maps or arrays thereof. `POJO`s do not have to extend prespecified classes or 
extend any specific interfaces.
+
+If a `POJO` class is annotated with `@DefaultSchema(JavaFieldSchema.class)`, 
Beam will automatically infer a schema for this class. Nested classes are 
supported as are classes with List, array, and Map fields.

Review Comment:
   If a `POJO` class is annotated with `@DefaultSchema(JavaFieldSchema.class)`, 
Beam will automatically infer a schema for this class. Nested classes are 
supported, as are List, array, and Map fields.



##########
learning/tour-of-beam/learning-content/java/schema-based-transforms/schema-concept/creating-schema/description.md:
##########
@@ -0,0 +1,153 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Overview
+
+Most structured records share some common characteristics:
+
+→  They can be subdivided into separate named fields. Fields usually have 
string names, but sometimes - as in the case of indexed tuples - have numerical 
indices instead.
+
+→  There is a confined list of primitive types that a field can have. These 
often match primitive types in most programming languages: int, long, string, 
etc.
+
+→  Often a field type can be marked as optional (sometimes referred to as 
nullable) or required.
+
+Often records have a nested structure. A nested structure occurs when a field 
itself has subfields so the type of the field itself has a schema. Fields that 
are array or map types is also a common feature of these structured records.
+
+For example, consider the following schema, representing actions in a 
fictitious e-commerce company:
+
+**Purchase**
+
+```
+Field Name              Field Type
+userId                  STRING
+itemId                  INT64
+shippingAddress         ROW(ShippingAddress)
+cost                    INT64
+transactions            ARRAY[ROW(Transaction)]
+```
+
+**ShippingAddress**
+
+```
+Field Name              Field Type
+streetAddress           STRING
+city                    STRING
+state                   nullable STRING
+country                 STRING
+postCode                STRING
+```
+
+**Transaction**
+
+```
+Field Name              Field Type
+bank                    STRING
+purchaseAmount          DOUBLE
+```
+
+Schemas provide us a type-system for Beam records that is independent of any 
specific programming-language type. There might be multiple Java classes that 
all have the same schema (for example a Protocol-Buffer class or a POJO class), 
and Beam will allow us to seamlessly convert between these types. Schemas also 
provide a simple way to reason about types across different 
programming-language APIs.
+
+A `PCollection` with a schema does not need to have a `Coder` specified, as 
Beam knows how to encode and decode Schema rows; Beam uses a special coder to 
encode schema types.
+
+### Creating Schemas
+
+While schemas themselves are language independent, they are designed to embed 
naturally into the programming languages of the Beam SDK being used. This 
allows Beam users to continue using native types while reaping the advantage of 
having Beam understand their element schemas.
+
+In Java you could use the following set of classes to represent the purchase 
schema. Beam will automatically infer the correct schema based on the members 
of the class.

Review Comment:
   For the above e-commerce example, you can implement the `Purchase` schema 
using one of the following sets of Java classes: 



##########
learning/tour-of-beam/learning-content/java/schema-based-transforms/schema-concept/creating-schema/description.md:
##########
@@ -0,0 +1,153 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Overview
+
+Most structured records share some common characteristics:
+
+→  They can be subdivided into separate named fields. Fields usually have 
string names, but sometimes - as in the case of indexed tuples - have numerical 
indices instead.
+
+→  There is a confined list of primitive types that a field can have. These 
often match primitive types in most programming languages: int, long, string, 
etc.
+
+→  Often a field type can be marked as optional (sometimes referred to as 
nullable) or required.
+
+Often records have a nested structure. A nested structure occurs when a field 
itself has subfields so the type of the field itself has a schema. Fields that 
are array or map types is also a common feature of these structured records.
+
+For example, consider the following schema, representing actions in a 
fictitious e-commerce company:
+
+**Purchase**
+
+```
+Field Name              Field Type
+userId                  STRING
+itemId                  INT64
+shippingAddress         ROW(ShippingAddress)
+cost                    INT64
+transactions            ARRAY[ROW(Transaction)]
+```
+
+**ShippingAddress**
+
+```
+Field Name              Field Type
+streetAddress           STRING
+city                    STRING
+state                   nullable STRING
+country                 STRING
+postCode                STRING
+```
+
+**Transaction**
+
+```
+Field Name              Field Type
+bank                    STRING
+purchaseAmount          DOUBLE
+```
+
+Schemas provide us a type-system for Beam records that is independent of any 
specific programming-language type. There might be multiple Java classes that 
all have the same schema (for example a Protocol-Buffer class or a POJO class), 
and Beam will allow us to seamlessly convert between these types. Schemas also 
provide a simple way to reason about types across different 
programming-language APIs.
+
+A `PCollection` with a schema does not need to have a `Coder` specified, as 
Beam knows how to encode and decode Schema rows; Beam uses a special coder to 
encode schema types.
+
+### Creating Schemas
+
+While schemas themselves are language independent, they are designed to embed 
naturally into the programming languages of the Beam SDK being used. This 
allows Beam users to continue using native types while reaping the advantage of 
having Beam understand their element schemas.
+
+In Java you could use the following set of classes to represent the purchase 
schema. Beam will automatically infer the correct schema based on the members 
of the class.
+
+#### Java POJOs
+
+A `POJO` (Plain Old Java Object) is a Java object that is not bound by any 
restriction other than the Java Language Specification. A `POJO` can contain 
member variables that are primitives, that are other POJOs, or are collections 
maps or arrays thereof. `POJO`s do not have to extend prespecified classes or 
extend any specific interfaces.
+
+If a `POJO` class is annotated with `@DefaultSchema(JavaFieldSchema.class)`, 
Beam will automatically infer a schema for this class. Nested classes are 
supported as are classes with List, array, and Map fields.
+
+For example, annotating the following class tells Beam to infer a schema from 
this `POJO` class and apply it to any `PCollection<TransactionPojo>`.
+
+```
+@DefaultSchema(JavaFieldSchema.class)
+public class TransactionPojo {
+  public final String bank;
+  public final double purchaseAmount;
+  @SchemaCreate
+  public TransactionPojo(String bank, double purchaseAmount) {
+    this.bank = bank;
+    this.purchaseAmount = purchaseAmount;
+  }
+}
+// Beam will automatically infer the correct schema for this PCollection. No 
coder is needed as a result.
+PCollection<TransactionPojo> pojos = readPojos();
+```
+The `@SchemaCreate` annotation tells Beam that this constructor can be used to 
create instances of `TransactionPojo`, assuming that constructor parameters 
have the same names as the field names. `@SchemaCreate` can also be used to 
annotate static factory methods on the class, allowing the constructor to 
remain private. If there is no @SchemaCreate annotation then all the fields 
must be non-final and the class must have a zero-argument constructor.
+
+There are a couple of other useful annotations that affect how Beam infers 
schemas. By default the schema field names inferred will match that of the 
class field names. However `@SchemaFieldName` can be used to specify a 
different name to be used for the schema field. @SchemaIgnore can be used to 
mark specific class fields as excluded from the inferred schema. For example, 
it’s common to have ephemeral fields in a class that should not be included in 
a schema (e.g. caching the hash value to prevent expensive recomputation of the 
hash), and @SchemaIgnore can be used to exclude these fields. Note that ignored 
fields will not be included in the encoding of these records.
+
+In some cases it is not convenient to annotate the POJO class, for example if 
the POJO is in a different package that is not owned by the Beam pipeline 
author. In these cases the schema inference can be triggered programmatically 
in pipeline’s main function as follows:
+
+```
+pipeline.getSchemaRegistry().registerPOJO(TransactionPOJO.class);
+```
+
+#### Java Beans
+
+Java Beans are a de-facto standard for creating reusable property classes in 
Java. While the full standard has many characteristics, the key ones are that 
all properties are accessed via getter and setter classes, and the name format 
for these getters and setters is standardized. A Java Bean class can be 
annotated with `@DefaultSchema(JavaBeanSchema.class)` and Beam will 
automatically infer a schema for this class.
+
+The `@SchemaCreate` annotation can be used to specify a constructor or a 
static factory method, in which case the setters and zero-argument constructor 
can be omitted.
+
+```
+@DefaultSchema(JavaBeanSchema.class)
+public class Purchase {
+  public String getUserId();  // Returns the id of the user who made the 
purchase.
+  public long getItemId();  // Returns the identifier of the item that was 
purchased.
+  public ShippingAddress getShippingAddress();  // Returns the shipping 
address, a nested type.
+  public long getCostCents();  // Returns the cost of the item.
+  public List<Transaction> getTransactions();  // Returns the transactions 
that paid for this purchase (returns a list, since the purchase might be spread 
out over multiple credit cards).
+
+  @SchemaCreate
+  public Purchase(String userId, long itemId, ShippingAddress shippingAddress, 
long costCents, List<Transaction> transactions) {
+      ...
+  }
+}
+```
+
+`@SchemaFieldName` and `@SchemaIgnore` can be used to alter the schema 
inferred, just like with `POJO` classes.
+
+#### AutoValue
+
+Java value classes are notoriously difficult to generate correctly. There is a 
lot of boilerplate you must create in order to properly implement a value 
class. `AutoValue` is a popular library for easily generating such classes by 
implementing a simple abstract base class.
+
+Beam can infer a schema from an `AutoValue` class. For example:
+
+```
+@DefaultSchema(AutoValueSchema.class)
+@AutoValue
+public abstract class ShippingAddress {
+  public abstract String streetAddress();
+  public abstract String city();
+  public abstract String state();
+  public abstract String country();
+  public abstract String postCode();
+}
+```
+
+This is all that’s needed to generate a simple `AutoValue` class, and the 
above `@DefaultSchema` annotation tells Beam to infer a schema from it. This 
also allows AutoValue elements to be used inside of `PCollections`.
+
+`@SchemaFieldName` and `@SchemaIgnore` can be used to alter the schema 
inferred.
+
+### Playground exercise
+
+You can find the complete code of this example in the playground window you 
can run and experiment with.
+
+One of the differences you will notice is that it also contains the part to 
output `PCollection` elements to the console.
+
+Do you also notice in what order elements of PCollection appear in the 
console? Why is that? You can also run the example several times to see if the 
output stays the same or changes.

Review Comment:
   'Do you also notice in what order elements of PCollection appear in the 
console? Why is that? You can also run the example several times to see if the 
output stays the same or changes.' isn't applicable at this stage; please 
remove it. 



##########
learning/tour-of-beam/learning-content/java/schema-based-transforms/schema-concept/creating-schema/description.md:
##########
@@ -0,0 +1,153 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Overview
+
+Most structured records share some common characteristics:
+
+→  They can be subdivided into separate named fields. Fields usually have 
string names, but sometimes - as in the case of indexed tuples - have numerical 
indices instead.
+
+→  There is a confined list of primitive types that a field can have. These 
often match primitive types in most programming languages: int, long, string, 
etc.
+
+→  Often a field type can be marked as optional (sometimes referred to as 
nullable) or required.
+
+Often records have a nested structure. A nested structure occurs when a field 
itself has subfields so the type of the field itself has a schema. Fields that 
are array or map types is also a common feature of these structured records.
+
+For example, consider the following schema, representing actions in a 
fictitious e-commerce company:
+
+**Purchase**
+
+```
+Field Name              Field Type
+userId                  STRING
+itemId                  INT64
+shippingAddress         ROW(ShippingAddress)
+cost                    INT64
+transactions            ARRAY[ROW(Transaction)]
+```
+
+**ShippingAddress**
+
+```
+Field Name              Field Type
+streetAddress           STRING
+city                    STRING
+state                   nullable STRING
+country                 STRING
+postCode                STRING
+```
+
+**Transaction**
+
+```
+Field Name              Field Type
+bank                    STRING
+purchaseAmount          DOUBLE
+```
+
+Schemas provide us a type-system for Beam records that is independent of any 
specific programming-language type. There might be multiple Java classes that 
all have the same schema (for example a Protocol-Buffer class or a POJO class), 
and Beam will allow us to seamlessly convert between these types. Schemas also 
provide a simple way to reason about types across different 
programming-language APIs.
+
+A `PCollection` with a schema does not need to have a `Coder` specified, as 
Beam knows how to encode and decode Schema rows; Beam uses a special coder to 
encode schema types.
+
+### Creating Schemas
+
+While schemas themselves are language independent, they are designed to embed 
naturally into the programming languages of the Beam SDK being used. This 
allows Beam users to continue using native types while reaping the advantage of 
having Beam understand their element schemas.
+
+In Java you could use the following set of classes to represent the purchase 
schema. Beam will automatically infer the correct schema based on the members 
of the class.
+
+#### Java POJOs
+
+A `POJO` (Plain Old Java Object) is a Java object that is not bound by any 
restriction other than the Java Language Specification. A `POJO` can contain 
member variables that are primitives, that are other POJOs, or are collections 
maps or arrays thereof. `POJO`s do not have to extend prespecified classes or 
extend any specific interfaces.
+
+If a `POJO` class is annotated with `@DefaultSchema(JavaFieldSchema.class)`, 
Beam will automatically infer a schema for this class. Nested classes are 
supported as are classes with List, array, and Map fields.
+
+For example, annotating the following class tells Beam to infer a schema from 
this `POJO` class and apply it to any `PCollection<TransactionPojo>`.
+
+```
+@DefaultSchema(JavaFieldSchema.class)
+public class TransactionPojo {
+  public final String bank;
+  public final double purchaseAmount;
+  @SchemaCreate
+  public TransactionPojo(String bank, double purchaseAmount) {
+    this.bank = bank;
+    this.purchaseAmount = purchaseAmount;
+  }
+}
+// Beam will automatically infer the correct schema for this PCollection. No 
coder is needed as a result.
+PCollection<TransactionPojo> pojos = readPojos();
+```
+The `@SchemaCreate` annotation tells Beam that this constructor can be used to 
create instances of `TransactionPojo`, assuming that constructor parameters 
have the same names as the field names. `@SchemaCreate` can also be used to 
annotate static factory methods on the class, allowing the constructor to 
remain private. If there is no @SchemaCreate annotation then all the fields 
must be non-final and the class must have a zero-argument constructor.
+
+There are a couple of other useful annotations that affect how Beam infers 
schemas. By default the schema field names inferred will match that of the 
class field names. However `@SchemaFieldName` can be used to specify a 
different name to be used for the schema field. @SchemaIgnore can be used to 
mark specific class fields as excluded from the inferred schema. For example, 
it’s common to have ephemeral fields in a class that should not be included in 
a schema (e.g. caching the hash value to prevent expensive recomputation of the 
hash), and @SchemaIgnore can be used to exclude these fields. Note that ignored 
fields will not be included in the encoding of these records.

Review Comment:
   A couple of other useful annotations affect how Beam infers schemas. By 
default, the schema field names will match the class field names. However, 
`@SchemaFieldName` can be used to specify a different name for the schema 
field. 
   
   You can use `@SchemaIgnore` to mark specific class fields as excluded from 
the inferred schema. For example, it’s common to have ephemeral fields in a 
class that should not be included in a schema (e.g., caching the hash value to 
prevent expensive recomputation of the hash), and `@SchemaIgnore` lets you 
exclude such fields. Note that ignored fields will be excluded from encoding as 
well.
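   
   As a rough illustration of the rename and ignore behavior described above, 
the sketch below uses plain Java reflection with two stand-in annotations. 
`@Rename` and `@Skip` are hypothetical and defined inside the example; they 
only mimic the roles of `@SchemaFieldName` and `@SchemaIgnore` and are not 
Beam's annotations or Beam's inference code.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FieldNameSketch {

  /** Hypothetical stand-in for a "use this schema field name" annotation. */
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.FIELD)
  public @interface Rename {
    String value();
  }

  /** Hypothetical stand-in for an "exclude this field" annotation. */
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.FIELD)
  public @interface Skip {}

  /** Hypothetical POJO with one renamed field and one ephemeral field. */
  public static class Transaction {
    @Rename("bankName")
    public String bank;

    public double purchaseAmount;

    @Skip
    public int cachedHash; // ephemeral value, kept out of the inferred field list
  }

  /** Returns the inferred field names, sorted because reflection order is unspecified. */
  public static List<String> schemaFieldNames(Class<?> clazz) {
    List<String> names = new ArrayList<>();
    for (Field field : clazz.getDeclaredFields()) {
      if (field.isSynthetic() || Modifier.isStatic(field.getModifiers())) {
        continue; // only instance fields declared by the author
      }
      if (field.isAnnotationPresent(Skip.class)) {
        continue; // excluded from the schema
      }
      Rename rename = field.getAnnotation(Rename.class);
      names.add(rename != null ? rename.value() : field.getName());
    }
    Collections.sort(names);
    return names;
  }

  public static void main(String[] args) {
    System.out.println(schemaFieldNames(Transaction.class));
  }
}
```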



##########
learning/tour-of-beam/learning-content/java/schema-based-transforms/schema-concept/creating-schema/description.md:
##########
@@ -0,0 +1,153 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Overview
+
+Most structured records share some common characteristics:
+
+→  They can be subdivided into separate named fields. Fields usually have 
string names, but sometimes - as in the case of indexed tuples - have numerical 
indices instead.
+
+→  There is a confined list of primitive types that a field can have. These 
often match primitive types in most programming languages: int, long, string, 
etc.
+
+→  Often a field type can be marked as optional (sometimes referred to as 
nullable) or required.
+
+Often records have a nested structure. A nested structure occurs when a field 
itself has subfields so the type of the field itself has a schema. Fields that 
are array or map types is also a common feature of these structured records.
+
+For example, consider the following schema, representing actions in a 
fictitious e-commerce company:
+
+**Purchase**
+
+```
+Field Name              Field Type
+userId                  STRING
+itemId                  INT64
+shippingAddress         ROW(ShippingAddress)
+cost                    INT64
+transactions            ARRAY[ROW(Transaction)]
+```
+
+**ShippingAddress**
+
+```
+Field Name              Field Type
+streetAddress           STRING
+city                    STRING
+state                   nullable STRING
+country                 STRING
+postCode                STRING
+```
+
+**Transaction**
+
+```
+Field Name              Field Type
+bank                    STRING
+purchaseAmount          DOUBLE
+```
+
+Schemas provide us a type-system for Beam records that is independent of any 
specific programming-language type. There might be multiple Java classes that 
all have the same schema (for example a Protocol-Buffer class or a POJO class), 
and Beam will allow us to seamlessly convert between these types. Schemas also 
provide a simple way to reason about types across different 
programming-language APIs.
+
+A `PCollection` with a schema does not need to have a `Coder` specified, as 
Beam knows how to encode and decode Schema rows; Beam uses a special coder to 
encode schema types.
+
+### Creating Schemas
+
+While schemas themselves are language independent, they are designed to embed 
naturally into the programming languages of the Beam SDK being used. This 
allows Beam users to continue using native types while reaping the advantage of 
having Beam understand their element schemas.

Review Comment:
   While schemas are language-independent, they are designed to be embedded 
naturally into the programming languages supported by the Beam SDKs. You can 
continue using native Java types with Beam while taking advantage of 
schema-based transforms.



##########
learning/tour-of-beam/learning-content/java/schema-based-transforms/schema-concept/logical-type/description.md:
##########
@@ -0,0 +1,102 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# Logical types
+
+Users can extend the schema type system to add custom logical types that can 
be used as a field. A logical type is identified by a unique identifier and an 
argument. A logical type also specifies an underlying schema type to be used 
for storage, along with conversions to and from that type. As an example, a 
logical union can always be represented as a row with nullable fields, where 
the user ensures that only one of those fields is ever set at a time.

Review Comment:
   There may be cases when you need to extend the schema type system to add 
custom logical types. A logical type is identified by a unique identifier and 
an argument. Apart from defining the underlying schema type for storage, you 
also need to implement conversions to and from that type. For example, you can 
represent the union logical type as a row with nullable fields, with only one 
field set at a time. 
   
   In Java, you need to implement the `LogicalType` interface to define the 
logical type. In addition, you will need to implement the conversions to and 
from the underlying schema type in the `toBaseType` and `toInputType` methods, 
respectively. 
   
   For example, the logical type representing nanosecond timestamp might be 
implemented as follows:
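   
   A minimal, JDK-only sketch of the conversion pair for such a nanosecond 
timestamp is shown here. The `Conversion` interface and the `SecondsAndNanos` 
holder are hypothetical stand-ins introduced for this example (a real Beam 
logical type would convert to and from a schema row and would also declare an 
identifier and a base type); only the `toBaseType` and `toInputType` method 
names are taken from the text above.

```java
import java.time.Instant;

public class NanosTimestampSketch {

  /** Hypothetical conversion contract; not Beam's LogicalType interface. */
  public interface Conversion<InputT, BaseT> {
    BaseT toBaseType(InputT input);

    InputT toInputType(BaseT base);
  }

  /** Hypothetical storage form: whole seconds plus the nanosecond remainder. */
  public static final class SecondsAndNanos {
    public final long seconds;
    public final int nanos;

    public SecondsAndNanos(long seconds, int nanos) {
      this.seconds = seconds;
      this.nanos = nanos;
    }
  }

  /** Converts between java.time.Instant and the storage form without losing precision. */
  public static final class NanosInstant implements Conversion<Instant, SecondsAndNanos> {
    @Override
    public SecondsAndNanos toBaseType(Instant input) {
      return new SecondsAndNanos(input.getEpochSecond(), input.getNano());
    }

    @Override
    public Instant toInputType(SecondsAndNanos base) {
      return Instant.ofEpochSecond(base.seconds, base.nanos);
    }
  }

  public static void main(String[] args) {
    NanosInstant type = new NanosInstant();
    Instant now = Instant.now();
    // A round trip through the storage form should give back an equal Instant.
    System.out.println(type.toInputType(type.toBaseType(now)).equals(now));
  }
}
```

   Splitting the value into seconds and nanoseconds keeps the full precision 
of `Instant`, which a single millisecond field would not.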



##########
learning/tour-of-beam/learning-content/java/schema-based-transforms/schema-concept/creating-schema/description.md:
##########
@@ -0,0 +1,153 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Overview
+
+Most structured records share some common characteristics:
+
+→  They can be subdivided into separate named fields. Fields usually have 
string names, but sometimes - as in the case of indexed tuples - have numerical 
indices instead.
+
+→  There is a confined list of primitive types that a field can have. These 
often match primitive types in most programming languages: int, long, string, 
etc.
+
+→  Often a field type can be marked as optional (sometimes referred to as 
nullable) or required.
+
+Often records have a nested structure. A nested structure occurs when a field 
itself has subfields so the type of the field itself has a schema. Fields that 
are array or map types is also a common feature of these structured records.
+
+For example, consider the following schema, representing actions in a 
fictitious e-commerce company:
+
+**Purchase**
+
+```
+Field Name              Field Type
+userId                  STRING
+itemId                  INT64
+shippingAddress         ROW(ShippingAddress)
+cost                    INT64
+transactions            ARRAY[ROW(Transaction)]
+```
+
+**ShippingAddress**
+
+```
+Field Name              Field Type
+streetAddress           STRING
+city                    STRING
+state                   nullable STRING
+country                 STRING
+postCode                STRING
+```
+
+**Transaction**
+
+```
+Field Name              Field Type
+bank                    STRING
+purchaseAmount          DOUBLE
+```
+
+Schemas provide us a type-system for Beam records that is independent of any 
specific programming-language type. There might be multiple Java classes that 
all have the same schema (for example a Protocol-Buffer class or a POJO class), 
and Beam will allow us to seamlessly convert between these types. Schemas also 
provide a simple way to reason about types across different 
programming-language APIs.
+
+A `PCollection` with a schema does not need to have a `Coder` specified, as 
Beam knows how to encode and decode Schema rows; Beam uses a special coder to 
encode schema types.
+
+### Creating Schemas
+
+While schemas themselves are language independent, they are designed to embed 
naturally into the programming languages of the Beam SDK being used. This 
allows Beam users to continue using native types while reaping the advantage of 
having Beam understand their element schemas.
+
+In Java you could use the following set of classes to represent the purchase 
schema. Beam will automatically infer the correct schema based on the members 
of the class.
+
+#### Java POJOs
+
+A `POJO` (Plain Old Java Object) is a Java object that is not bound by any 
restriction other than the Java Language Specification. A `POJO` can contain 
member variables that are primitives, other POJOs, or collections, maps, or 
arrays thereof. `POJO`s do not have to extend prespecified classes or 
implement any specific interfaces.
+
+If a `POJO` class is annotated with `@DefaultSchema(JavaFieldSchema.class)`, 
Beam will automatically infer a schema for this class. Nested classes are 
supported as are classes with List, array, and Map fields.
+
+For example, annotating the following class tells Beam to infer a schema from 
this `POJO` class and apply it to any `PCollection<TransactionPojo>`.
+
+```
+@DefaultSchema(JavaFieldSchema.class)
+public class TransactionPojo {
+  public final String bank;
+  public final double purchaseAmount;
+  @SchemaCreate
+  public TransactionPojo(String bank, double purchaseAmount) {
+    this.bank = bank;
+    this.purchaseAmount = purchaseAmount;
+  }
+}
+// Beam will automatically infer the correct schema for this PCollection.
+// No coder is needed as a result.
+PCollection<TransactionPojo> pojos = readPojos();
+```
+The `@SchemaCreate` annotation tells Beam that this constructor can be used to 
create instances of `TransactionPojo`, assuming that constructor parameters 
have the same names as the field names. `@SchemaCreate` can also be used to 
annotate static factory methods on the class, allowing the constructor to 
remain private. If there is no `@SchemaCreate` annotation, then all the fields 
must be non-final and the class must have a zero-argument constructor.

Review Comment:
   You can use the `@SchemaCreate` annotation to tell Beam to use the annotated 
constructor to create class instances, assuming the constructor parameters have 
the same names as the fields. You can also use `@SchemaCreate` to annotate static 
factory methods on the class, allowing the constructor to remain private. If 
there is no `@SchemaCreate` annotation, then all the fields must be non-final, 
and the class must have a zero-argument constructor.
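
   To make the static-factory case concrete, here is a minimal sketch in plain Java. It does not use Beam: the `SchemaCreate` annotation below is a stand-in declared locally so the example compiles on its own, and `createViaAnnotatedFactory` is a hypothetical helper, not Beam's implementation. It passes arguments positionally rather than matching parameter names to field names, and only shows that an annotated factory lets instances be created while the constructor stays private.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

public class SchemaCreateSketch {

  // Stand-in for Beam's annotation, declared locally (assumption: this is NOT
  // org.apache.beam.sdk.schemas.annotations.SchemaCreate).
  @Retention(RetentionPolicy.RUNTIME)
  @Target({ElementType.METHOD, ElementType.CONSTRUCTOR})
  public @interface SchemaCreate {}

  public static final class TransactionPojo {
    public final String bank;
    public final double purchaseAmount;

    // The constructor stays private; instances are created through the factory.
    private TransactionPojo(String bank, double purchaseAmount) {
      this.bank = bank;
      this.purchaseAmount = purchaseAmount;
    }

    @SchemaCreate
    public static TransactionPojo of(String bank, double purchaseAmount) {
      return new TransactionPojo(bank, purchaseAmount);
    }
  }

  // Hypothetical helper: finds the static method carrying the stand-in
  // annotation and invokes it with the given arguments, in order.
  public static Object createViaAnnotatedFactory(Class<?> type, Object... args) {
    for (Method m : type.getDeclaredMethods()) {
      if (m.isAnnotationPresent(SchemaCreate.class) && Modifier.isStatic(m.getModifiers())) {
        try {
          return m.invoke(null, args);
        } catch (ReflectiveOperationException e) {
          throw new IllegalStateException("factory call failed", e);
        }
      }
    }
    throw new IllegalStateException("no annotated factory method on " + type.getName());
  }

  public static void main(String[] args) {
    TransactionPojo t =
        (TransactionPojo) createViaAnnotatedFactory(TransactionPojo.class, "Acme Bank", 12.5);
    System.out.println(t.bank + " " + t.purchaseAmount);
  }
}
```

   With the real Beam annotation, the class would carry `@DefaultSchema(JavaFieldSchema.class)` as in the diff above, and no reflection helper would be written by the pipeline author.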



##########
learning/tour-of-beam/learning-content/java/schema-based-transforms/schema-concept/creating-schema/description.md:
##########
@@ -0,0 +1,153 @@
+
+There are a couple of other useful annotations that affect how Beam infers 
schemas. By default, the inferred schema field names match the class field 
names. However, `@SchemaFieldName` can be used to specify a different name for 
the schema field, and `@SchemaIgnore` can be used to exclude specific class 
fields from the inferred schema. For example, it’s common to have ephemeral 
fields in a class that should not be included in a schema (e.g. a cached hash 
value that avoids expensive recomputation of the hash), and `@SchemaIgnore` 
can be used to exclude these fields. Note that ignored fields will not be 
included in the encoding of these records.
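
The effect of renaming and ignoring fields can be illustrated with a small plain-Java sketch. It is a toy model only: the two annotations are stand-ins declared locally so the example compiles without Beam, `inferredFieldNames` is a hypothetical helper, and Beam's actual inference logic is not shown.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Set;
import java.util.TreeSet;

public class FieldNameSketch {

  // Stand-ins for Beam's annotations (assumption: declared locally, not Beam's classes).
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.FIELD)
  public @interface SchemaFieldName {
    String value();
  }

  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.FIELD)
  public @interface SchemaIgnore {}

  public static class TransactionPojo {
    // Exposed in the schema under a different name.
    @SchemaFieldName("bankName")
    public String bank;

    public double purchaseAmount;

    // Ephemeral cache: excluded from the schema.
    @SchemaIgnore
    public int cachedHash;
  }

  // Hypothetical helper: returns the schema field names this toy inference
  // would produce, sorted alphabetically for a stable result.
  public static Set<String> inferredFieldNames(Class<?> type) {
    Set<String> names = new TreeSet<>();
    for (Field f : type.getDeclaredFields()) {
      if (Modifier.isStatic(f.getModifiers())
          || f.isSynthetic()
          || f.isAnnotationPresent(SchemaIgnore.class)) {
        continue;
      }
      SchemaFieldName rename = f.getAnnotation(SchemaFieldName.class);
      names.add(rename != null ? rename.value() : f.getName());
    }
    return names;
  }

  public static void main(String[] args) {
    // prints [bankName, purchaseAmount]
    System.out.println(inferredFieldNames(TransactionPojo.class));
  }
}
```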
+
+In some cases it is not convenient to annotate the POJO class, for example if 
the POJO is in a different package that is not owned by the Beam pipeline 
author. In these cases, schema inference can be triggered programmatically 
in the pipeline’s main function as follows:
+
+```
+pipeline.getSchemaRegistry().registerPOJO(TransactionPojo.class);
+```
+
+#### Java Beans
+
+Java Beans are a de-facto standard for creating reusable property classes in 
Java. While the full standard has many characteristics, the key ones are that 
all properties are accessed via getter and setter methods, and the name format 
for these getters and setters is standardized. A Java Bean class can be 
annotated with `@DefaultSchema(JavaBeanSchema.class)` and Beam will 
automatically infer a schema for this class.
+
+The `@SchemaCreate` annotation can be used to specify a constructor or a 
static factory method, in which case the setters and zero-argument constructor 
can be omitted.
+
+```
+@DefaultSchema(JavaBeanSchema.class)
+public class Purchase {
+  // Returns the id of the user who made the purchase.
+  public String getUserId();
+  // Returns the identifier of the item that was purchased.
+  public long getItemId();
+  // Returns the shipping address, a nested type.
+  public ShippingAddress getShippingAddress();
+  // Returns the cost of the item.
+  public long getCostCents();
+  // Returns the transactions that paid for this purchase (a list, since the
+  // purchase might be spread out over multiple credit cards).
+  public List<Transaction> getTransactions();
+
+  @SchemaCreate
+  public Purchase(String userId, long itemId, ShippingAddress shippingAddress,
+      long costCents, List<Transaction> transactions) {
+      ...
+  }
+}
+```
+
+`@SchemaFieldName` and `@SchemaIgnore` can be used to alter the schema 
inferred, just like with `POJO` classes.
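
The standardized getter and setter naming mentioned above can be seen with the JDK's own `java.beans.Introspector`, which derives property names such as `userId` from `getUserId`/`setUserId`. The sketch below is plain Java and does not involve Beam; `PurchaseBean` is a hypothetical, cut-down bean with only two of the `Purchase` properties, and whether Beam uses the same mechanism internally is not claimed here.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BeanPropertySketch {

  // Hypothetical bean following the getX/setX naming convention.
  public static class PurchaseBean {
    private String userId;
    private long itemId;

    public String getUserId() { return userId; }
    public void setUserId(String userId) { this.userId = userId; }
    public long getItemId() { return itemId; }
    public void setItemId(long itemId) { this.itemId = itemId; }
  }

  // Lists the bean property names derived from the accessor methods.
  public static List<String> propertyNames(Class<?> beanClass) {
    try {
      // Stop at Object so the inherited "class" property is not reported.
      BeanInfo info = Introspector.getBeanInfo(beanClass, Object.class);
      List<String> names = new ArrayList<>();
      for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
        names.add(pd.getName());
      }
      Collections.sort(names);
      return names;
    } catch (IntrospectionException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // prints [itemId, userId]
    System.out.println(propertyNames(PurchaseBean.class));
  }
}
```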
+
+#### AutoValue
+
+Java value classes are notoriously difficult to generate correctly. There is a 
lot of boilerplate you must create in order to properly implement a value 
class. `AutoValue` is a popular library for easily generating such classes by 
implementing a simple abstract base class.

Review Comment:
   Java value classes are notoriously difficult to generate correctly. This is 
because there is a lot of boilerplate you must write to implement a value 
class properly. `AutoValue` is a popular library that simplifies creating 
such classes.
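
   As an illustration of that boilerplate, here is a hand-written value class for the two `Transaction` fields, in plain Java without AutoValue or Beam. The class name `TransactionValue` is hypothetical. Even with two fields, `equals`, `hashCode`, and `toString` all have to be written and kept consistent with each other by hand.

```java
import java.util.Objects;

public final class TransactionValue {
  private final String bank;
  private final double purchaseAmount;

  public TransactionValue(String bank, double purchaseAmount) {
    this.bank = Objects.requireNonNull(bank, "bank");
    this.purchaseAmount = purchaseAmount;
  }

  public String bank() {
    return bank;
  }

  public double purchaseAmount() {
    return purchaseAmount;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof TransactionValue)) {
      return false;
    }
    TransactionValue other = (TransactionValue) o;
    // Double.compare keeps equals consistent with the boxed hash used below.
    return bank.equals(other.bank)
        && Double.compare(purchaseAmount, other.purchaseAmount) == 0;
  }

  @Override
  public int hashCode() {
    return Objects.hash(bank, purchaseAmount);
  }

  @Override
  public String toString() {
    return "TransactionValue{bank=" + bank + ", purchaseAmount=" + purchaseAmount + "}";
  }

  public static void main(String[] args) {
    TransactionValue a = new TransactionValue("Acme Bank", 12.5);
    TransactionValue b = new TransactionValue("Acme Bank", 12.5);
    System.out.println(a.equals(b)); // prints true
  }
}
```

   Every field added later has to be threaded through all three methods, which is the kind of error-prone repetition the paragraph refers to.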


