alxp1982 commented on code in PR #24488:
URL: https://github.com/apache/beam/pull/24488#discussion_r1054001849

##########
learning/tour-of-beam/learning-content/java/schema-based-transforms/schema-concept/creating-schema/description.md:
##########
@@ -0,0 +1,153 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Overview
+
+Most structured records share some common characteristics:
+
+→ They can be subdivided into separate named fields. Fields usually have string names, but sometimes - as in the case of indexed tuples - have numerical indices instead.
+
+→ There is a confined list of primitive types that a field can have. These often match primitive types in most programming languages: int, long, string, etc.
+
+→ Often a field type can be marked as optional (sometimes referred to as nullable) or required.
+
+Often records have a nested structure. A nested structure occurs when a field itself has subfields so the type of the field itself has a schema. Fields that are array or map types is also a common feature of these structured records.
+
+For example, consider the following schema, representing actions in a fictitious e-commerce company:
+
+**Purchase**
+
+```
+Field Name          Field Type
+userId              STRING
+itemId              INT64
+shippingAddress     ROW(ShippingAddress)
+cost                INT64
+transactions        ARRAY[ROW(Transaction)]
+```
+
+**ShippingAddress**
+
+```
+Field Name          Field Type
+streetAddress       STRING
+city                STRING
+state               nullable STRING
+country             STRING
+postCode            STRING
+```
+
+**Transaction**
+
+```
+Field Name          Field Type
+bank                STRING
+purchaseAmount      DOUBLE
+```
+
+Schemas provide us a type-system for Beam records that is independent of any specific programming-language type. There might be multiple Java classes that all have the same schema (for example a Protocol-Buffer class or a POJO class), and Beam will allow us to seamlessly convert between these types. Schemas also provide a simple way to reason about types across different programming-language APIs.
+
+A `PCollection` with a schema does not need to have a `Coder` specified, as Beam knows how to encode and decode Schema rows; Beam uses a special coder to encode schema types.
+
+### Creating Schemas
+
+While schemas themselves are language independent, they are designed to embed naturally into the programming languages of the Beam SDK being used. This allows Beam users to continue using native types while reaping the advantage of having Beam understand their element schemas.

Review Comment:
While schemas are language-independent, they are designed to be embedded naturally into the programming languages supported by the Beam SDK. You can continue using Java native types with Beam while taking advantage of schema-based transforms.
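
To make the point about native Java types concrete, here is a minimal sketch of how the `Purchase`, `ShippingAddress`, and `Transaction` schemas from the hunk could map onto plain Java classes. It is illustrative only and not taken from the PR. The class `SchemaSketch` and its helper methods are hypothetical names, and the sample values are made up. The sketch deliberately avoids any Beam dependency so it compiles on its own. In a real pipeline the classes would presumably also need a schema annotation (assumed here to be `@DefaultSchema(JavaFieldSchema.class)`) so that Beam can infer the schema from the public fields.

```java
import java.util.ArrayList;
import java.util.List;

// Plain Java classes mirroring the three schemas quoted in the diff.
// Assumption: in a Beam pipeline each class would also carry a schema
// annotation such as @DefaultSchema(JavaFieldSchema.class). It is left out
// here so the block compiles without the Beam SDK on the classpath.

class Transaction {
    public String bank;            // STRING
    public double purchaseAmount;  // DOUBLE
}

class ShippingAddress {
    public String streetAddress;   // STRING
    public String city;            // STRING
    public String state;           // nullable STRING: null means "not set"
    public String country;         // STRING
    public String postCode;        // STRING
}

class Purchase {
    public String userId;                    // STRING
    public long itemId;                      // INT64
    public ShippingAddress shippingAddress;  // ROW(ShippingAddress), a nested record
    public long cost;                        // INT64
    public List<Transaction> transactions = new ArrayList<>();  // ARRAY[ROW(Transaction)]
}

// Hypothetical helper class, used only to exercise the types above.
class SchemaSketch {

    static Transaction transaction(String bank, double amount) {
        Transaction t = new Transaction();
        t.bank = bank;
        t.purchaseAmount = amount;
        return t;
    }

    // Builds one made-up purchase with a nested address and two transactions.
    static Purchase samplePurchase() {
        ShippingAddress address = new ShippingAddress();
        address.streetAddress = "1 Example Street";
        address.city = "Exampleville";
        address.state = null;  // optional field left unset
        address.country = "Exampleland";
        address.postCode = "00000";

        Purchase purchase = new Purchase();
        purchase.userId = "user-1";
        purchase.itemId = 42L;
        purchase.shippingAddress = address;
        purchase.cost = 20L;
        purchase.transactions.add(transaction("Bank A", 12.5));
        purchase.transactions.add(transaction("Bank B", 7.5));
        return purchase;
    }

    // Sums purchaseAmount over the repeated transactions field.
    static double totalPaid(Purchase purchase) {
        double total = 0.0;
        for (Transaction t : purchase.transactions) {
            total += t.purchaseAmount;
        }
        return total;
    }

    public static void main(String[] args) {
        Purchase purchase = samplePurchase();
        System.out.println(purchase.userId + " paid " + totalPaid(purchase));
    }
}
```

Each field comment names the schema field type it corresponds to, so the nested `ROW`, the `ARRAY` of rows, and the nullable `state` field can be compared directly with the tables in the description.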
