[GitHub] [beam] iemejia opened a new pull request #11010: Fix incorrectly formatted class in sdks/java/core
iemejia opened a new pull request #11010: Fix incorrectly formatted class in sdks/java/core URL: https://github.com/apache/beam/pull/11010 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [beam] iemejia commented on a change in pull request #11009: [BEAM-9342] Update bytebuddy to version 1.10.8
iemejia commented on a change in pull request #11009: [BEAM-9342] Update bytebuddy to version 1.10.8 URL: https://github.com/apache/beam/pull/11009#discussion_r386081545 ## File path: sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/ParDoTest.java ## @@ -514,9 +514,7 @@ public void proccessElement(ProcessContext c) {} SingleOutput, String> parDo = ParDo.of(fn); // Use the parDo in a pipeline to cause state coders to be inferred. - pipeline - .apply(Create.of(KV.of("input", "value"))) - .apply(parDo); + pipeline.apply(Create.of(KV.of("input", "value"))).apply(parDo); Review comment: changes in this class are unrelated but auto-applied by `spotlessApply`
[GitHub] [beam] iemejia opened a new pull request #11009: [BEAM-9342] Update bytebuddy to version 1.10.8
iemejia opened a new pull request #11009: [BEAM-9342] Update bytebuddy to version 1.10.8 URL: https://github.com/apache/beam/pull/11009 R: @kennknowles
svn commit: r38342 - /dev/beam/vendor/beam-vendor-bytebuddy-1_10_8/
Author: iemejia Date: Sun Mar 1 06:26:09 2020 New Revision: 38342 Log: Move Apache Beam vendored Byte Buddy 1.10.8 v0.1 to release branch Removed: dev/beam/vendor/beam-vendor-bytebuddy-1_10_8/
svn commit: r38341 - /release/beam/vendor/beam-vendor-bytebuddy-1_10_8/
Author: iemejia Date: Sun Mar 1 06:26:02 2020 New Revision: 38341 Log: Move Apache Beam vendored Byte Buddy 1.10.8 v0.1 to release branch Added: release/beam/vendor/beam-vendor-bytebuddy-1_10_8/ release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip (with props) release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip.asc release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip.sha512 Added: release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip == Binary file - no diff available. Propchange: release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip -- svn:mime-type = application/octet-stream Added: release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip.asc == --- release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip.asc (added) +++ release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip.asc Sun Mar 1 06:26:02 2020 @@ -0,0 +1,16 @@ +-BEGIN PGP SIGNATURE- + +iQIzBAABCgAdFiEENBVjFynhWzMFGttnCp2vZxO4Y0kFAl5WNx4ACgkQCp2vZxO4 +Y0m7iQ//RlHHKJOLwdcPOv07//1HNpQrMJxBrbvzJ04LxqpqsVGssDR2sUCwb8Jv +2bZUCs3c1U3ddLGTF+z+Rd7lq9ZSfxZEcPzDC+z4k8Q4WuQgMMgMq2aI2P9iCs02 +2IUaghd1/YWpALsqT+LhAop11tGM4DUwb+7qSt9qzb6SfilB0M8qIb8Zyx1LBQWV +1HOzaHH5/QUUcrXKdYXP2znxPq9oRxVVN4KNAuQ1wZ7+TdsUsZJsMVpOYjhYB/h7 +eSsNivnNtapSqhGfhAggSAehY7L9MtTUKCdYdXJbZ8X1qOfa25AJYWF/5qwsU3Jl +hskdBLf9d4MESocUTT/mDhljY1hL63S0CAXH99j/hQ5xGDnnTzHA5slRXkMCPJoz +V9NExgszIxG/5mQNyipeSGkyFC+5BryNp8gCrZoGGQlAlJmMZpl3OgHsojPX5c/6 +64pk7I/vKKjnL/d12l8Y31+1CdYXoUZlUon48jrohtaYF58ja4mGTszruPFuzfx/ 
+b/dkSVAw1GRHskqxbsGVet/Z7heHfSPDXzkm+Upyu15yO6EDnQolITKMW8TYWx4C +z7urq84Y13+tOEj6HCvOXDovsCIPQTj97YwkStNOOxOP5Ro4nZB0qGu5wjW88HnK +MX5PlfRMHMVeO2vho1AUoajtEMkpFWG8fhRZm4VcMYeKt+jiYf8= +=ojL1 +-END PGP SIGNATURE- Added: release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip.sha512 == --- release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip.sha512 (added) +++ release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip.sha512 Sun Mar 1 06:26:02 2020 @@ -0,0 +1 @@ +bf78b061cee54e5a59148e817ea83cb55960c25807ed1872286a3b8175da2f68644da6557f7f24bcbcccb226c85572fe73b188fc030856c20641a9c7372426e5 apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip
[beam] branch master updated (3b26ebf -> 116c5e8)
This is an automated email from the ASF dual-hosted git repository. iemejia pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/beam.git.

 from 3b26ebf Adds DisplayData for StateSpecs used by stateful ParDos
 add 612d3d1 [BEAM-9384] Add SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types
 add 116c5e8 Merge pull request #10974: [BEAM-9384] Add SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types

No new revisions were added by this update.

Summary of changes:
 .../org/apache/beam/sdk/schemas/SchemaCoder.java | 3 +-
 .../apache/beam/sdk/schemas/SchemaRegistry.java  | 57 --
 .../beam/sdk/schemas/SchemaRegistryTest.java     | 20
 3 files changed, 63 insertions(+), 17 deletions(-)
[GitHub] [beam] iemejia merged pull request #10974: [BEAM-9384] Add SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types
iemejia merged pull request #10974: [BEAM-9384] Add SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types URL: https://github.com/apache/beam/pull/10974
[GitHub] [beam] iemejia commented on a change in pull request #10974: [BEAM-9384] Add SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types
iemejia commented on a change in pull request #10974: [BEAM-9384] Add SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types URL: https://github.com/apache/beam/pull/10974#discussion_r386080476 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaRegistry.java ## @@ -288,6 +281,38 @@ public void registerSchemaProvider(SchemaProvider schemaProvider) { return getProviderResult((SchemaProvider p) -> p.fromRowFunction(typeDescriptor)); } + /** + * Retrieve a {@link SchemaCoder} for a given {@link Class} type. If no schema exists, throws + * {@link * NoSuchSchemaException}. + */ + public SchemaCoder getSchemaCoder(Class clazz) throws NoSuchSchemaException { +return getSchemaCoder(TypeDescriptor.of(clazz)); + } + + /** + * Retrieve a {@link SchemaCoder} for a given {@link TypeDescriptor} type. If no schema exists, + * throws {@link * NoSuchSchemaException}. + */ + public SchemaCoder getSchemaCoder(TypeDescriptor typeDescriptor) + throws NoSuchSchemaException { +return SchemaCoder.of( +getSchema(typeDescriptor), +typeDescriptor, +getToRowFunction(typeDescriptor), +getFromRowFunction(typeDescriptor)); Review comment: Interesting, I had not thought about making KafkaRecord schema-like; good point. There are some consequences of that that are still not clear to me (like how we will deal with the runtime resolution part of Schemas for KV, which we do now with the Confluent Schema Registry support). I am going to give it a try and ping you once I have something in the other PR #10978. Let's continue that discussion there.
[GitHub] [beam] YYTVicky opened a new pull request #11008: Update comment to tell user this is not secure
YYTVicky opened a new pull request #11008: Update comment to tell user this is not secure URL: https://github.com/apache/beam/pull/11008
[GitHub] [beam] matthiasa4 commented on issue #10958: [BEAM] Submitting final communication strategy
matthiasa4 commented on issue #10958: [BEAM] Submitting final communication strategy URL: https://github.com/apache/beam/pull/10958#issuecomment-592980206 LGTM! I think some of the artifacts could also go on the [community part](https://beam.apache.org/community/) of the website if rewritten as guidelines?
[GitHub] [beam] reuvenlax commented on issue #10974: [BEAM-9384] Add SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types
reuvenlax commented on issue #10974: [BEAM-9384] Add SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types URL: https://github.com/apache/beam/pull/10974#issuecomment-592962672 lgtm
[GitHub] [beam] reuvenlax commented on a change in pull request #10974: [BEAM-9384] Add SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types
reuvenlax commented on a change in pull request #10974: [BEAM-9384] Add SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types URL: https://github.com/apache/beam/pull/10974#discussion_r386038872 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaRegistry.java ## @@ -288,6 +281,38 @@ public void registerSchemaProvider(SchemaProvider schemaProvider) { return getProviderResult((SchemaProvider p) -> p.fromRowFunction(typeDescriptor)); } + /** + * Retrieve a {@link SchemaCoder} for a given {@link Class} type. If no schema exists, throws + * {@link * NoSuchSchemaException}. + */ + public SchemaCoder getSchemaCoder(Class clazz) throws NoSuchSchemaException { +return getSchemaCoder(TypeDescriptor.of(clazz)); + } + + /** + * Retrieve a {@link SchemaCoder} for a given {@link TypeDescriptor} type. If no schema exists, + * throws {@link * NoSuchSchemaException}. + */ + public SchemaCoder getSchemaCoder(TypeDescriptor typeDescriptor) + throws NoSuchSchemaException { +return SchemaCoder.of( +getSchema(typeDescriptor), +typeDescriptor, +getToRowFunction(typeDescriptor), +getFromRowFunction(typeDescriptor)); Review comment: It's useful when integrating schema code with code that does not yet understand schemas. In the KafkaIO example I think that the ideal solution would be to allow a Schema on KafkaRecord (this probably requires us to add Java generic type awareness to schema inference though), in which case the keyCoder and valueCoder aren't needed. I agree that allowing easy inference of SchemaCoder allows for lower-effort integration of schemas in code like this, though hopefully this is just a short-term solution.
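The `getSchemaCoder` overloads in the quoted diff compose four existing registry lookups (schema, type descriptor, to-row function, from-row function) into a single `SchemaCoder`. A toy, stdlib-only Java sketch of that composition pattern follows; all names here (`ToySchemaRegistry`, `ToySchemaCoder`) are hypothetical stand-ins, not Beam's real classes:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy analogue of Beam's SchemaCoder: a schema description plus
// serialization functions, bundled into one object.
class ToySchemaCoder<T> {
  final String schema;
  final Function<T, String> toRow;
  final Function<String, T> fromRow;

  ToySchemaCoder(String schema, Function<T, String> toRow, Function<String, T> fromRow) {
    this.schema = schema;
    this.toRow = toRow;
    this.fromRow = fromRow;
  }
}

// Toy analogue of SchemaRegistry: maps a Class to its registered parts and,
// like the PR's getSchemaCoder, hands back one composed coder.
class ToySchemaRegistry {
  private final Map<Class<?>, ToySchemaCoder<?>> entries = new HashMap<>();

  <T> void register(Class<T> clazz, String schema,
                    Function<T, String> toRow, Function<String, T> fromRow) {
    entries.put(clazz, new ToySchemaCoder<>(schema, toRow, fromRow));
  }

  // Throws if no schema is registered, analogous to NoSuchSchemaException.
  @SuppressWarnings("unchecked")
  <T> ToySchemaCoder<T> getSchemaCoder(Class<T> clazz) {
    ToySchemaCoder<?> coder = entries.get(clazz);
    if (coder == null) {
      throw new IllegalArgumentException("No schema registered for " + clazz);
    }
    return (ToySchemaCoder<T>) coder;
  }
}

public class Main {
  public static void main(String[] args) {
    ToySchemaRegistry registry = new ToySchemaRegistry();
    registry.register(Integer.class, "INT32", Object::toString, Integer::valueOf);

    ToySchemaCoder<Integer> coder = registry.getSchemaCoder(Integer.class);
    // Round-trip an element through the composed coder.
    System.out.println(coder.schema + " " + coder.fromRow.apply(coder.toRow.apply(42)));
  }
}
```

The point of the pattern, as in the PR, is that callers ask the registry once and get everything needed to encode and decode a registered type.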
[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas
reuvenlax commented on a change in pull request #10767: Document Beam Schemas URL: https://github.com/apache/beam/pull/10767#discussion_r386038309 ## File path: website/src/documentation/programming-guide.md ## @@ -1970,7 +1976,1076 @@ records.apply("WriteToText", See the [Beam-provided I/O Transforms]({{site.baseurl }}/documentation/io/built-in/) page for a list of the currently available I/O transforms.

-## 6. Data encoding and type safety {#data-encoding-and-type-safety}
+## 6. Schemas {#schemas}

Often, the types of records being processed have an obvious structure. Common Beam sources produce JSON, Avro, Protocol Buffer, or database row objects; all of these types have well-defined structures that can often be determined by examining the type. Even within a pipeline, simple Java POJOs (or equivalent structures in other languages) are often used as intermediate types, and these also have a clear structure that can be inferred by inspecting the class. By understanding the structure of a pipeline's records, we can provide much more concise APIs for data processing.

### 6.1. What is a schema {#what-is-a-schema}
Most structured records share some common characteristics:
* They can be subdivided into separate named fields. Fields usually have string names, but sometimes - as in the case of indexed tuples - have numerical indices instead.
* There is a confined list of primitive types that a field can have. These often match primitive types in most programming languages: int, long, string, etc.
* Often a field type can be marked as optional (sometimes referred to as nullable) or required.

In addition, records often have a nested structure. A nested structure occurs when a field itself has subfields, so the type of the field itself has a schema. Fields that are array or map types are also a common feature of these structured records.

For example, consider the following schema, representing actions in a fictitious e-commerce company:

**Purchase**

| Field Name | Field Type |
| --- | --- |
| userId | STRING |
| itemId | INT64 |
| shippingAddress | ROW(ShippingAddress) |
| cost | INT64 |
| transactions | ARRAY[ROW(Transaction)] |

**ShippingAddress**

| Field Name | Field Type |
| --- | --- |
| streetAddress | STRING |
| city | STRING |
| state | nullable STRING |
| country | STRING |
| postCode | STRING |

**Transaction**

| Field Name | Field Type |
| --- | --- |
| bank | STRING |
| purchaseAmount | DOUBLE |

Purchase event records are represented by the above purchase schema. Each purchase event contains a shipping address, which is a nested row containing its own schema. Each purchase also contains a list of credit-card transactions (a list, because a purchase might be split across multiple credit cards); each item in the transaction list is a row with its own schema.

This provides an abstract description of the types involved, one that is abstracted away from any specific programming language.

Schemas provide us a type system for Beam records that is independent of any specific programming-language type. There might be multiple Java classes that all have the same schema (for example a Protocol Buffer class or a POJO class), and Beam will allow us to seamlessly convert between these types. Schemas also provide a simple way to reason about types across different programming-language APIs.

A `PCollection` with a schema does not need to have a `Coder` specified, as Beam knows how to encode and decode schema rows.

### 6.2. Schemas for programming language types {#schemas-for-pl-types}
While schemas themselves are language independent, they are designed to embed naturally into the programming languages of the Beam SDK being used. This allows Beam users to continue using native types while reaping the advantage of having Beam understand their element schemas.

{:.language-java}
In Java you could use the following set of classes to represent the purchase schema. Beam will automatically infer the correct schema based on the members of the class.

@DefaultSchema(JavaBeanSchema.class) public class Purchase { public String getUserId(); // Returns the id of the user who made the purchase. public long getItemId(); // Returns the identifier of the item that was purchased. public ShippingAddress getShippingAddress(); // Returns the shipping address, a nested type. public long getCostCents(); // Returns the cost
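The nested Purchase/ShippingAddress/Transaction structure in the quoted excerpt can be modeled as plain Java classes. A minimal, Beam-free sketch for illustration (public final fields instead of the excerpt's getter-based JavaBean form, and no `@DefaultSchema` annotation, so there is no Beam dependency):

```java
import java.util.List;

// Plain-Java model of the ShippingAddress row from the schema above.
class ShippingAddress {
  final String streetAddress;
  final String city;
  final String state; // nullable in the schema
  final String country;
  final String postCode;

  ShippingAddress(String streetAddress, String city, String state, String country, String postCode) {
    this.streetAddress = streetAddress;
    this.city = city;
    this.state = state;
    this.country = country;
    this.postCode = postCode;
  }
}

// Plain-Java model of the Transaction row.
class Transaction {
  final String bank;
  final double purchaseAmount;

  Transaction(String bank, double purchaseAmount) {
    this.bank = bank;
    this.purchaseAmount = purchaseAmount;
  }
}

// Top-level Purchase record: one nested row plus an array of rows.
class Purchase {
  final String userId;
  final long itemId;
  final ShippingAddress shippingAddress; // nested row
  final long cost;
  final List<Transaction> transactions;  // array of rows

  Purchase(String userId, long itemId, ShippingAddress shippingAddress,
           long cost, List<Transaction> transactions) {
    this.userId = userId;
    this.itemId = itemId;
    this.shippingAddress = shippingAddress;
    this.cost = cost;
    this.transactions = transactions;
  }
}

public class Main {
  public static void main(String[] args) {
    // A purchase split across two cards, as described in the text.
    Purchase p = new Purchase(
        "user-1", 42L,
        new ShippingAddress("1 Main St", "Springfield", null, "US", "12345"),
        1999L,
        List.of(new Transaction("bankA", 10.00), new Transaction("bankB", 9.99)));
    System.out.println(p.userId + " " + p.shippingAddress.country + " " + p.transactions.size());
  }
}
```

The nesting mirrors the schema: the `shippingAddress` field is itself a row, and `transactions` is an array whose elements are rows.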
[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas
reuvenlax commented on a change in pull request #10767: Document Beam Schemas URL: https://github.com/apache/beam/pull/10767#discussion_r386038301 ## File path: website/src/documentation/programming-guide.md ## @@ -1970,7 +1976,1076 @@
[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas
reuvenlax commented on a change in pull request #10767: Document Beam Schemas URL: https://github.com/apache/beam/pull/10767#discussion_r386038304 ## File path: website/src/documentation/programming-guide.md ## @@ -1970,7 +1976,1076 @@
[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas
reuvenlax commented on a change in pull request #10767: Document Beam Schemas URL: https://github.com/apache/beam/pull/10767#discussion_r386038297 ## File path: website/src/documentation/programming-guide.md ## @@ -1970,7 +1976,1076 @@ records.apply("WriteToText", See the [Beam-provided I/O Transforms]({{site.baseurl }}/documentation/io/built-in/) page for a list of the currently available I/O transforms. -## 6. Data encoding and type safety {#data-encoding-and-type-safety} +## 6. Schemas {#schemas} +Often, the type of records being processed have an obvious structure. Common Beam sources produce +JSON, Avro, Protocol Buffer, or database row objects; all of these types have well defined structures, +structures that can often be determined by examining the type. Even within a pipeline, Simple Java POJOs +(or equivalent structures in other languages) are often used as intermediate types, and these also have a + clear structure that can be inferred by inspecting the class. By understanding the structure of a pipeline’s + records, we can provide much more concise APIs for data processing. + +### 6.1. What is a schema {#what-is-a-schema} +Most structured records share some common characteristics: +* They can be subdivided into separate named fields. Fields usually have string names, but sometimes - as in the case of indexed + tuples - have numerical indices instead. +* There is a confined list of primitive types that a field can have. These often match primitive types in most programming + languages: int, long, string, etc. +* Often a field type can be marked as optional (sometimes referred to as nullable) or required. + +In addition, often records have a nested structure. A nested structure occurs when a field itself has subfields so the +type of the field itself has a schema. Fields that are array or map types is also a common feature of these structured +records. 
For example, consider the following schema, representing actions in a fictitious e-commerce company:

**Purchase**

| Field Name | Field Type |
| --- | --- |
| userId | STRING |
| itemId | INT64 |
| shippingAddress | ROW(ShippingAddress) |
| cost | INT64 |
| transactions | ARRAY[ROW(Transaction)] |

**ShippingAddress**

| Field Name | Field Type |
| --- | --- |
| streetAddress | STRING |
| city | STRING |
| state | nullable STRING |
| country | STRING |
| postCode | STRING |

**Transaction**

| Field Name | Field Type |
| --- | --- |
| bank | STRING |
| purchaseAmount | DOUBLE |

Purchase event records are represented by the above purchase schema. Each purchase event contains a shipping address, which is a nested row containing its own schema. Each purchase also contains a list of credit-card transactions (a list, because a purchase might be split across multiple credit cards); each item in the transaction list is a row with its own schema.

This provides an abstract description of the types involved, one that is abstracted away from any specific programming language.

Schemas provide a type system for Beam records that is independent of any specific programming-language type. There might be multiple Java classes that all have the same schema (for example a Protocol Buffer class or a POJO class), and Beam will allow us to seamlessly convert between these types. Schemas also provide a simple way to reason about types across different programming-language APIs.

A `PCollection` with a schema does not need to have a `Coder` specified, as Beam knows how to encode and decode schema rows.

### 6.2. Schemas for programming language types {#schemas-for-pl-types} Review comment: hmmm this section is trying to get across the basic theory that schemas are _the_ types, and if multiple PL types have the same schema then they are interchangeable.
The goal of this section isn't a reference, though we use some examples to illustrate the point. The other two sections you mention are meant to be more of a reference. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
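The Java example quoted from the guide is truncated in this digest. As a hedged sketch of the idea: a plain JavaBean whose getters mirror the Purchase schema fields. In Beam, annotating such a class with `@DefaultSchema(JavaBeanSchema.class)` is what triggers schema inference from the getters; the annotation is omitted here so the example compiles without the Beam SDK on the classpath, and the nested `shippingAddress` and `transactions` fields are left out for brevity.

```java
// Sketch of a JavaBean matching the (simplified) Purchase schema from the
// quoted guide. With the Beam SDK on the classpath, annotating this class with
// @DefaultSchema(JavaBeanSchema.class) would let Beam infer a schema with
// fields userId (STRING), itemId (INT64), and costCents (INT64) from the
// getter methods below.
public class Purchase {
    private final String userId;   // id of the user who made the purchase
    private final long itemId;     // identifier of the item that was purchased
    private final long costCents;  // cost of the purchase, in cents

    public Purchase(String userId, long itemId, long costCents) {
        this.userId = userId;
        this.itemId = itemId;
        this.costCents = costCents;
    }

    public String getUserId() { return userId; }
    public long getItemId() { return itemId; }
    public long getCostCents() { return costCents; }

    public static void main(String[] args) {
        Purchase p = new Purchase("user-1", 42L, 1999L);
        System.out.println(p.getUserId() + " " + p.getItemId() + " " + p.getCostCents());
    }
}
```

Because the schema is inferred from the bean's structure rather than declared per class, any other type with the same fields (a POJO, a Row, a Protocol Buffer message) carries the same schema and is interchangeable with it, which is the point the section above is making.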
[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas
reuvenlax commented on a change in pull request #10767: Document Beam Schemas URL: https://github.com/apache/beam/pull/10767#discussion_r386038150 ## File path: website/src/documentation/programming-guide.md (quoted diff context elided; it repeats the schema section quoted in full above)
[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas
reuvenlax commented on a change in pull request #10767: Document Beam Schemas URL: https://github.com/apache/beam/pull/10767#discussion_r386037819 ## File path: website/src/documentation/programming-guide.md (quoted diff context elided; it repeats the schema section quoted in full above) Review comment: I meant within a pipeline as that's where users create these intermediate objects - maybe SDK pipeline?
[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas
reuvenlax commented on a change in pull request #10767: Document Beam Schemas URL: https://github.com/apache/beam/pull/10767#discussion_r386037870 ## File path: website/src/documentation/programming-guide.md (quoted diff context elided; it repeats the schema section quoted in full above) Review comment: This is true for other tables in this doc as well (e.g. coders). Maybe we should figure out a way to fix all of them at once?
[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas
reuvenlax commented on a change in pull request #10767: Document Beam Schemas URL: https://github.com/apache/beam/pull/10767#discussion_r386037889 ## File path: website/src/documentation/programming-guide.md (quoted diff context elided; it repeats the schema section quoted in full above) Review comment: done
[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas
reuvenlax commented on a change in pull request #10767: Document Beam Schemas URL: https://github.com/apache/beam/pull/10767#discussion_r386037750 ## File path: website/src/documentation/programming-guide.md (quoted diff context elided; it repeats the schema section quoted in full above) Review comment: I think better to wait until we add Python schema docs
[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas
reuvenlax commented on a change in pull request #10767: Document Beam Schemas URL: https://github.com/apache/beam/pull/10767#discussion_r386037654 ## File path: website/src/documentation/programming-guide.md (quoted diff context elided; it repeats the schema section quoted in full above)
[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas
reuvenlax commented on a change in pull request #10767: Document Beam Schemas URL: https://github.com/apache/beam/pull/10767#discussion_r386037615 ## File path: website/src/documentation/programming-guide.md (quoted diff context elided; it repeats the schema section quoted in full above)
[GitHub] [beam] steveniemitz commented on issue #10852: [BEAM-9308] Decorrelate state cleanup timers
steveniemitz commented on issue #10852: [BEAM-9308] Decorrelate state cleanup timers URL: https://github.com/apache/beam/pull/10852#issuecomment-592959333

> maybe we need to explore the prioritization issue a bit more.

Agreed, I think ideally the state cleanup timers would have a (much?) lower priority than everything else so they don't starve out more important "user" work.

> Is this a blocker for you? If so then maybe we can add a parameter to DataflowPipelineOptions to control this so we don't take the risk of changing the default behavior without more data.

We run our own fork of the anyways, so it's not particularly a blocker here. I mostly just intended this PR as a conversation starter. I am curious about your comment above though ("We currently rely on the state cleanup timer for watermark holds"). From what I've observed in the code, the state cleanup is set for after the window end, so delaying it slightly more shouldn't cause any correctness issues, correct?
[GitHub] [beam] reuvenlax commented on issue #10852: [BEAM-9308] Decorrelate state cleanup timers
reuvenlax commented on issue #10852: [BEAM-9308] Decorrelate state cleanup timers URL: https://github.com/apache/beam/pull/10852#issuecomment-592958092 I'm trying to think of a principled way to do this - maybe we need to explore the prioritization issue a bit more. Is this a blocker for you? If so then maybe we can add a parameter to DataflowPipelineOptions to control this so we don't take the risk of changing the default behavior without more data.
[beam] branch master updated (6e69b26 -> 3b26ebf)
This is an automated email from the ASF dual-hosted git repository. alexvanboxel pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/beam.git. from 6e69b26 Merge pull request #11001: [BEAM-9405] Fix post-commit error about create_job_service add 3b26ebf Adds DisplayData for StateSpecs used by stateful ParDos No new revisions were added by this update. Summary of changes: .../apache/beam/sdk/coders/SerializableCoder.java | 5 +++ .../java/org/apache/beam/sdk/transforms/ParDo.java | 47 -- .../org/apache/beam/sdk/transforms/ParDoTest.java | 47 ++ 3 files changed, 96 insertions(+), 3 deletions(-)
[GitHub] [beam] alexvanboxel merged pull request #11004: Adds DisplayData for StateSpecs used by stateful ParDos
alexvanboxel merged pull request #11004: Adds DisplayData for StateSpecs used by stateful ParDos URL: https://github.com/apache/beam/pull/11004
[GitHub] [beam] chadrik opened a new pull request #11007: [BEAM-7746] Runtime change to timestamp/duration equality
chadrik opened a new pull request #11007: [BEAM-7746] Runtime change to timestamp/duration equality URL: https://github.com/apache/beam/pull/11007

mypy *strongly* recommends that "__eq__" work with arbitrary objects, by returning `NotImplemented`. This PR prevents the following mypy errors:

```
apache_beam/utils/timestamp.py:193: error: Argument 1 of "__eq__" is incompatible with supertype "object"; supertype defines the argument type as "object" [override]
apache_beam/utils/timestamp.py:193: note: It is recommended for "__eq__" to work with arbitrary objects, for example:
apache_beam/utils/timestamp.py:193: note:     def __eq__(self, other: object) -> bool:
apache_beam/utils/timestamp.py:193: note:         if not isinstance(other, Timestamp):
apache_beam/utils/timestamp.py:193: note:             return NotImplemented
apache_beam/utils/timestamp.py:193: note:         return ...
apache_beam/utils/timestamp.py:346: error: Argument 1 of "__eq__" is incompatible with supertype "object"; supertype defines the argument type as "object" [override]
apache_beam/utils/timestamp.py:346: note: It is recommended for "__eq__" to work with arbitrary objects, for example:
apache_beam/utils/timestamp.py:346: note:     def __eq__(self, other: object) -> bool:
apache_beam/utils/timestamp.py:346: note:         if not isinstance(other, Duration):
apache_beam/utils/timestamp.py:346: note:             return NotImplemented
apache_beam/utils/timestamp.py:346: note:         return ...
```

I wrote an equality test in `timestamp_test.py` a while back to cover this case. For reference, it looks like this:

```python
def test_equality(self):
  for min_val in (Timestamp(1), Duration(1), 1, 1.1):
    for max_val in (Timestamp(123), Duration(123), 123, 123.4):
      self.assertTrue(min_val < max_val, "%s < %s" % (min_val, max_val))
      self.assertTrue(min_val <= max_val, "%s <= %s" % (min_val, max_val))
      self.assertTrue(max_val > min_val, "%s > %s" % (max_val, min_val))
      self.assertTrue(max_val >= min_val, "%s >= %s" % (max_val, min_val))
```

Thank you for your contribution!
Follow this checklist to help us incorporate your contribution quickly and easily:

 - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`).
 - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
 - [ ] Update `CHANGES.md` with noteworthy changes.
 - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).

(Post-commit test status badge matrix elided)