[GitHub] [beam] iemejia opened a new pull request #11010: Fix non correctly formatted class in sdks/java/core

2020-02-29 Thread GitBox
iemejia opened a new pull request #11010: Fix non correctly formatted class in 
sdks/java/core
URL: https://github.com/apache/beam/pull/11010
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [beam] iemejia commented on a change in pull request #11009: [BEAM-9342] Update bytebuddy to version 1.10.8

2020-02-29 Thread GitBox
iemejia commented on a change in pull request #11009: [BEAM-9342] Update 
bytebuddy to version 1.10.8
URL: https://github.com/apache/beam/pull/11009#discussion_r386081545
 
 

 ##
 File path: 
sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/ParDoTest.java
 ##
 @@ -514,9 +514,7 @@ public void proccessElement(ProcessContext c) {}
  SingleOutput<KV<String, String>, String> parDo = ParDo.of(fn);
 
   // Use the parDo in a pipeline to cause state coders to be inferred.
-  pipeline
-  .apply(Create.of(KV.of("input", "value")))
-  .apply(parDo);
+  pipeline.apply(Create.of(KV.of("input", "value"))).apply(parDo);
 
 Review comment:
   changes in this class are unrelated but auto applied by `spotlessApply`




[GitHub] [beam] iemejia opened a new pull request #11009: [BEAM-9342] Update bytebuddy to version 1.10.8

2020-02-29 Thread GitBox
iemejia opened a new pull request #11009: [BEAM-9342] Update bytebuddy to 
version 1.10.8
URL: https://github.com/apache/beam/pull/11009
 
 
   R: @kennknowles  




svn commit: r38342 - /dev/beam/vendor/beam-vendor-bytebuddy-1_10_8/

2020-02-29 Thread iemejia
Author: iemejia
Date: Sun Mar  1 06:26:09 2020
New Revision: 38342

Log:
Move Apache Beam vendored Byte Buddy 1.10.8 v0.1 to release branch

Removed:
dev/beam/vendor/beam-vendor-bytebuddy-1_10_8/



svn commit: r38341 - /release/beam/vendor/beam-vendor-bytebuddy-1_10_8/

2020-02-29 Thread iemejia
Author: iemejia
Date: Sun Mar  1 06:26:02 2020
New Revision: 38341

Log:
Move Apache Beam vendored Byte Buddy 1.10.8 v0.1 to release branch

Added:
release/beam/vendor/beam-vendor-bytebuddy-1_10_8/

release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip
   (with props)

release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip.asc

release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip.sha512

Added: 
release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip
==
Binary file - no diff available.

Propchange: 
release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip
--
svn:mime-type = application/octet-stream

Added: 
release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip.asc
==
--- 
release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip.asc
 (added)
+++ 
release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip.asc
 Sun Mar  1 06:26:02 2020
@@ -0,0 +1,16 @@
+-BEGIN PGP SIGNATURE-
+
+iQIzBAABCgAdFiEENBVjFynhWzMFGttnCp2vZxO4Y0kFAl5WNx4ACgkQCp2vZxO4
+Y0m7iQ//RlHHKJOLwdcPOv07//1HNpQrMJxBrbvzJ04LxqpqsVGssDR2sUCwb8Jv
+2bZUCs3c1U3ddLGTF+z+Rd7lq9ZSfxZEcPzDC+z4k8Q4WuQgMMgMq2aI2P9iCs02
+2IUaghd1/YWpALsqT+LhAop11tGM4DUwb+7qSt9qzb6SfilB0M8qIb8Zyx1LBQWV
+1HOzaHH5/QUUcrXKdYXP2znxPq9oRxVVN4KNAuQ1wZ7+TdsUsZJsMVpOYjhYB/h7
+eSsNivnNtapSqhGfhAggSAehY7L9MtTUKCdYdXJbZ8X1qOfa25AJYWF/5qwsU3Jl
+hskdBLf9d4MESocUTT/mDhljY1hL63S0CAXH99j/hQ5xGDnnTzHA5slRXkMCPJoz
+V9NExgszIxG/5mQNyipeSGkyFC+5BryNp8gCrZoGGQlAlJmMZpl3OgHsojPX5c/6
+64pk7I/vKKjnL/d12l8Y31+1CdYXoUZlUon48jrohtaYF58ja4mGTszruPFuzfx/
+b/dkSVAw1GRHskqxbsGVet/Z7heHfSPDXzkm+Upyu15yO6EDnQolITKMW8TYWx4C
+z7urq84Y13+tOEj6HCvOXDovsCIPQTj97YwkStNOOxOP5Ro4nZB0qGu5wjW88HnK
+MX5PlfRMHMVeO2vho1AUoajtEMkpFWG8fhRZm4VcMYeKt+jiYf8=
+=ojL1
+-END PGP SIGNATURE-

Added: 
release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip.sha512
==
--- 
release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip.sha512
 (added)
+++ 
release/beam/vendor/beam-vendor-bytebuddy-1_10_8/apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip.sha512
 Sun Mar  1 06:26:02 2020
@@ -0,0 +1 @@
+bf78b061cee54e5a59148e817ea83cb55960c25807ed1872286a3b8175da2f68644da6557f7f24bcbcccb226c85572fe73b188fc030856c20641a9c7372426e5
  apache-beam-63492776f154464f67533a6059f162e6b8cf7315-source-release.zip
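The `.sha512` file added above holds a `<hex digest>  <filename>` pair, so verifying a downloaded artifact amounts to recomputing the digest and comparing. A minimal Java sketch of that check (the class and method names here are illustrative, not part of any Beam tooling):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

// Sketch: check a downloaded release zip against its .sha512 sidecar file,
// which (as in the commit above) contains "<hex digest>  <filename>".
public class Sha512Check {

  // Hex-encode the SHA-512 digest of a byte array.
  static String sha512Hex(byte[] data) throws Exception {
    byte[] digest = MessageDigest.getInstance("SHA-512").digest(data);
    StringBuilder sb = new StringBuilder();
    for (byte b : digest) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  // True when the artifact's actual digest matches the first token of the
  // .sha512 file (the trailing filename in that file is ignored here).
  static boolean matches(Path artifact, Path sha512File) throws Exception {
    String expected = Files.readString(sha512File).trim().split("\\s+")[0];
    return sha512Hex(Files.readAllBytes(artifact)).equalsIgnoreCase(expected);
  }
}
```

The `.asc` signature would additionally be verified with GPG against the release managers' published keys; that half is not shown here.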




[beam] branch master updated (3b26ebf -> 116c5e8)

2020-02-29 Thread iemejia
This is an automated email from the ASF dual-hosted git repository.

iemejia pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git.


from 3b26ebf  Adds DisplayData for StateSpecs used by stateful ParDos
 add 612d3d1  [BEAM-9384] Add SchemaRegistry.getSchemaCoder to get 
SchemaCoders for registered types
 add 116c5e8  Merge pull request #10974: [BEAM-9384] Add 
SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types

No new revisions were added by this update.

Summary of changes:
 .../org/apache/beam/sdk/schemas/SchemaCoder.java   |  3 +-
 .../apache/beam/sdk/schemas/SchemaRegistry.java| 57 --
 .../beam/sdk/schemas/SchemaRegistryTest.java   | 20 
 3 files changed, 63 insertions(+), 17 deletions(-)



[GitHub] [beam] iemejia merged pull request #10974: [BEAM-9384] Add SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types

2020-02-29 Thread GitBox
iemejia merged pull request #10974: [BEAM-9384] Add 
SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types
URL: https://github.com/apache/beam/pull/10974
 
 
   




[GitHub] [beam] iemejia commented on a change in pull request #10974: [BEAM-9384] Add SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types

2020-02-29 Thread GitBox
iemejia commented on a change in pull request #10974: [BEAM-9384] Add 
SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types
URL: https://github.com/apache/beam/pull/10974#discussion_r386080476
 
 

 ##
 File path: 
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaRegistry.java
 ##
 @@ -288,6 +281,38 @@ public void registerSchemaProvider(SchemaProvider 
schemaProvider) {
 return getProviderResult((SchemaProvider p) -> 
p.fromRowFunction(typeDescriptor));
   }
 
+  /**
+   * Retrieve a {@link SchemaCoder} for a given {@link Class} type. If no schema exists, throws
+   * {@link NoSuchSchemaException}.
+   */
+  public <T> SchemaCoder<T> getSchemaCoder(Class<T> clazz) throws NoSuchSchemaException {
+    return getSchemaCoder(TypeDescriptor.of(clazz));
+  }
+
+  /**
+   * Retrieve a {@link SchemaCoder} for a given {@link TypeDescriptor} type. If no schema exists,
+   * throws {@link NoSuchSchemaException}.
+   */
+  public <T> SchemaCoder<T> getSchemaCoder(TypeDescriptor<T> typeDescriptor)
+      throws NoSuchSchemaException {
+    return SchemaCoder.of(
+        getSchema(typeDescriptor),
+        typeDescriptor,
+        getToRowFunction(typeDescriptor),
+        getFromRowFunction(typeDescriptor));
 
 Review comment:
   Interesting. I had not thought about making KafkaRecord 'schema'-like, good point. There are 
some consequences of that which are still not clear to me (like how we will deal with the runtime 
resolution part of Schemas for KV that we do now with the Confluent Schema Registry support). 
I am going to give it a try and ping you once I have something in the other PR #10978. Let's 
continue that discussion there.




[GitHub] [beam] YYTVicky opened a new pull request #11008: Update comment to tell user this is not secure

2020-02-29 Thread GitBox
YYTVicky opened a new pull request #11008: Update comment to tell user this is 
not secure
URL: https://github.com/apache/beam/pull/11008
 
 
   **Please** add a meaningful description for your change here
   
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [ ] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] Update `CHANGES.md` with noteworthy changes.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more 
tips on [how to make review process 
smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   

[GitHub] [beam] matthiasa4 commented on issue #10958: [BEAM] Submitting final communication strategy

2020-02-29 Thread GitBox
matthiasa4 commented on issue #10958: [BEAM] Submitting final communication 
strategy
URL: https://github.com/apache/beam/pull/10958#issuecomment-592980206
 
 
   LGTM! I think some of the artifacts could also go on the [community 
part](https://beam.apache.org/community/) of the website if rewritten as 
guidelines?




[GitHub] [beam] reuvenlax commented on issue #10974: [BEAM-9384] Add SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types

2020-02-29 Thread GitBox
reuvenlax commented on issue #10974: [BEAM-9384] Add 
SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types
URL: https://github.com/apache/beam/pull/10974#issuecomment-592962672
 
 
   lgtm




[GitHub] [beam] reuvenlax commented on a change in pull request #10974: [BEAM-9384] Add SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types

2020-02-29 Thread GitBox
reuvenlax commented on a change in pull request #10974: [BEAM-9384] Add 
SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types
URL: https://github.com/apache/beam/pull/10974#discussion_r386038872
 
 

 ##
 File path: 
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaRegistry.java
 ##
 @@ -288,6 +281,38 @@ public void registerSchemaProvider(SchemaProvider 
schemaProvider) {
 return getProviderResult((SchemaProvider p) -> 
p.fromRowFunction(typeDescriptor));
   }
 
+  /**
+   * Retrieve a {@link SchemaCoder} for a given {@link Class} type. If no schema exists, throws
+   * {@link NoSuchSchemaException}.
+   */
+  public <T> SchemaCoder<T> getSchemaCoder(Class<T> clazz) throws NoSuchSchemaException {
+    return getSchemaCoder(TypeDescriptor.of(clazz));
+  }
+
+  /**
+   * Retrieve a {@link SchemaCoder} for a given {@link TypeDescriptor} type. If no schema exists,
+   * throws {@link NoSuchSchemaException}.
+   */
+  public <T> SchemaCoder<T> getSchemaCoder(TypeDescriptor<T> typeDescriptor)
+      throws NoSuchSchemaException {
+    return SchemaCoder.of(
+        getSchema(typeDescriptor),
+        typeDescriptor,
+        getToRowFunction(typeDescriptor),
+        getFromRowFunction(typeDescriptor));
 
 Review comment:
   It's useful when integrating schema code with code that does not yet understand schemas.
   
   In the KafkaIO example I think the ideal solution would be to allow a Schema on KafkaRecord 
(this probably requires us to add Java generic type awareness to schema inference though), in 
which case the keyCoder and valueCoder aren't needed. I agree that allowing easy inference of 
SchemaCoder allows for lower-effort integration of schemas in code like this, though hopefully 
this is just a short-term solution.




[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas

2020-02-29 Thread GitBox
reuvenlax commented on a change in pull request #10767: Document Beam Schemas
URL: https://github.com/apache/beam/pull/10767#discussion_r386038309
 
 

 ##
 File path: website/src/documentation/programming-guide.md
 ##
 @@ -1970,7 +1976,1076 @@ records.apply("WriteToText",
 See the [Beam-provided I/O Transforms]({{site.baseurl 
}}/documentation/io/built-in/)
 page for a list of the currently available I/O transforms.
 
-## 6. Data encoding and type safety {#data-encoding-and-type-safety}
+## 6. Schemas {#schemas}
+Often, the types of records being processed have an obvious structure. Common Beam sources produce
+JSON, Avro, Protocol Buffer, or database row objects; all of these types have well-defined structures,
+structures that can often be determined by examining the type. Even within a pipeline, simple Java POJOs
+(or equivalent structures in other languages) are often used as intermediate types, and these also have a
+clear structure that can be inferred by inspecting the class. By understanding the structure of a pipeline's
+records, we can provide much more concise APIs for data processing.
+ 
+### 6.1. What is a schema {#what-is-a-schema}
+Most structured records share some common characteristics: 
+* They can be subdivided into separate named fields. Fields usually have 
string names, but sometimes - as in the case of indexed
+ tuples - have numerical indices instead.
+* There is a confined list of primitive types that a field can have. These 
often match primitive types in most programming 
+ languages: int, long, string, etc.
+* Often a field type can be marked as optional (sometimes referred to as 
nullable) or required.
+
+In addition, records often have a nested structure. A nested structure occurs when a field itself has subfields, so the
+type of the field itself has a schema. Fields that are array or map types are also a common feature of these structured
+records.
+
+For example, consider the following schema, representing actions in a 
fictitious e-commerce company:
+
+**Purchase**
+
+  Field Name      | Field Type
+  --------------- | ------------------------
+  userId          | STRING
+  itemId          | INT64
+  shippingAddress | ROW(ShippingAddress)
+  cost            | INT64
+  transactions    | ARRAY[ROW(Transaction)]
+
+
+**ShippingAddress**
+
+  Field Name    | Field Type
+  ------------- | ----------------
+  streetAddress | STRING
+  city          | STRING
+  state         | nullable STRING
+  country       | STRING
+  postCode      | STRING
+
+
+**Transaction**
+
+  Field Name     | Field Type
+  -------------- | -----------
+  bank           | STRING
+  purchaseAmount | DOUBLE
+
+
+Purchase event records are represented by the above purchase schema. Each 
purchase event contains a shipping address, which
+is a nested row containing its own schema. Each purchase also contains a list 
of credit-card transactions 
+(a list, because a purchase might be split across multiple credit cards); each 
item in the transaction list is a row 
+with its own schema.
+
+This provides an abstract description of the types involved, one that is 
abstracted away from any specific programming 
+language.
+
+Schemas provide us a type-system for Beam records that is independent of any 
specific programming-language type. There
+might be multiple Java classes that all have the same schema (for example a 
Protocol-Buffer class or a POJO class),
+and Beam will allow us to seamlessly convert between these types. Schemas also 
provide a simple way to reason about 
+types across different programming-language APIs.
+
+A `PCollection` with a schema does not need to have a `Coder` specified, as 
Beam knows how to encode and decode 
+Schema rows.
+
+### 6.2. Schemas for programming language types {#schemas-for-pl-types}
+While schemas themselves are language independent, they are designed to embed 
naturally into the programming languages
+of the Beam SDK being used. This allows Beam users to continue using native 
types while reaping the advantage of 
+having Beam understand their element schemas.
+ 
+ {:.language-java}
+ In Java you could use the following set of classes to represent the purchase 
schema.  Beam will automatically  
+ infer the correct schema based on the members of the class.
+
+```java
+@DefaultSchema(JavaBeanSchema.class)
+public class Purchase {
+  public String getUserId();  // Returns the id of the user who made the 
purchase.
+  public long getItemId();  // Returns the identifier of the item that was 
purchased.
+  public ShippingAddress getShippingAddress();  // Returns the shipping 
address, a nested type.
+  public long getCostCents();  // Returns the cost 
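The class-inspection idea in section 6.2 above can be illustrated with plain Java reflection. The following is a deliberately simplified, hypothetical sketch (class and method names are invented here); Beam's real `JavaBeanSchema` inference is far more involved:

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration of "schema by class inspection": derive field names and
// types from a bean's getters. This is NOT Beam's JavaBeanSchema, just a
// minimal sketch of the idea described in the quoted documentation.
public class GetterSchemaSketch {
  public static Map<String, Class<?>> inferSchema(Class<?> beanClass) {
    Map<String, Class<?>> fields = new LinkedHashMap<>();
    for (Method m : beanClass.getDeclaredMethods()) {
      String name = m.getName();
      if (name.startsWith("get") && name.length() > 3 && m.getParameterCount() == 0) {
        // e.g. "getUserId" becomes field "userId" with the getter's return type
        String field = Character.toLowerCase(name.charAt(3)) + name.substring(4);
        fields.put(field, m.getReturnType());
      }
    }
    return fields;
  }
}
```

Applied to a bean shaped like the `Purchase` class above, this would yield entries such as `userId -> String` and `itemId -> long`.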

[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas

2020-02-29 Thread GitBox
reuvenlax commented on a change in pull request #10767: Document Beam Schemas
URL: https://github.com/apache/beam/pull/10767#discussion_r386038301
 
 

[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas

2020-02-29 Thread GitBox
reuvenlax commented on a change in pull request #10767: Document Beam Schemas
URL: https://github.com/apache/beam/pull/10767#discussion_r386038304
 
 

 ##
 File path: website/src/documentation/programming-guide.md
 ##
 @@ -1970,7 +1976,1076 @@ records.apply("WriteToText",
 See the [Beam-provided I/O Transforms]({{site.baseurl 
}}/documentation/io/built-in/)
 page for a list of the currently available I/O transforms.
 
-## 6. Data encoding and type safety {#data-encoding-and-type-safety}
+## 6. Schemas {#schemas}
+Often, the type of records being processed have an obvious structure. Common 
Beam sources produce
+JSON, Avro, Protocol Buffer, or database row objects; all of these types have 
well defined structures, 
+structures that can often be determined by examining the type. Even within a 
pipeline, Simple Java POJOs 
+(or  equivalent structures in other languages) are often used as intermediate 
types, and these also have a
+ clear structure that can be inferred by inspecting the class. By 
understanding the structure of a pipeline’s 
+ records, we can provide much more concise APIs for data processing.
+ 
+### 6.1. What is a schema {#what-is-a-schema}
+Most structured records share some common characteristics: 
+* They can be subdivided into separate named fields. Fields usually have 
string names, but sometimes - as in the case of indexed
+ tuples - have numerical indices instead.
+* There is a confined list of primitive types that a field can have. These 
often match primitive types in most programming 
+ languages: int, long, string, etc.
+* Often a field type can be marked as optional (sometimes referred to as 
nullable) or required.
+
+In addition, often records have a nested structure. A nested structure occurs 
when a field itself has subfields so the 
+type of the field itself has a schema. Fields that are  array or map types is 
also a common feature of these structured 
+records.
+
+For example, consider the following schema, representing actions in a 
fictitious e-commerce company:
+
+**Purchase**
+
+  
+
+  Field Name
+  Field Type
+
+  
+  
+
+  userId
+  STRING  
+
+
+  itemId
+  INT64  
+
+
+  shippingAddress
+  ROW(ShippingAddress)  
+
+
+  cost
+  INT64  
+
+
+  transactions
+  ARRAY[ROW(Transaction)]  
+  
+  
+
+
+
+**ShippingAddress**
+
+  
+
+  Field Name
+  Field Type
+
+  
+  
+
+  streetAddress
+  STRING  
+
+
+  city
+  STRING  
+
+
+  state
+  nullable STRING  
+
+
+  country
+  STRING  
+
+
+  postCode
+  STRING  
+  
+  
+ 
+
+
+**Transaction**
+
+  
+
+  Field Name
+  Field Type
+
+  
+  
+
+  bank
+  STRING  
+
+
+  purchaseAmount
+  DOUBLE  
+  
+  
+
+
+
+Purchase event records are represented by the aove purchase schema. Each 
purchase event contains a shipping address, which
+is a nested row containing its own schema. Each purchase also contains a list 
of credit-card transactions 
+(a list, because a purchase might be split across multiple credit cards); each 
item in the transaction list is a row 
+with its own schema.
+
+This provides an abstract description of the types involved, one that is 
abstracted away from any specific programming 
+language.
+
+Schemas provide us a type-system for Beam records that is independent of any 
specific programming-language type. There
+might be multiple Java classes that all have the same schema (for example a 
Protocol-Buffer class or a POJO class),
+and Beam will allow us to seamlessly convert between these types. Schemas also 
provide a simple way to reason about 
+types across different programming-language APIs.
+
+A `PCollection` with a schema does not need to have a `Coder` specified, as 
Beam knows how to encode and decode 
+Schema rows.
+
+### 6.2. Schemas for programming language types {#schemas-for-pl-types}
+While schemas themselves are language independent, they are designed to embed 
naturally into the programming languages
+of the Beam SDK being used. This allows Beam users to continue using native 
types while reaping the advantage of 
+having Beam understand their element schemas.
+ 
+ {:.language-java}
+ In Java you could use the following set of classes to represent the purchase 
schema.  Beam will automatically  
+ infer the correct schema based on the members of the class.
+
+```java
+@DefaultSchema(JavaBeanSchema.class)
+public class Purchase {
+  public String getUserId();  // Returns the id of the user who made the 
purchase.
+  public long getItemId();  // Returns the identifier of the item that was 
purchased.
+  public ShippingAddress getShippingAddress();  // Returns the shipping 
address, a nested type.
+  public long getCostCents();  // Returns the cost
+```

[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas

2020-02-29 Thread GitBox
reuvenlax commented on a change in pull request #10767: Document Beam Schemas
URL: https://github.com/apache/beam/pull/10767#discussion_r386038297
 
 

 ##
 File path: website/src/documentation/programming-guide.md
 ##
 @@ -1970,7 +1976,1076 @@ records.apply("WriteToText",
 See the [Beam-provided I/O Transforms]({{site.baseurl 
}}/documentation/io/built-in/)
 page for a list of the currently available I/O transforms.
 
-## 6. Data encoding and type safety {#data-encoding-and-type-safety}
+## 6. Schemas {#schemas}
+Often, the types of records being processed have an obvious structure. Common Beam sources produce
+JSON, Avro, Protocol Buffer, or database row objects; all of these types have well-defined structures,
+structures that can often be determined by examining the type. Even within a pipeline, simple Java POJOs
+(or equivalent structures in other languages) are often used as intermediate types, and these also have a
+clear structure that can be inferred by inspecting the class. By understanding the structure of a pipeline's
+records, we can provide much more concise APIs for data processing.
+ 
+### 6.1. What is a schema {#what-is-a-schema}
+Most structured records share some common characteristics: 
+* They can be subdivided into separate named fields. Fields usually have 
string names, but sometimes - as in the case of indexed
+ tuples - have numerical indices instead.
+* There is a limited set of primitive types that a field can have. These often match primitive types in most
+ programming languages: int, long, string, etc.
+* Often a field type can be marked as optional (sometimes referred to as 
nullable) or required.
+
+In addition, records often have a nested structure. A nested structure occurs when a field itself has subfields,
+so the type of the field itself has a schema. Array and map fields are also a common feature of these structured
+records.
+
+For example, consider the following schema, representing actions in a 
fictitious e-commerce company:
+
+**Purchase**
+
+| Field Name | Field Type |
+| --- | --- |
+| userId | STRING |
+| itemId | INT64 |
+| shippingAddress | ROW(ShippingAddress) |
+| cost | INT64 |
+| transactions | ARRAY[ROW(Transaction)] |
+
+**ShippingAddress**
+
+| Field Name | Field Type |
+| --- | --- |
+| streetAddress | STRING |
+| city | STRING |
+| state | nullable STRING |
+| country | STRING |
+| postCode | STRING |
+
+**Transaction**
+
+| Field Name | Field Type |
+| --- | --- |
+| bank | STRING |
+| purchaseAmount | DOUBLE |
+
+Purchase event records are represented by the above purchase schema. Each 
purchase event contains a shipping address, which
+is a nested row containing its own schema. Each purchase also contains a list 
of credit-card transactions 
+(a list, because a purchase might be split across multiple credit cards); each 
item in the transaction list is a row 
+with its own schema.
+
+This provides an abstract description of the types involved, one that is independent of any specific programming
+language.
+
+Schemas provide a type system for Beam records that is independent of any specific programming-language type. There
+might be multiple Java classes that all have the same schema (for example a Protocol Buffer class or a POJO class),
+and Beam will allow us to seamlessly convert between these types. Schemas also provide a simple way to reason about
+types across different programming-language APIs.
+
+A `PCollection` with a schema does not need to have a `Coder` specified, as 
Beam knows how to encode and decode 
+Schema rows.
+
+### 6.2. Schemas for programming language types {#schemas-for-pl-types}
 
 Review comment:
   hmmm this section is trying to get across the basic theory that schemas 
are _the_ types, and if multiple PL types have the same schema then they are 
interchangeable. The goal of this section isn't to be a reference, though we use 
some examples to illustrate the point. The other two sections you mention are 
meant to be more of a reference.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas

2020-02-29 Thread GitBox
reuvenlax commented on a change in pull request #10767: Document Beam Schemas
URL: https://github.com/apache/beam/pull/10767#discussion_r386038150
 
 

 ##
 File path: website/src/documentation/programming-guide.md
 ##
 @@ -1970,7 +1976,1076 @@ records.apply("WriteToText",
 See the [Beam-provided I/O Transforms]({{site.baseurl 
}}/documentation/io/built-in/)
 page for a list of the currently available I/O transforms.
 
-## 6. Data encoding and type safety {#data-encoding-and-type-safety}
+## 6. Schemas {#schemas}

[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas

2020-02-29 Thread GitBox
reuvenlax commented on a change in pull request #10767: Document Beam Schemas
URL: https://github.com/apache/beam/pull/10767#discussion_r386037819
 
 

 ##
 File path: website/src/documentation/programming-guide.md
 ##
 @@ -1970,7 +1976,1076 @@ records.apply("WriteToText",
 See the [Beam-provided I/O Transforms]({{site.baseurl 
}}/documentation/io/built-in/)
 page for a list of the currently available I/O transforms.
 
-## 6. Data encoding and type safety {#data-encoding-and-type-safety}
+## 6. Schemas {#schemas}
+Often, the type of records being processed have an obvious structure. Common 
Beam sources produce
+JSON, Avro, Protocol Buffer, or database row objects; all of these types have 
well defined structures, 
+structures that can often be determined by examining the type. Even within a 
pipeline, Simple Java POJOs 
 
 Review comment:
   I meant within a pipeline as that's where users create these intermediate 
objects - maybe SDK pipeline?




[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas

2020-02-29 Thread GitBox
reuvenlax commented on a change in pull request #10767: Document Beam Schemas
URL: https://github.com/apache/beam/pull/10767#discussion_r386037870
 
 

 ##
 File path: website/src/documentation/programming-guide.md
 ##
 @@ -1970,7 +1976,1076 @@ records.apply("WriteToText",
 See the [Beam-provided I/O Transforms]({{site.baseurl 
}}/documentation/io/built-in/)
 page for a list of the currently available I/O transforms.
 
-## 6. Data encoding and type safety {#data-encoding-and-type-safety}
+## 6. Schemas {#schemas}
+For example, consider the following schema, representing actions in a 
fictitious e-commerce company:
+
+**Purchase**
+
+| Field Name | Field Type |
+| --- | --- |
+| userId | STRING |
+| itemId | INT64 |
+| shippingAddress | ROW(ShippingAddress) |
+| cost | INT64 |
+| transactions | ARRAY[ROW(Transaction)] |
 
 Review comment:
   This is true for other tables in this doc as well (e.g. coders). Maybe we 
should figure out a way to fix all of them at once?




[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas

2020-02-29 Thread GitBox
reuvenlax commented on a change in pull request #10767: Document Beam Schemas
URL: https://github.com/apache/beam/pull/10767#discussion_r386037889
 
 

 ##
 File path: website/src/documentation/programming-guide.md
 ##
 @@ -1970,7 +1976,1076 @@ records.apply("WriteToText",
 See the [Beam-provided I/O Transforms]({{site.baseurl 
}}/documentation/io/built-in/)
 page for a list of the currently available I/O transforms.
 
-## 6. Data encoding and type safety {#data-encoding-and-type-safety}
+## 6. Schemas {#schemas}
+Purchase event records are represented by the aove purchase schema. Each 
purchase event contains a shipping address, which
 
 Review comment:
   done




[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas

2020-02-29 Thread GitBox
reuvenlax commented on a change in pull request #10767: Document Beam Schemas
URL: https://github.com/apache/beam/pull/10767#discussion_r386037750
 
 

 ##
 File path: website/src/documentation/programming-guide.md
 ##
 @@ -1970,7 +1976,1076 @@ records.apply("WriteToText",
 See the [Beam-provided I/O Transforms]({{site.baseurl 
}}/documentation/io/built-in/)
 page for a list of the currently available I/O transforms.
 
-## 6. Data encoding and type safety {#data-encoding-and-type-safety}
+## 6. Schemas {#schemas}
+Often, the type of records being processed have an obvious structure. Common 
Beam sources produce
+JSON, Avro, Protocol Buffer, or database row objects; all of these types have 
well defined structures, 
+structures that can often be determined by examining the type. Even within a 
pipeline, Simple Java POJOs 
+(or  equivalent structures in other languages) are often used as intermediate 
types, and these also have a
 
 Review comment:
   I think better to wait until we add Python schema docs




[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas

2020-02-29 Thread GitBox
reuvenlax commented on a change in pull request #10767: Document Beam Schemas
URL: https://github.com/apache/beam/pull/10767#discussion_r386037654
 
 

 ##
 File path: website/src/documentation/programming-guide.md
 ##
 @@ -1970,7 +1976,1076 @@ records.apply("WriteToText",
 See the [Beam-provided I/O Transforms]({{site.baseurl 
}}/documentation/io/built-in/)
 page for a list of the currently available I/O transforms.
 
-## 6. Data encoding and type safety {#data-encoding-and-type-safety}
+## 6. Schemas {#schemas}

[GitHub] [beam] reuvenlax commented on a change in pull request #10767: Document Beam Schemas

2020-02-29 Thread GitBox
reuvenlax commented on a change in pull request #10767: Document Beam Schemas
URL: https://github.com/apache/beam/pull/10767#discussion_r386037615
 
 

 ##
 File path: website/src/documentation/programming-guide.md
 ##
 @@ -1970,7 +1976,1076 @@ records.apply("WriteToText",
 See the [Beam-provided I/O Transforms]({{site.baseurl 
}}/documentation/io/built-in/)
 page for a list of the currently available I/O transforms.
 
-## 6. Data encoding and type safety {#data-encoding-and-type-safety}
+## 6. Schemas {#schemas}

[GitHub] [beam] steveniemitz commented on issue #10852: [BEAM-9308] Decorrelate state cleanup timers

2020-02-29 Thread GitBox
steveniemitz commented on issue #10852: [BEAM-9308] Decorrelate state cleanup 
timers
URL: https://github.com/apache/beam/pull/10852#issuecomment-592959333
 
 
   > maybe we need to explore the prioritization issue a bit more.
   
   Agreed, I think ideally the state cleanup timers would have a (much?) lower 
priority than everything else so they don't starve out more important "user" 
work.
   
   > Is this a blocker for you? If so, then maybe we can add a parameter to 
DataflowPipelineOptions to control this so we don't take the risk of changing 
the default behavior without more data.
   
   We run our own fork anyways, so it's not particularly a blocker here. 
I mostly just intended this PR as a conversation starter.
   
   I am curious about your comment above though ("We currently rely on the 
state cleanup timer for watermark holds").  From what I've observed in the 
code, the state cleanup is set for after the window end, so delaying it 
slightly more shouldn't cause any correctness issues, correct?
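
The decorrelation under discussion can be sketched (illustrative only; the names here are hypothetical, and this is not the actual Dataflow worker code) as adding bounded random jitter to a cleanup time that is never earlier than the window's expiry:

```python
import random


def decorrelated_cleanup_time(window_end_ms, allowed_lateness_ms,
                              max_jitter_ms=60_000, rng=random):
    """Pick a state-cleanup firing time at or after the window's expiry.

    The jitter only ever *delays* cleanup past window end + allowed
    lateness, so the 'cleanup fires after the window end' property is
    preserved while timers for many windows no longer fire in lockstep.
    """
    base = window_end_ms + allowed_lateness_ms
    return base + rng.randrange(0, max_jitter_ms)
```

Because the offset is strictly non-negative, correctness arguments that rely on cleanup firing after the window end are unaffected; only the spread of firing times changes.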
   
   




[GitHub] [beam] reuvenlax commented on issue #10852: [BEAM-9308] Decorrelate state cleanup timers

2020-02-29 Thread GitBox
reuvenlax commented on issue #10852: [BEAM-9308] Decorrelate state cleanup 
timers
URL: https://github.com/apache/beam/pull/10852#issuecomment-592958092
 
 
   I'm trying to think of a principled way to do this - maybe we need to 
explore the prioritization issue a bit more.
   
   Is this a blocker for you? If so, then maybe we can add a parameter to 
DataflowPipelineOptions to control this so we don't take the risk of changing 
the default behavior without more data.




[beam] branch master updated (6e69b26 -> 3b26ebf)

2020-02-29 Thread alexvanboxel
This is an automated email from the ASF dual-hosted git repository.

alexvanboxel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git.


from 6e69b26  Merge pull request #11001: [BEAM-9405] Fix post-commit error 
about create_job_service
 add 3b26ebf  Adds DisplayData for StateSpecs used by stateful ParDos

No new revisions were added by this update.

Summary of changes:
 .../apache/beam/sdk/coders/SerializableCoder.java  |  5 +++
 .../java/org/apache/beam/sdk/transforms/ParDo.java | 47 --
 .../org/apache/beam/sdk/transforms/ParDoTest.java  | 47 ++
 3 files changed, 96 insertions(+), 3 deletions(-)



[GitHub] [beam] alexvanboxel merged pull request #11004: Adds DisplayData for StateSpecs used by stateful ParDos

2020-02-29 Thread GitBox
alexvanboxel merged pull request #11004: Adds DisplayData for StateSpecs used 
by stateful ParDos
URL: https://github.com/apache/beam/pull/11004
 
 
   




[GitHub] [beam] chadrik opened a new pull request #11007: [BEAM-7746] Runtime change to timestamp/duration equality

2020-02-29 Thread GitBox
chadrik opened a new pull request #11007: [BEAM-7746] Runtime change to 
timestamp/duration equality
URL: https://github.com/apache/beam/pull/11007
 
 
   mypy *strongly* recommends that "__eq__" work with arbitrary objects, by 
returning `NotImplemented`
   
   This PR prevents the following mypy errors:
   
   ```
   apache_beam/utils/timestamp.py:193: error: Argument 1 of "__eq__" is incompatible with supertype "object"; supertype defines the argument type as "object"  [override]
   apache_beam/utils/timestamp.py:193: note: It is recommended for "__eq__" to work with arbitrary objects, for example:
   apache_beam/utils/timestamp.py:193: note:     def __eq__(self, other: object) -> bool:
   apache_beam/utils/timestamp.py:193: note:         if not isinstance(other, Timestamp):
   apache_beam/utils/timestamp.py:193: note:             return NotImplemented
   apache_beam/utils/timestamp.py:193: note:         return 
   apache_beam/utils/timestamp.py:346: error: Argument 1 of "__eq__" is incompatible with supertype "object"; supertype defines the argument type as "object"  [override]
   apache_beam/utils/timestamp.py:346: note: It is recommended for "__eq__" to work with arbitrary objects, for example:
   apache_beam/utils/timestamp.py:346: note:     def __eq__(self, other: object) -> bool:
   apache_beam/utils/timestamp.py:346: note:         if not isinstance(other, Duration):
   apache_beam/utils/timestamp.py:346: note:             return NotImplemented
   apache_beam/utils/timestamp.py:346: note:         return 
   ```
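
The pattern mypy suggests can be seen end-to-end in a small standalone sketch. `Moment` below is an invented stand-in for a timestamp-like value, not Beam's actual `Timestamp` class:

```python
# Minimal sketch of the __eq__ pattern mypy recommends: accept an arbitrary
# object, and return NotImplemented (not False, and without raising) when the
# other operand is of an unrelated type.

class Moment:
    def __init__(self, micros: int) -> None:
        self.micros = micros

    def __eq__(self, other: object) -> bool:
        # Returning NotImplemented lets Python try the reflected comparison
        # on `other` before falling back to the default (identity) answer.
        if not isinstance(other, Moment):
            return NotImplemented
        return self.micros == other.micros

    def __hash__(self) -> int:
        # Defining __eq__ suppresses the inherited __hash__, so restore it.
        return hash(self.micros)

print(Moment(5) == Moment(5))        # True
print(Moment(5) == "not a moment")   # False: both sides return
                                     # NotImplemented, so Python falls back
                                     # to identity comparison
```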
   
   I wrote an equality test in `timestamp_test.py` a while back to cover this case.
   
   For reference, it looks like this:
   
   ```python
   def test_equality(self):
     for min_val in (Timestamp(1), Duration(1), 1, 1.1):
       for max_val in (Timestamp(123), Duration(123), 123, 123.4):
         self.assertTrue(min_val < max_val, "%s < %s" % (min_val, max_val))
         self.assertTrue(min_val <= max_val, "%s <= %s" % (min_val, max_val))
         self.assertTrue(max_val > min_val, "%s > %s" % (max_val, min_val))
         self.assertTrue(max_val >= min_val, "%s >= %s" % (max_val, min_val))
   ```
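
The reason mypy insists on `NotImplemented` rather than `False` is Python's reflected-comparison protocol: when the left operand's `__eq__` returns `NotImplemented`, Python gives the right operand a chance to answer. A sketch with two invented types (not Beam's `Timestamp`/`Duration`):

```python
# Two toy duration types. Seconds knows nothing about Millis, but because its
# __eq__ returns NotImplemented for unknown types, Python consults
# Millis.__eq__, which does know how to compare the two.

class Seconds:
    def __init__(self, s: int) -> None:
        self.s = s

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Seconds):
            return NotImplemented
        return self.s == other.s

    def __hash__(self) -> int:
        return hash(self.s)

class Millis:
    def __init__(self, ms: int) -> None:
        self.ms = ms

    def __eq__(self, other: object) -> bool:
        if isinstance(other, Seconds):
            return self.ms == other.s * 1000
        if isinstance(other, Millis):
            return self.ms == other.ms
        return NotImplemented

    def __hash__(self) -> int:
        return hash(self.ms)

# Seconds.__eq__ returns NotImplemented for a Millis argument, so Python
# tries the reflected Millis.__eq__, which succeeds:
print(Seconds(2) == Millis(2000))  # True
```

Had `Seconds.__eq__` returned `False` for unknown types, the reflected comparison would never run and `Seconds(2) == Millis(2000)` would wrongly be `False`.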
   
   
   
   Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
   
 - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`).
 - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
 - [ ] Update `CHANGES.md` with noteworthy changes.
 - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make the review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   
   Post-Commit Tests Status (on master branch)
   

   
   Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
   --- | --- | --- | --- | --- | --- | --- | ---
   Go | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/)
   Java | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostComm