This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a commit to branch mergebot
in repository https://gitbox.apache.org/repos/asf/beam-site.git
commit 292e22c1fc683128810a00ed937fdd2f73427234
Author: mingmxu <ming...@ebay.com>
AuthorDate: Mon Aug 21 09:16:36 2017 -0700

    remove post page, move internal implementation to the end of page
---
 src/_data/authors.yml            |  4 --
 src/_posts/2017-07-21-sql-dsl.md | 64 --------------------------
 src/documentation/dsls/sql.md    | 98 +++++++++++++++++++++-------------------
 3 files changed, 51 insertions(+), 115 deletions(-)

diff --git a/src/_data/authors.yml b/src/_data/authors.yml
index 1b76ddb..d8cd836 100644
--- a/src/_data/authors.yml
+++ b/src/_data/authors.yml
@@ -28,10 +28,6 @@ klk:
     name: Kenneth Knowles
     email: k...@apache.org
     twitter: KennKnowles
-mingmxu:
-    name: Mingmin Xu
-    email: mingm...@gmail.com
-    twitter:
 robertwb:
     name: Robert Bradshaw
     email: rober...@apache.org

diff --git a/src/_posts/2017-07-21-sql-dsl.md b/src/_posts/2017-07-21-sql-dsl.md
deleted file mode 100644
index 95df3f8..0000000
--- a/src/_posts/2017-07-21-sql-dsl.md
+++ /dev/null
@@ -1,64 +0,0 @@
----
-layout: post
-title: "Use Beam SQL DSL to build a pipeline"
-date: 2017-07-21 00:00:00 -0800
-excerpt_separator: <!--more-->
-categories: blog
-authors:
-  - mingmxu
----
-
-The Beam SQL DSL provides the capability to execute standard SQL queries using the Beam Java SDK. It packages the back-end parsing/validation/assembling features and delivers an SDK-style API for developers to express processing logic with SQL statements, from a simple TABLE_FILTER to complex queries containing JOIN/GROUP_BY etc.
-
-<!--more-->
-
-# <a name="overview"></a>1. Overview
-SQL is a well-adopted standard for processing data with concise syntax. With the DSL APIs, `PCollection`s can now be queried with standard SQL statements, like a regular table. The DSL APIs leverage [Apache Calcite](http://calcite.apache.org/) to parse and optimize SQL queries, then translate them into a _composite_ Beam `PTransform`. In this way, both SQL and normal Beam `PTransform`s can be mixed in one pipeline.
-
-# <a name="usage"></a>2. Usage of DSL APIs
-`BeamSql` is the only interface developers need (with two methods, `BeamSql.query()` and `BeamSql.simpleQuery()`). It wraps the back-end details of parsing/validation/assembling and delivers a Beam SDK-style API that can take either simple TABLE_FILTER queries or complex queries containing JOIN/GROUP_BY etc.
-
-*Note*: the two methods are equivalent in functionality. `BeamSql.query()` applies to a `PCollectionTuple` with one or many input `PCollection`s; `BeamSql.simpleQuery()` is a simplified API which applies to a single `PCollection`.
-
-[BeamSqlExample](https://github.com/apache/beam/blob/DSL_SQL/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/example/BeamSqlExample.java) in the code repository shows the usage of both APIs:
-
-```
-//Step 1. create a source PCollection with Create.of();
-BeamRecordSqlType type = BeamRecordSqlType.create(fieldNames, fieldTypes);
-...
-
-PCollection<BeamRecord> inputTable = PBegin.in(p).apply(Create.of(row...)
-    .withCoder(type.getRecordCoder()));
-
-//Step 2. (Case 1) run a simple SQL query over the input PCollection with BeamSql.simpleQuery;
-PCollection<BeamRecord> outputStream = inputTable.apply(
-    BeamSql.simpleQuery("select c1, c2, c3 from PCOLLECTION where c1=1"));
-
-//Step 2. (Case 2) run a query with BeamSql.query over the result PCollection of (Case 1);
-PCollection<BeamRecord> outputStream2 =
-    PCollectionTuple.of(new TupleTag<BeamRecord>("CASE1_RESULT"), outputStream)
-        .apply(BeamSql.query("select c2, sum(c3) from CASE1_RESULT group by c2"));
-```
-
-In Step 1, a `PCollection<BeamRecord>` is prepared as the source dataset. The work to generate a queryable `PCollection<BeamRecord>` is beyond the scope of the Beam SQL DSL.
-
-Step 2 (Case 1) shows how to run a query with `BeamSql.simpleQuery()`; be aware that the input `PCollection` is referenced with the fixed table name __PCOLLECTION__. Step 2 (Case 2) is another example, running a query with `BeamSql.query()`; a table name is specified when adding the `PCollection` to the `PCollectionTuple`. Since each call of either `BeamSql.query()` or `BeamSql.simpleQuery()` has its own schema repository, developers need to include all `PCollection`s that are used in the query.
-
-# <a name="functionality"></a>3. Functionality in Beam SQL
-Just as Beam provides a unified model for both bounded and unbounded data, the SQL DSL provides the same functionality for bounded and unbounded `PCollection`s.
-
-Note that SQL support is not yet complete. Queries that include unsupported features cause an `UnsupportedOperationException`.
-
-## <a name="features"></a>3.1. Supported Features
-The following features are supported in the current repository:
-
-1. filter clauses;
-2. data field projections;
-3. aggregations (global_window, fixed_window, sliding_window, session_window);
-4. join (inner, left_outer, right_outer);
-5. built-in SQL functions;
-6. User Defined Functions (UDF) and User Defined Aggregate Functions (UDAF).
-
-For a deeper dive, please visit [DSLs: SQL]({{ site.baseurl }}/documentation/dsls/sql/)
-
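A note on the overview's claim that SQL and normal Beam `PTransform`s can be mixed in one pipeline: the SQL result is an ordinary `PCollection<BeamRecord>`, so any downstream transform can consume it. A minimal sketch of that hand-off (the `DoFn` body is hypothetical, for illustration only):

```
//run a SQL query, then continue with ordinary Beam transforms;
//the query result is a plain PCollection<BeamRecord>
PCollection<BeamRecord> filtered = inputTable.apply(
    BeamSql.simpleQuery("select c1, c2, c3 from PCOLLECTION where c1=1"));

filtered.apply(ParDo.of(new DoFn<BeamRecord, String>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    //toString() is used here to stay agnostic of the exact field accessors
    c.output(c.element().toString());
  }
}));
```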
diff --git a/src/documentation/dsls/sql.md b/src/documentation/dsls/sql.md
index bfcbd1c..4f25fe4 100644
--- a/src/documentation/dsls/sql.md
+++ b/src/documentation/dsls/sql.md
@@ -5,55 +5,23 @@ permalink: /documentation/dsls/sql/
 ---
 
 * [1. Overview](#overview)
-* [2. The Internals of Beam SQL](#internal-of-sql)
-* [3. Usage of DSL APIs](#usage)
-* [4. Functionality in Beam SQL](#functionality)
-  * [4.1. Supported Features](#features)
-  * [4.2. Intro of BeamSqlRow](#beamsqlrow)
-  * [4.3. Data Types](#data-type)
-  * [4.4. built-in SQL functions](#built-in-functions)
+* [2. Usage of DSL APIs](#usage)
+* [3. Functionality in Beam SQL](#functionality)
+  * [3.1. Supported Features](#features)
+  * [3.2. Intro of BeamRecord](#beamsqlrow)
+  * [3.3. Data Types](#data-type)
+  * [3.4. Built-in SQL functions](#built-in-functions)
+* [4. The Internals of Beam SQL](#internal-of-sql)
 
 This page describes the implementation of Beam SQL, and how to simplify a Beam pipeline with the DSL APIs.
 
+> Note: Beam SQL hasn't been merged to the master branch yet (it is being developed on the [DSL_SQL](https://github.com/apache/beam/tree/DSL_SQL) branch), but is coming soon.
+
 # <a name="overview"></a>1. Overview
 SQL is a well-adopted standard for processing data with concise syntax. With the DSL APIs, `PCollection`s can now be queried with standard SQL statements, like a regular table. The DSL APIs leverage [Apache Calcite](http://calcite.apache.org/) to parse and optimize SQL queries, then translate them into a composite Beam `PTransform`. In this way, SQL and normal Beam `PTransform`s can be mixed in the same pipeline.
 
-# <a name="internal-of-sql"></a>2. The Internals of Beam SQL
-Figure 1 describes the back-end steps from a SQL statement to a Beam `PTransform`.
-
-![Workflow of Beam SQL DSL]({{ "/images/beam_sql_dsl_workflow.png" | prepend: site.baseurl }} "workflow of Beam SQL DSL")
-
-**Figure 1** workflow of Beam SQL DSL
-
-Given a `PCollection` and a query as input, the input `PCollection` is first registered as a table in the schema repository. Then it is processed as follows:
-
-1. The SQL query is parsed according to the grammar to generate a SQL abstract syntax tree;
-2. The tree is validated against the table schema, producing a logical plan expressed in relational algebra;
-3. Relational rules convert the logical plan into a physical plan expressed with Beam components; an optional optimizer may update the plan;
-4. Finally, the Beam physical plan is compiled into a composite `PTransform`.
-
-Here is an example of a query that filters and projects from an input `PCollection`:
-
-```
-SELECT USER_ID, USER_NAME FROM PCOLLECTION WHERE USER_ID = 1
-```
-
-The logical plan is shown as:
-
-```
-LogicalProject(USER_ID=[$0], USER_NAME=[$1])
-  LogicalFilter(condition=[=($0, 1)])
-  LogicalTableScan(table=[[PCOLLECTION]])
-```
-
-And it is compiled into a composite `PTransform`:
-
-```
-pCollection.apply(BeamSqlFilter...)
-  .apply(BeamSqlProject...)
-```
-
-# <a name="usage"></a>3. Usage of DSL APIs
+# <a name="usage"></a>2. Usage of DSL APIs
 `BeamSql` is the only interface developers need (with two methods, `BeamSql.query()` and `BeamSql.simpleQuery()`). It wraps the back-end details of parsing/validation/assembling and delivers a Beam SDK-style API that can take either simple TABLE_FILTER queries or complex queries containing JOIN/GROUP_BY etc.
 
 *Note*: the two methods are equivalent in functionality. `BeamSql.query()` applies to a `PCollectionTuple` with one or many input `PCollection`s; `BeamSql.simpleQuery()` is a simplified API which applies to a single `PCollection`.
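Since each `BeamSql.query()` call has its own schema repository, a query over two tables must register both `PCollection`s in the input `PCollectionTuple`. A sketch of the two-table case (the table tags, field names and join query are assumptions, for illustration only):

```
//register every PCollection referenced in the query, each under the
//table name that the SQL statement uses
PCollection<BeamRecord> users = ...;  //assumed schema: USER_ID, USER_NAME
PCollection<BeamRecord> orders = ...; //assumed schema: USER_ID, AMOUNT

PCollection<BeamRecord> result =
    PCollectionTuple.of(new TupleTag<BeamRecord>("USERS"), users)
        .and(new TupleTag<BeamRecord>("ORDERS"), orders)
        .apply(BeamSql.query(
            "select USERS.USER_NAME, sum(ORDERS.AMOUNT) "
            + "from ORDERS join USERS on ORDERS.USER_ID = USERS.USER_ID "
            + "group by USERS.USER_NAME"));
```

An inner join like this falls under the supported-features list below; only inner, left_outer and right_outer joins are available.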
@@ -83,12 +51,12 @@ In Step 1, a `PCollection<BeamRecord>` is prepared as the source dataset. The wo
 
 Step 2 (Case 1) shows how to run a query with `BeamSql.simpleQuery()`; be aware that the input `PCollection` is referenced with the fixed table name __PCOLLECTION__. Step 2 (Case 2) is another example, running a query with `BeamSql.query()`; a table name is specified when adding the `PCollection` to the `PCollectionTuple`. Since each call of either `BeamSql.query()` or `BeamSql.simpleQuery()` has its own schema repository, developers need to include all `PCollection`s that are used in the query.
 
-# <a name="functionality"></a>4. Functionality in Beam SQL
+# <a name="functionality"></a>3. Functionality in Beam SQL
 Just as Beam provides a unified model for both bounded and unbounded data, the SQL DSL provides the same functionality for bounded and unbounded `PCollection`s.
 
 Note that SQL support is not yet complete. Queries that include unsupported features cause an `UnsupportedOperationException`.
 
-## <a name="features"></a>4.1. Supported Features
+## <a name="features"></a>3.1. Supported Features
 The following features are supported in the current repository:
 
 **1. filter clauses;**
@@ -217,7 +185,7 @@ PCollection<BeamSqlRow> result =
     BeamSql.simpleQuery(sql).withUdaf("squaresum", new SquareSum()));
 ```
 
-## <a name="beamsqlrow"></a>4.2. Intro of BeamRecord
+## <a name="beamsqlrow"></a>3.2. Intro of BeamRecord
 `BeamRecord`, described by `BeamRecordType` (extended as `BeamRecordSqlType` in Beam SQL) and encoded/decoded by `BeamRecordCoder`, represents a single, immutable row in a Beam SQL `PCollection`. Similar to a _row_ in a relational database, each `BeamRecord` consists of named columns with fixed types (see [3.3. Data Types](#data-type)).
 
 A Beam SQL `PCollection` can be created from an external source, from in-memory data, or derived from another SQL query. For `PCollection`s from an external source or in-memory data, the coder must be specified explicitly; a `PCollection` derived from a SQL query has the coder set already. Below is one example:
@@ -234,7 +202,7 @@ PCollection<BeamRecord> inputTable = PBegin.in(p).apply(Create.of(row)
     .withCoder(type.getRecordCoder()));
 ```
 
-## <a name="data-type"></a>4.3. Data Types
+## <a name="data-type"></a>3.3. Data Types
 Each type in Beam SQL maps to a Java class that holds the value in a `BeamRecord`. The following table lists the mapping between SQL types and Java classes supported in the current repository:
 
 | SQL Type | Java class |
@@ -250,7 +218,7 @@
 | Types.TIMESTAMP | java.util.Date |
 {:.table}
 
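To make the mapping concrete, here is a sketch of declaring a record type that mixes a few of the types above, using the `BeamRecordSqlType.create()` factory from the earlier example (the field names are illustrative, `Types` refers to `java.sql.Types`, and it is assumed, as in that example, that the factory takes parallel lists of names and type constants):

```
//each java.sql.Types constant pairs with the Java class from the table;
//a value stored in that BeamRecord column must use exactly that class
List<String> fieldNames = Arrays.asList("f_int", "f_string", "f_timestamp");
List<Integer> fieldTypes = Arrays.asList(
    Types.INTEGER, Types.VARCHAR, Types.TIMESTAMP);
BeamRecordSqlType rowType = BeamRecordSqlType.create(fieldNames, fieldTypes);
```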
-## <a name="built-in-functions"></a>4.4. built-in SQL functions
+## <a name="built-in-functions"></a>3.4. Built-in SQL functions
 Beam SQL implements many of the built-in functions defined in [Apache Calcite](http://calcite.apache.org). The available functions are listed below:
@@ -375,3 +343,39 @@
 | MAX(value) | Returns the maximum value of value across all input values |
 | MIN(value) | Returns the minimum value of value across all input values |
 {:.table}
+
+# <a name="internal-of-sql"></a>4. The Internals of Beam SQL
+Figure 1 describes the back-end steps from a SQL statement to a Beam `PTransform`.
+
+![Workflow of Beam SQL DSL]({{ "/images/beam_sql_dsl_workflow.png" | prepend: site.baseurl }} "workflow of Beam SQL DSL")
+
+**Figure 1** workflow of Beam SQL DSL
+
+Given a `PCollection` and a query as input, the input `PCollection` is first registered as a table in the schema repository. Then it is processed as follows:
+
+1. The SQL query is parsed according to the grammar to generate a SQL abstract syntax tree;
+2. The tree is validated against the table schema, producing a logical plan expressed in relational algebra;
+3. Relational rules convert the logical plan into a physical plan expressed with Beam components; an optional optimizer may update the plan;
+4. Finally, the Beam physical plan is compiled into a composite `PTransform`.
+
+Here is an example of a query that filters and projects from an input `PCollection`:
+
+```
+SELECT USER_ID, USER_NAME FROM PCOLLECTION WHERE USER_ID = 1
+```
+
+The logical plan is shown as:
+
+```
+LogicalProject(USER_ID=[$0], USER_NAME=[$1])
+  LogicalFilter(condition=[=($0, 1)])
+  LogicalTableScan(table=[[PCOLLECTION]])
+```
+
+And it is compiled into a composite `PTransform`:
+
+```
+pCollection.apply(BeamSqlFilter...)
+  .apply(BeamSqlProject...)
+```

--
To stop receiving notification emails like this one, please contact
"commits@beam.apache.org" <commits@beam.apache.org>.