For now we have a generic schema interface. JSON-B can be one implementation, Avro could be another one.
Regards
JB

On 26 Apr 2018, at 12:08, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>Hmm,
>
>Avro still has the pitfall of an uncontrolled stack which brings way too
>many dependencies to be part of any API; this is why I proposed a JSON-P
>based API (JsonObject) with a custom Beam entry for some metadata
>(headers "à la Camel").
>
>
>Romain Manni-Bucau
>@rmannibucau <https://twitter.com/rmannibucau> | Blog
><https://rmannibucau.metawerx.net/> | Old Blog
><http://rmannibucau.wordpress.com> | Github
><https://github.com/rmannibucau> | LinkedIn
><https://www.linkedin.com/in/rmannibucau> | Book
><https://www.packtpub.com/application-development/java-ee-8-high-performance>
>
>2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
>
>> Hi Ismaël,
>>
>> You mean directly in Beam SQL?
>>
>> That will be part of schema support: generic record could be one of the
>> payloads with a schema.
>>
>> Regards
>> JB
>>
>> On 26 Apr 2018, at 11:39, "Ismaël Mejía" <ieme...@gmail.com> wrote:
>>>
>>> Hello Anton,
>>>
>>> Thanks for the descriptive email and the really useful work. Any plans
>>> to tackle PCollections of GenericRecord/IndexedRecords? It seems Avro
>>> is a natural fit for this approach too.
>>>
>>> Regards,
>>> Ismaël
>>>
>>> On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin <ke...@google.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I want to highlight a couple of improvements to Beam SQL we have been
>>>> working on recently which are targeted at making the Beam SQL API
>>>> easier to use. Specifically, these features simplify conversion of
>>>> Java Beans and JSON strings to Rows.
>>>>
>>>> Feel free to try this and send any bugs/comments/PRs my way.
>>>>
>>>> **Caveat: this is still work in progress, and has known bugs and
>>>> incomplete features, see below for details.**
>>>>
>>>> Background
>>>>
>>>> Beam SQL queries can only be applied to PCollection<Row>. This means
>>>> that users need to convert whatever PCollection elements they have to
>>>> Rows before querying them with SQL. This usually requires manually
>>>> creating a Schema and implementing a custom conversion
>>>> PTransform<PCollection<Element>, PCollection<Row>> (see Beam SQL
>>>> Guide).
>>>>
>>>> The improvements described here are an attempt to reduce this
>>>> overhead for a few common cases, as a start.
>>>>
>>>> Status
>>>>
>>>> - Introduced an InferredRowCoder to automatically generate Rows from
>>>>   beans. Removes the need to manually define a Schema and Row
>>>>   conversion logic;
>>>> - Introduced a JsonToRow transform to automatically parse JSON
>>>>   objects to Rows. Removes the need to manually implement conversion
>>>>   logic;
>>>> - This is still experimental work in progress, APIs will likely
>>>>   change;
>>>> - There are known bugs/unsolved problems.
>>>>
>>>> Java Beans
>>>>
>>>> Introduced a coder which facilitates Row generation from Java Beans.
>>>> Reduces the overhead to:
>>>>
>>>>> /** Some user-defined Java Bean */
>>>>> class JavaBeanObject implements Serializable {
>>>>>   String getName() { ... }
>>>>> }
>>>>>
>>>>> // Obtain the objects:
>>>>> PCollection<JavaBeanObject> javaBeans = ...;
>>>>>
>>>>> // Convert to Rows and apply a SQL query:
>>>>> PCollection<Row> queryResult =
>>>>>     javaBeans
>>>>>         .setCoder(InferredRowCoder.ofSerializable(JavaBeanObject.class))
>>>>>         .apply(BeamSql.query("SELECT name FROM PCOLLECTION"));
>>>>
>>>> Notice, there is no more manual Schema definition or custom
>>>> conversion logic.
>>>>
>>>> Links
>>>>
>>>> - example;
>>>> - InferredRowCoder;
>>>> - test;
>>>>
>>>> JSON
>>>>
>>>> Introduced a JsonToRow transform. It is possible to query a
>>>> PCollection<String> that contains JSON objects like this:
>>>>
>>>>> // Assuming JSON objects look like this:
>>>>> // { "type" : "foo", "size" : 333 }
>>>>>
>>>>> // Define a Schema:
>>>>> Schema jsonSchema =
>>>>>     Schema
>>>>>         .builder()
>>>>>         .addStringField("type")
>>>>>         .addInt32Field("size")
>>>>>         .build();
>>>>>
>>>>> // Obtain a PCollection of the objects in JSON format:
>>>>> PCollection<String> jsonObjects = ...;
>>>>>
>>>>> // Convert to Rows and apply a SQL query:
>>>>> PCollection<Row> queryResults =
>>>>>     jsonObjects
>>>>>         .apply(JsonToRow.withSchema(jsonSchema))
>>>>>         .apply(BeamSql.query(
>>>>>             "SELECT type, AVG(size) FROM PCOLLECTION GROUP BY type"));
>>>>
>>>> Notice, the JSON to Row conversion is done by the JsonToRow
>>>> transform. It is currently required to supply a Schema.
>>>>
>>>> Links
>>>>
>>>> - JsonToRow;
>>>> - test/example;
>>>>
>>>> Going Forward
>>>>
>>>> - fix bugs (BEAM-4163, BEAM-4161, ...);
>>>> - implement more features (BEAM-4167, more types of objects);
>>>> - wire this up with sources/sinks to further simplify the SQL API.
>>>>
>>>> Thank you,
>>>> Anton
>>>
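[Editor's note] The getter-based inference that InferredRowCoder performs on a bean can be illustrated with a small stdlib-only sketch using java.beans.Introspector. This is not Beam's actual implementation; the class `BeanSchemaSketch` and method `inferSchema` are invented names for the example, and a real Row coder does much more (typed fields, encoding, nested types).

```java
import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.io.Serializable;
import java.util.LinkedHashMap;
import java.util.Map;

public class BeanSchemaSketch {

    /** Example bean, mirroring JavaBeanObject from the thread. */
    public static class JavaBeanObject implements Serializable {
        private final String name;
        public JavaBeanObject(String name) { this.name = name; }
        public String getName() { return name; }
    }

    /** Infer a field-name -> field-type mapping from a bean's getters. */
    public static Map<String, Class<?>> inferSchema(Class<?> beanClass) throws Exception {
        Map<String, Class<?>> schema = new LinkedHashMap<>();
        // Stop at Object.class so inherited members like getClass() are excluded.
        BeanInfo info = Introspector.getBeanInfo(beanClass, Object.class);
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            if (pd.getReadMethod() != null) {
                schema.put(pd.getName(), pd.getPropertyType());
            }
        }
        return schema;
    }

    public static void main(String[] args) throws Exception {
        // prints {name=class java.lang.String}
        System.out.println(inferSchema(JavaBeanObject.class));
    }
}
```

In this toy form, the "schema" is just a name-to-type map derived from the getters, which is the overhead the coder removes: the user never writes it by hand.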
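[Editor's note] The contract of JsonToRow (schema in, JSON string in, row out) can likewise be sketched without Beam. The sketch below is a deliberately naive toy: it only handles flat objects with string and integer fields, parsed with regular expressions, and `JsonToRowSketch`/`toRow` are invented names. The real transform uses a proper JSON parser and Beam's Schema/Row types.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonToRowSketch {

    /**
     * Toy conversion of a flat JSON object like { "type" : "foo", "size" : 333 }
     * into a row, driven by a field-name -> field-type schema.
     */
    public static Map<String, Object> toRow(Map<String, Class<?>> schema, String json) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (Map.Entry<String, Class<?>> field : schema.entrySet()) {
            // Match "fieldName" : "string" or "fieldName" : integer.
            Pattern p = Pattern.compile(
                "\"" + field.getKey() + "\"\\s*:\\s*(\"([^\"]*)\"|(-?\\d+))");
            Matcher m = p.matcher(json);
            if (!m.find()) {
                throw new IllegalArgumentException("missing field: " + field.getKey());
            }
            if (field.getValue() == String.class) {
                row.put(field.getKey(), m.group(2));
            } else {
                row.put(field.getKey(), Integer.parseInt(m.group(3)));
            }
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Class<?>> schema = new LinkedHashMap<>();
        schema.put("type", String.class);
        schema.put("size", Integer.class);
        // prints {type=foo, size=333}
        System.out.println(toRow(schema, "{ \"type\" : \"foo\", \"size\" : 333 }"));
    }
}
```

The point of the sketch is why the Schema is currently required: the converter needs the declared field types to know how to interpret each JSON value.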