"CREATE TABLE" can be used to indicate if a table does not exist, BeamSQL
will help create it in storage systems if allowed, while "CREATE EXTERNAL
TABLE" can be used only for registering a table, no matter if the table
exists or not. BeamSQL provides a finer-grained way to distinct
different behaviours.

In both cases BeamSQL does not store the table. Another approach is to
leverage the options/table property to specify the expected behaviour.
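For illustration, a hypothetical property on the existing CREATE
EXTERNAL TABLE syntax could express that expected behaviour (the
"createIfNotExists" property below is a sketch, not an existing Beam
option; the project and dataset names are placeholders):

CREATE EXTERNAL TABLE orders (id INTEGER, amount DOUBLE)
TYPE bigquery
LOCATION 'gcp-project:my_dataset.orders'
TBLPROPERTIES '{"createIfNotExists": "true"}'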


-Rui

On Thu, Mar 5, 2020 at 10:55 AM Andrew Pilloud <apill...@google.com> wrote:

> I'm not following the "CREATE TABLE" vs "CREATE EXTERNAL TABLE"
> distinction. We added the "EXTERNAL" to make it clear that Beam wasn't
> storing the table. Most of our current table providers will create the
> underlying table as needed.
>
> Andrew
>
> On Thu, Mar 5, 2020 at 10:47 AM Rui Wang <ruw...@google.com> wrote:
>
>> There are two new pieces in this proposal:
>> 1. A Spanner source in SQL. (Contributions welcome!)
>> 2. A CREATE TABLE statement rather than CREATE EXTERNAL TABLE (the
>> difference is whether the table is assumed to exist or not)
>>
>>
>> There is a table property in the statement already that you can reuse to
>> save your options.
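>> For example, today's CREATE EXTERNAL TABLE already accepts a
>> TBLPROPERTIES JSON blob you could reuse (adapted from the Beam Pub/Sub
>> table provider docs; the topic, payload fields, and attribute name are
>> placeholders):
>>
>> CREATE EXTERNAL TABLE pubsub_messages (
>>   event_timestamp TIMESTAMP,
>>   attributes MAP<VARCHAR, VARCHAR>,
>>   payload ROW<card_number INTEGER, amount DOUBLE>
>> )
>> TYPE pubsub
>> LOCATION 'projects/gcp-project/topics/card-transactions'
>> TBLPROPERTIES '{"timestampAttributeKey": "ts"}'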
>>
>>
>> -Rui
>>
>> On Thu, Mar 5, 2020 at 2:30 AM Taher Koitawala <taher...@gmail.com>
>> wrote:
>>
>>> Also, auto-creation is not there.
>>>
>>> On Thu, Mar 5, 2020 at 3:59 PM Taher Koitawala <taher...@gmail.com>
>>> wrote:
>>>
>>>> The proposal is to add more sources and also to add event-time and
>>>> processing-time enhancements on top of them.
>>>>
>>>> On Thu, Mar 5, 2020 at 3:50 PM Andrew Pilloud <apill...@google.com>
>>>> wrote:
>>>>
>>>>> I believe we have this functionality already:
>>>>> https://beam.apache.org/documentation/dsls/sql/extensions/create-external-table/
>>>>>
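>>>>> For instance (adapted from that page; the project and dataset names
>>>>> are placeholders), a BigQuery table can be registered today with:
>>>>>
>>>>> CREATE EXTERNAL TABLE TRANSACTION_HISTORY (
>>>>>   card_number INTEGER,
>>>>>   amount DOUBLE
>>>>> )
>>>>> TYPE bigquery
>>>>> LOCATION 'gcp-project:dataset1.table1'
>>>>>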
>>>>> Existing GCP tables can also be loaded through the GCP datacatalog
>>>>> metastore. What are you proposing that is new?
>>>>>
>>>>> Andrew
>>>>>
>>>>>
>>>>> On Thu, Mar 5, 2020, 12:29 AM Taher Koitawala <taher...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>          We have been using Apache Beam extensively to process huge
>>>>>> amounts of data. While Beam is really powerful and can solve a huge
>>>>>> number of use cases, a Beam job's development and testing time is
>>>>>> significantly high.
>>>>>>
>>>>>>    This gap can be filled with Beam SQL, where a complete SQL-based
>>>>>> interface can reduce development and testing time to a matter of
>>>>>> minutes. It also makes Apache Beam more user friendly, letting a wide
>>>>>> variety of audiences with different analytical skillsets interact
>>>>>> with it.
>>>>>>
>>>>>> The current Beam SQL still needs to be used programmatically, so I
>>>>>> propose the following additions/improvements.
>>>>>>
>>>>>> *Note: Whilst the examples below are GCP-biased, they apply to other
>>>>>> sources in a generic manner.*
>>>>>>
>>>>>> For example: imagine a user who wants to write a stream-processing
>>>>>> job on Google Cloud Dataflow. The user wants to process credit card
>>>>>> transaction streams from Google Cloud Pub/Sub (something like Kafka)
>>>>>> and enrich each record of the stream with data stored in Google Cloud
>>>>>> Spanner; after enrichment, the user wishes to write the enriched data
>>>>>> to Google Cloud BigQuery.
>>>>>>
>>>>>> Given below are the queries the user should be able to fire on Beam;
>>>>>> the rest should be handled automatically by the framework.
>>>>>>
>>>>>> //Infer schema from the Spanner table upon table creation
>>>>>>
>>>>>> CREATE TABLE SPANNER_CARD_INFO
>>>>>> OPTIONS (
>>>>>>   ProjectId: "gcp-project",
>>>>>>   InstanceId: "spanner-instance-id",
>>>>>>   Database: "some-database",
>>>>>>   Table: "card_info",
>>>>>>   CloudResource: "SPANNER",
>>>>>>   CreateTableIfNotExists: "FALSE"
>>>>>> )
>>>>>> //Apply schema to each record read from Pub/Sub, and then apply SQL.
>>>>>>
>>>>>> CREATE TABLE TRANSACTIONS_PUBSUB_TOPIC
>>>>>> OPTIONS (
>>>>>>   ProjectId: "gcp-project",
>>>>>>   Topic: "card-transactions",
>>>>>>   CloudResource: "PUBSUB",
>>>>>>   SubscriptionId: "subscriptionId-1",
>>>>>>   CreateTopicIfNotExists: "FALSE",
>>>>>>   CreateSubscriptionIfNotExists: "TRUE",
>>>>>>   RecordType: "JSON", //Possible values: Avro, JSON, TSV, etc.
>>>>>>   JsonRecordSchema: "{
>>>>>>     "CardNumber": "INT",
>>>>>>     "Amount": "DOUBLE",
>>>>>>     "eventTimeStamp": "EVENT_TIME"
>>>>>>   }"
>>>>>> )
>>>>>>
>>>>>> //Create the table in BigQuery if it does not exist, then insert
>>>>>>
>>>>>> CREATE TABLE TRANSACTION_HISTORY
>>>>>> OPTIONS (
>>>>>>   ProjectId: "gcp-project",
>>>>>>   CloudResource: "BIGQUERY",
>>>>>>   Dataset: "dataset1",
>>>>>>   Table: "table1",
>>>>>>   CreateTableIfNotExists: "TRUE",
>>>>>>   TableSchema: "{
>>>>>>     "card_number": "INT",
>>>>>>     "first_name": "STRING",
>>>>>>     "last_name": "STRING",
>>>>>>     "phone": "INT",
>>>>>>     "city": "STRING",
>>>>>>     "amount": "FLOAT",
>>>>>>     "eventtimestamp": "INT"
>>>>>>   }"
>>>>>> )
>>>>>>
>>>>>> //The actual query that should get translated into a Beam DAG
>>>>>>
>>>>>> INSERT INTO TRANSACTION_HISTORY
>>>>>> SELECT pubsub.card_number, spanner.first_name, spanner.last_name,
>>>>>>   spanner.phone, spanner.city, pubsub.amount, pubsub.eventTimeStamp
>>>>>> FROM TRANSACTIONS_PUBSUB_TOPIC pubsub
>>>>>> JOIN SPANNER_CARD_INFO spanner
>>>>>>   ON (pubsub.card_number = spanner.card_number);
>>>>>>
>>>>>>
>>>>>>
>>>>>> Also, consider that if any of the sources or sinks change, we only
>>>>>> change the SQL and we are done!
>>>>>>
>>>>>> Please let me know your thoughts about this.
>>>>>>
>>>>>> Regards,
>>>>>> Taher Koitawala
>>>>>>
>>>>>>
