Re: [DISCUSS] Query external resources as Tables with Beam SQL

Rui Wang Thu, 05 Mar 2020 13:15:24 -0800

Back to this proposal, I think it's ok if there is a need to
further distinguish the create/not create behaviour by either options or
using "create external table/create table".




-Rui

On Thu, Mar 5, 2020 at 11:19 AM Andrew Pilloud <apill...@google.com> wrote:

> For BigQueryIO, "CREATE EXTERNAL TABLE" does exactly what you describe in
> "CREATE TABLE". You could add a table property to set the CreateDisposition
> if you wanted to change that behavior.
>
> Andrew
>
> On Thu, Mar 5, 2020 at 11:10 AM Rui Wang <ruw...@google.com> wrote:
>
>> "CREATE TABLE" can be used to indicate if a table does not exist, BeamSQL
>> will help create it in storage systems if allowed, while "CREATE EXTERNAL
>> TABLE" can be used only for registering a table, no matter if the table
>> exists or not. BeamSQL provides a finer-grained way to distinct
>> different behaviours.
>>
>> In both cases BeamSQL does not store the table. Another approach is to
>> leverage the options/table property to specify the expected behaviour.
>>
>>
>> -Rui
>>
>> On Thu, Mar 5, 2020 at 10:55 AM Andrew Pilloud <apill...@google.com>
>> wrote:
>>
>>> I'm not following the "CREATE TABLE" vs "CREATE EXTERNAL TABLE"
>>> distinction. We added the "EXTERNAL" to make it clear that Beam wasn't
>>> storing the table. Most of our current table providers will create the
>>> underlying table as needed.
>>>
>>> Andrew
>>>
>>> On Thu, Mar 5, 2020 at 10:47 AM Rui Wang <ruw...@google.com> wrote:
>>>
>>>> There are two pieces of news from the proposal:
>>>> 1. Spanner source in SQL. (Welcome to contribute it)
>>>> 2. CREATE TABLE statement than CREATE EXTERNAL TABLE (the difference is
>>>> whether assuming the table exists or not)
>>>>
>>>>
>>>> There is a table property in the statement already that you can reuse
>>>> to save your options.
>>>>
>>>>
>>>> -Rui
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Mar 5, 2020 at 2:30 AM Taher Koitawala <taher...@gmail.com>
>>>> wrote:
>>>>
>>>>> Also auto creation is not there
>>>>>
>>>>> On Thu, Mar 5, 2020 at 3:59 PM Taher Koitawala <taher...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Proposal is to add more sources and also have time event time or
>>>>>> processing enhancements further on them
>>>>>>
>>>>>> On Thu, Mar 5, 2020 at 3:50 PM Andrew Pilloud <apill...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I believe we have this functionality alredy:
>>>>>>> https://beam.apache.org/documentation/dsls/sql/extensions/create-external-table/
>>>>>>>
>>>>>>> Existing GCP tables can also be loaded through the GCP datacatalog
>>>>>>> metastore. What are you proposing that is new?
>>>>>>>
>>>>>>> Andrew
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Mar 5, 2020, 12:29 AM Taher Koitawala <taher...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>          We have been using Apache Beam extensively to process huge
>>>>>>>> amounts of data, while beam is really powerful and can solve a huge 
>>>>>>>> number
>>>>>>>> of use cases. A Beam job's development and testing time is 
>>>>>>>> significantly
>>>>>>>> high.
>>>>>>>>
>>>>>>>>    This gap can be filled with Beam SQL, where a complete SQL based
>>>>>>>> interface can reduce development and testing time to matter of 
>>>>>>>> minutes, it
>>>>>>>> also makes Apache Beam more user friendly where a wide variety of 
>>>>>>>> audience
>>>>>>>> with different analytical skillsets can interact.
>>>>>>>>
>>>>>>>> The current Beam SQL is still needs to be used programmatically,
>>>>>>>> and so I propose the following additions/improvements.
>>>>>>>>
>>>>>>>> *Note: Whist the below given examples are more GCP biased, they
>>>>>>>> apply to other sources in a generic manner*
>>>>>>>>
>>>>>>>> For Example: Imagine a user who wants to write a stream processing
>>>>>>>> job on Google Cloud Dataflow. The user wants to process credit card
>>>>>>>> transaction streams from Google Cloud PubSub (Something like Kafka) and
>>>>>>>> enrich each record of the stream with some data that is stored in 
>>>>>>>> Google
>>>>>>>> Cloud Spanner, after enrichment the user wishes to write the following 
>>>>>>>> data
>>>>>>>> to Google Cloud BigQuery.
>>>>>>>>
>>>>>>>> Given Below are the queries which the user should be able to fire
>>>>>>>> on Beam and the rest should be automatically handled by the framework.
>>>>>>>>
>>>>>>>> //Infer schema from Spanner table upon table creation
>>>>>>>>
>>>>>>>> CREATE TABLE SPANNER_CARD_INFO
>>>>>>>>
>>>>>>>> OPTIONS (
>>>>>>>>
>>>>>>>>  ProjectId: “gcp-project”,
>>>>>>>>
>>>>>>>>  InstanceId : “spanner-instance-id”,
>>>>>>>>
>>>>>>>>  Database: “some-database”,
>>>>>>>>
>>>>>>>>  Table: “card_info”,
>>>>>>>>
>>>>>>>>  CloudResource: “SPANNER”,
>>>>>>>>
>>>>>>>> CreateTableIfNotExists: “FALSE”
>>>>>>>>
>>>>>>>>   )
>>>>>>>>  //Apply schema to each record read from pubsub, and then apply SQL.
>>>>>>>>
>>>>>>>> CREATE TABLE TRANSACTIONS_PUBSUB_TOPIC
>>>>>>>>
>>>>>>>> OPTIONS (
>>>>>>>>
>>>>>>>> ProjectId: “gcp-project”,
>>>>>>>>
>>>>>>>> Topic: “card-transactions”,
>>>>>>>>
>>>>>>>> CloudResource : “PUBSUB”
>>>>>>>>
>>>>>>>> SubscriptionId : “subscriptionId-1”,
>>>>>>>>
>>>>>>>> CreateTopicIfNotExists: “FALSE”,
>>>>>>>>
>>>>>>>> CreateSubscriptionIfNotExist: “TRUE”,
>>>>>>>>
>>>>>>>> RecordType: “JSON” //POssible values: Avro, JSON, TVS..etc
>>>>>>>>
>>>>>>>> JsonRecordSchema : “{
>>>>>>>>
>>>>>>>> “CardNumber” : “INT”,
>>>>>>>>
>>>>>>>> “Amount”: “DOUBLE”,
>>>>>>>>
>>>>>>>> “eventTimeStamp” : “EVENT_TIME”
>>>>>>>>
>>>>>>>> }”)
>>>>>>>>
>>>>>>>> //Create table in BigQuery if not exists and insert
>>>>>>>>
>>>>>>>> CREATE TABLE TRANSACTION_HISTORY
>>>>>>>>
>>>>>>>> OPTIONS (
>>>>>>>>
>>>>>>>> ProjectId: “gcp-project”,
>>>>>>>>
>>>>>>>> CloudResource : “BIGQUERY”
>>>>>>>>
>>>>>>>> dataset: “dataset1”,
>>>>>>>>
>>>>>>>> table : “table1”,
>>>>>>>>
>>>>>>>> CreateTableIfNotExists: “TRUE”,
>>>>>>>>
>>>>>>>> TableSchema : “
>>>>>>>>
>>>>>>>> {
>>>>>>>>
>>>>>>>> “card_number” : “INT”,
>>>>>>>>
>>>>>>>> “first_name” : “STRING”,
>>>>>>>>
>>>>>>>> “last_name” : “STRING”,
>>>>>>>>
>>>>>>>> “phone” : “INT”,
>>>>>>>>
>>>>>>>> “city” : “STRING”,
>>>>>>>>
>>>>>>>> “amount”: “FLOAT”,
>>>>>>>>
>>>>>>>> “eventtimestamp” : “INT”,
>>>>>>>>
>>>>>>>> }”)
>>>>>>>>
>>>>>>>> //Actual query that should get stretched to a Beam dag
>>>>>>>>
>>>>>>>> INSERT INTO TRANSACTION_HISTORY
>>>>>>>>
>>>>>>>> SELECT
>>>>>>>> pubsub.card_number,spanner.first_name,spanner.last_name,spanner.phone,spanner.city,pubsub.amount,pubsub.eventTimeStamp
>>>>>>>> FROM TRANSACTIONS_PUBSUB_TOPIC pubsub join SPANNER_CARD_INFO
>>>>>>>> spanner on (pubsub.card_number = spanner.card_number);
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Also to consider that if any of the sources or sinks change, we
>>>>>>>> only change the SQL and done!.
>>>>>>>>
>>>>>>>> Please let me know your thoughts about this.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Taher Koitawala
>>>>>>>>
>>>>>>>>

Re: [DISCUSS] Query external resources as Tables with Beam SQL

Reply via email to