For BigQueryIO, "CREATE EXTERNAL TABLE" does exactly what you describe in
"CREATE TABLE". You could add a table property to set the CreateDisposition
if you wanted to change that behavior.
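
For example, a minimal sketch: the TYPE/LOCATION/TBLPROPERTIES shape below is
the documented BigQuery table syntax, but the createDisposition property is
hypothetical, purely to illustrate the idea:

CREATE EXTERNAL TABLE transaction_history (
  card_number INTEGER,
  amount DOUBLE
)
TYPE bigquery
LOCATION 'gcp-project:dataset1.table1'
TBLPROPERTIES '{"createDisposition": "CREATE_NEVER"}'  -- hypothetical property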

Andrew

On Thu, Mar 5, 2020 at 11:10 AM Rui Wang <ruw...@google.com> wrote:

> "CREATE TABLE" can be used to indicate if a table does not exist, BeamSQL
> will help create it in storage systems if allowed, while "CREATE EXTERNAL
> TABLE" can be used only for registering a table, no matter if the table
> exists or not. BeamSQL provides a finer-grained way to distinct
> different behaviours.
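>
> For example, under the proposed semantics (a sketch only; the CREATE TABLE
> variant is not implemented today):
>
> -- Proposed: creates the table in the storage system if it does not exist.
> CREATE TABLE my_table (f INTEGER) TYPE bigquery LOCATION 'project:dataset.t'
>
> -- Existing: only registers the table; its existence is not checked.
> CREATE EXTERNAL TABLE my_table (f INTEGER) TYPE bigquery LOCATION 'project:dataset.t'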
>
> In both cases BeamSQL does not store the table. Another approach is to
> leverage the options/table property to specify the expected behaviour.
>
>
> -Rui
>
> On Thu, Mar 5, 2020 at 10:55 AM Andrew Pilloud <apill...@google.com>
> wrote:
>
>> I'm not following the "CREATE TABLE" vs "CREATE EXTERNAL TABLE"
>> distinction. We added the "EXTERNAL" to make it clear that Beam wasn't
>> storing the table. Most of our current table providers will create the
>> underlying table as needed.
>>
>> Andrew
>>
>> On Thu, Mar 5, 2020 at 10:47 AM Rui Wang <ruw...@google.com> wrote:
>>
>>> There are two new things in the proposal:
>>> 1. A Spanner source in SQL. (You are welcome to contribute it.)
>>> 2. A CREATE TABLE statement in addition to CREATE EXTERNAL TABLE (the
>>> difference being whether the table is assumed to exist or not).
>>>
>>>
>>> The statement already includes a table property that you can reuse to
>>> save your options.
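>>>
>>> For reference, the documented shape of the statement, with the table
>>> property slot at the end:
>>>
>>> CREATE EXTERNAL TABLE [ IF NOT EXISTS ] tableName (tableElement [, ...])
>>> TYPE type
>>> [LOCATION location]
>>> [TBLPROPERTIES tblProperties]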
>>>
>>>
>>> -Rui
>>>
>>> On Thu, Mar 5, 2020 at 2:30 AM Taher Koitawala <taher...@gmail.com>
>>> wrote:
>>>
>>>> Also, auto-creation is not there.
>>>>
>>>> On Thu, Mar 5, 2020 at 3:59 PM Taher Koitawala <taher...@gmail.com>
>>>> wrote:
>>>>
>>>>> The proposal is to add more sources and also to build event-time or
>>>>> processing-time enhancements on top of them.
>>>>>
>>>>> On Thu, Mar 5, 2020 at 3:50 PM Andrew Pilloud <apill...@google.com>
>>>>> wrote:
>>>>>
>>>>>> I believe we have this functionality already:
>>>>>> https://beam.apache.org/documentation/dsls/sql/extensions/create-external-table/
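>>>>>>
>>>>>> For example, a Pub/Sub table along the lines of the page above (the
>>>>>> topic and payload fields here are taken from this thread's scenario
>>>>>> and are illustrative):
>>>>>>
>>>>>> CREATE EXTERNAL TABLE transactions (
>>>>>>   event_timestamp TIMESTAMP,
>>>>>>   attributes MAP<VARCHAR, VARCHAR>,
>>>>>>   payload ROW<card_number INTEGER, amount DOUBLE>
>>>>>> )
>>>>>> TYPE pubsub
>>>>>> LOCATION 'projects/gcp-project/topics/card-transactions'
>>>>>> TBLPROPERTIES '{"timestampAttributeKey": "eventTimeStamp"}'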
>>>>>>
>>>>>> Existing GCP tables can also be loaded through the GCP datacatalog
>>>>>> metastore. What are you proposing that is new?
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 5, 2020, 12:29 AM Taher Koitawala <taher...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>          We have been using Apache Beam extensively to process huge
>>>>>>> amounts of data. While Beam is really powerful and can solve a huge
>>>>>>> number of use cases, a Beam job's development and testing time is
>>>>>>> significantly high.
>>>>>>>
>>>>>>>    This gap can be filled with Beam SQL, where a complete SQL-based
>>>>>>> interface can reduce development and testing time to a matter of
>>>>>>> minutes. It also makes Apache Beam more user friendly, so that a wide
>>>>>>> variety of audiences with different analytical skillsets can interact
>>>>>>> with it.
>>>>>>>
>>>>>>> The current Beam SQL still needs to be used programmatically, so I
>>>>>>> propose the following additions/improvements.
>>>>>>>
>>>>>>> *Note: Whilst the examples given below are GCP biased, they apply to
>>>>>>> other sources in a generic manner.*
>>>>>>>
>>>>>>> For example: imagine a user who wants to write a stream processing
>>>>>>> job on Google Cloud Dataflow. The user wants to process credit card
>>>>>>> transaction streams from Google Cloud PubSub (something like Kafka),
>>>>>>> enrich each record of the stream with some data stored in Google
>>>>>>> Cloud Spanner, and, after enrichment, write the resulting data to
>>>>>>> Google Cloud BigQuery.
>>>>>>>
>>>>>>> Given below are the queries which the user should be able to fire on
>>>>>>> Beam, with the rest handled automatically by the framework.
>>>>>>>
>>>>>>> -- Infer schema from the Spanner table upon table creation.
>>>>>>> CREATE TABLE SPANNER_CARD_INFO
>>>>>>> OPTIONS (
>>>>>>>   ProjectId: "gcp-project",
>>>>>>>   InstanceId: "spanner-instance-id",
>>>>>>>   Database: "some-database",
>>>>>>>   Table: "card_info",
>>>>>>>   CloudResource: "SPANNER",
>>>>>>>   CreateTableIfNotExists: "FALSE"
>>>>>>> )
>>>>>>> -- Apply this schema to each record read from PubSub, then apply SQL.
>>>>>>> CREATE TABLE TRANSACTIONS_PUBSUB_TOPIC
>>>>>>> OPTIONS (
>>>>>>>   ProjectId: "gcp-project",
>>>>>>>   Topic: "card-transactions",
>>>>>>>   CloudResource: "PUBSUB",
>>>>>>>   SubscriptionId: "subscriptionId-1",
>>>>>>>   CreateTopicIfNotExists: "FALSE",
>>>>>>>   CreateSubscriptionIfNotExists: "TRUE",
>>>>>>>   RecordType: "JSON", -- possible values: Avro, JSON, TSV, etc.
>>>>>>>   JsonRecordSchema: "{
>>>>>>>     "CardNumber": "INT",
>>>>>>>     "Amount": "DOUBLE",
>>>>>>>     "eventTimeStamp": "EVENT_TIME"
>>>>>>>   }"
>>>>>>> )
>>>>>>>
>>>>>>> -- Create the table in BigQuery if it does not exist, then insert.
>>>>>>> CREATE TABLE TRANSACTION_HISTORY
>>>>>>> OPTIONS (
>>>>>>>   ProjectId: "gcp-project",
>>>>>>>   CloudResource: "BIGQUERY",
>>>>>>>   Dataset: "dataset1",
>>>>>>>   Table: "table1",
>>>>>>>   CreateTableIfNotExists: "TRUE",
>>>>>>>   TableSchema: "{
>>>>>>>     "card_number": "INT",
>>>>>>>     "first_name": "STRING",
>>>>>>>     "last_name": "STRING",
>>>>>>>     "phone": "INT",
>>>>>>>     "city": "STRING",
>>>>>>>     "amount": "FLOAT",
>>>>>>>     "eventtimestamp": "INT"
>>>>>>>   }"
>>>>>>> )
>>>>>>>
>>>>>>> -- The actual query that should get translated into a Beam DAG.
>>>>>>> INSERT INTO TRANSACTION_HISTORY
>>>>>>> SELECT
>>>>>>>   pubsub.card_number,
>>>>>>>   spanner.first_name,
>>>>>>>   spanner.last_name,
>>>>>>>   spanner.phone,
>>>>>>>   spanner.city,
>>>>>>>   pubsub.amount,
>>>>>>>   pubsub.eventTimeStamp
>>>>>>> FROM TRANSACTIONS_PUBSUB_TOPIC pubsub
>>>>>>> JOIN SPANNER_CARD_INFO spanner
>>>>>>>   ON (pubsub.card_number = spanner.card_number);
>>>>>>>
>>>>>>> Also consider that if any of the sources or sinks change, we only
>>>>>>> change the SQL and we are done!
>>>>>>>
>>>>>>> Please let me know your thoughts about this.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Taher Koitawala
>>>>>>>
>>>>>>>
