Back to this proposal: I think it's fine if there is a need to further distinguish the create/don't-create behaviour, either by options or by using "CREATE TABLE" vs. "CREATE EXTERNAL TABLE".
-Rui

On Thu, Mar 5, 2020 at 11:19 AM Andrew Pilloud <apill...@google.com> wrote:

> For BigQueryIO, "CREATE EXTERNAL TABLE" does exactly what you describe in
> "CREATE TABLE". You could add a table property to set the CreateDisposition
> if you wanted to change that behavior.
>
> Andrew
>
> On Thu, Mar 5, 2020 at 11:10 AM Rui Wang <ruw...@google.com> wrote:
>
>> "CREATE TABLE" can be used to indicate that if a table does not exist,
>> Beam SQL will help create it in the storage system if allowed, while
>> "CREATE EXTERNAL TABLE" can be used only for registering a table, whether
>> or not the table exists. Beam SQL provides a finer-grained way to
>> distinguish the different behaviours.
>>
>> In both cases Beam SQL does not store the table. Another approach is to
>> leverage the options/table property to specify the expected behaviour.
>>
>> -Rui
>>
>> On Thu, Mar 5, 2020 at 10:55 AM Andrew Pilloud <apill...@google.com>
>> wrote:
>>
>>> I'm not following the "CREATE TABLE" vs "CREATE EXTERNAL TABLE"
>>> distinction. We added the "EXTERNAL" to make it clear that Beam wasn't
>>> storing the table. Most of our current table providers will create the
>>> underlying table as needed.
>>>
>>> Andrew
>>>
>>> On Thu, Mar 5, 2020 at 10:47 AM Rui Wang <ruw...@google.com> wrote:
>>>
>>>> There are two pieces of news in the proposal:
>>>> 1. A Spanner source in SQL. (Contributions are welcome.)
>>>> 2. A CREATE TABLE statement in addition to CREATE EXTERNAL TABLE (the
>>>> difference being whether the table is assumed to already exist).
>>>>
>>>> There is already a table property in the statement that you can reuse
>>>> to save your options.
>>>>
>>>> -Rui
>>>>
>>>> On Thu, Mar 5, 2020 at 2:30 AM Taher Koitawala <taher...@gmail.com>
>>>> wrote:
>>>>
>>>>> Also, auto-creation is not there.
>>>>>
>>>>> On Thu, Mar 5, 2020 at 3:59 PM Taher Koitawala <taher...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The proposal is to add more sources, and also to build further
>>>>>> event-time or processing-time enhancements on them.
>>>>>>
>>>>>> On Thu, Mar 5, 2020 at 3:50 PM Andrew Pilloud <apill...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I believe we have this functionality already:
>>>>>>> https://beam.apache.org/documentation/dsls/sql/extensions/create-external-table/
>>>>>>>
>>>>>>> Existing GCP tables can also be loaded through the GCP Data Catalog
>>>>>>> metastore. What are you proposing that is new?
>>>>>>>
>>>>>>> Andrew
>>>>>>>
>>>>>>> On Thu, Mar 5, 2020, 12:29 AM Taher Koitawala <taher...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>> We have been using Apache Beam extensively to process huge amounts
>>>>>>>> of data. While Beam is really powerful and can solve a huge number
>>>>>>>> of use cases, a Beam job's development and testing time is
>>>>>>>> significantly high.
>>>>>>>>
>>>>>>>> This gap can be filled with Beam SQL, where a complete SQL-based
>>>>>>>> interface can reduce development and testing time to a matter of
>>>>>>>> minutes. It also makes Apache Beam more user friendly, so that a
>>>>>>>> wide audience with different analytical skillsets can interact
>>>>>>>> with it.
>>>>>>>>
>>>>>>>> The current Beam SQL still needs to be used programmatically, so I
>>>>>>>> propose the following additions/improvements.
>>>>>>>>
>>>>>>>> *Note: Whilst the examples given below are GCP-biased, they apply
>>>>>>>> to other sources in a generic manner.*
>>>>>>>>
>>>>>>>> For example: imagine a user who wants to write a stream-processing
>>>>>>>> job on Google Cloud Dataflow.
>>>>>>>> The user wants to process credit card transaction streams from
>>>>>>>> Google Cloud Pub/Sub (something like Kafka) and enrich each record
>>>>>>>> of the stream with data stored in Google Cloud Spanner; after
>>>>>>>> enrichment, the user wishes to write the resulting data to Google
>>>>>>>> Cloud BigQuery.
>>>>>>>>
>>>>>>>> Given below are the queries the user should be able to fire on
>>>>>>>> Beam, with the rest handled automatically by the framework.
>>>>>>>>
>>>>>>>> //Infer schema from the Spanner table upon table creation
>>>>>>>> CREATE TABLE SPANNER_CARD_INFO
>>>>>>>> OPTIONS (
>>>>>>>>   ProjectId: "gcp-project",
>>>>>>>>   InstanceId: "spanner-instance-id",
>>>>>>>>   Database: "some-database",
>>>>>>>>   Table: "card_info",
>>>>>>>>   CloudResource: "SPANNER",
>>>>>>>>   CreateTableIfNotExists: "FALSE"
>>>>>>>> )
>>>>>>>>
>>>>>>>> //Apply schema to each record read from Pub/Sub, then apply SQL.
>>>>>>>> CREATE TABLE TRANSACTIONS_PUBSUB_TOPIC
>>>>>>>> OPTIONS (
>>>>>>>>   ProjectId: "gcp-project",
>>>>>>>>   Topic: "card-transactions",
>>>>>>>>   CloudResource: "PUBSUB",
>>>>>>>>   SubscriptionId: "subscriptionId-1",
>>>>>>>>   CreateTopicIfNotExists: "FALSE",
>>>>>>>>   CreateSubscriptionIfNotExist: "TRUE",
>>>>>>>>   RecordType: "JSON", //Possible values: Avro, JSON, TSV, etc.
>>>>>>>>   JsonRecordSchema: "{
>>>>>>>>     "CardNumber": "INT",
>>>>>>>>     "Amount": "DOUBLE",
>>>>>>>>     "eventTimeStamp": "EVENT_TIME"
>>>>>>>>   }"
>>>>>>>> )
>>>>>>>>
>>>>>>>> //Create the table in BigQuery if it does not exist, and insert
>>>>>>>> CREATE TABLE TRANSACTION_HISTORY
>>>>>>>> OPTIONS (
>>>>>>>>   ProjectId: "gcp-project",
>>>>>>>>   CloudResource: "BIGQUERY",
>>>>>>>>   Dataset: "dataset1",
>>>>>>>>   Table: "table1",
>>>>>>>>   CreateTableIfNotExists: "TRUE",
>>>>>>>>   TableSchema: "{
>>>>>>>>     "card_number": "INT",
>>>>>>>>     "first_name": "STRING",
>>>>>>>>     "last_name": "STRING",
>>>>>>>>     "phone": "INT",
>>>>>>>>     "city": "STRING",
>>>>>>>>     "amount": "FLOAT",
>>>>>>>>     "eventtimestamp": "INT"
>>>>>>>>   }"
>>>>>>>> )
>>>>>>>>
>>>>>>>> //The actual query that should get stretched into a Beam DAG
>>>>>>>> INSERT INTO TRANSACTION_HISTORY
>>>>>>>> SELECT pubsub.card_number, spanner.first_name, spanner.last_name,
>>>>>>>>   spanner.phone, spanner.city, pubsub.amount, pubsub.eventTimeStamp
>>>>>>>> FROM TRANSACTIONS_PUBSUB_TOPIC pubsub
>>>>>>>> JOIN SPANNER_CARD_INFO spanner
>>>>>>>>   ON (pubsub.card_number = spanner.card_number);
>>>>>>>>
>>>>>>>> Also consider that if any of the sources or sinks change, we only
>>>>>>>> change the SQL, and we're done!
>>>>>>>>
>>>>>>>> Please let me know your thoughts about this.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Taher Koitawala
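[Editor's note] For comparison with the proposal above, here is a sketch of how parts of this pipeline could be expressed with the existing `CREATE EXTERNAL TABLE` DDL that Andrew links to. The `pubsub` and `bigquery` types and their `LOCATION`/`TBLPROPERTIES` shapes follow the documented syntax, but this is an illustrative sketch, not a tested pipeline; the `spanner` type is hypothetical, since the thread notes a Spanner source would be a new contribution.

```sql
-- Unbounded source: the documented pubsub type exposes the JSON payload
-- as a nested ROW alongside the Pub/Sub event timestamp and attributes.
CREATE EXTERNAL TABLE transactions_pubsub (
  event_timestamp TIMESTAMP,
  attributes MAP<VARCHAR, VARCHAR>,
  payload ROW<card_number BIGINT, amount DOUBLE>
)
TYPE pubsub
LOCATION 'projects/gcp-project/topics/card-transactions'
TBLPROPERTIES '{"timestampAttributeKey": "eventTimeStamp"}';

-- Hypothetical spanner type (not implemented today; see Rui's note that
-- a Spanner source would be a welcome contribution). Location format is
-- assumed.
CREATE EXTERNAL TABLE spanner_card_info (
  card_number BIGINT,
  first_name VARCHAR,
  last_name VARCHAR,
  phone BIGINT,
  city VARCHAR
)
TYPE spanner
LOCATION 'gcp-project:spanner-instance-id.some-database.card_info';

-- Documented bigquery type; per Andrew's point, the underlying
-- BigQueryIO write can create the sink table as needed.
CREATE EXTERNAL TABLE transaction_history (
  card_number BIGINT,
  first_name VARCHAR,
  last_name VARCHAR,
  phone BIGINT,
  city VARCHAR,
  amount DOUBLE,
  event_timestamp TIMESTAMP
)
TYPE bigquery
LOCATION 'gcp-project:dataset1.table1';

-- The enrichment query itself, against the registered tables.
INSERT INTO transaction_history
SELECT
  p.payload.card_number,
  s.first_name, s.last_name, s.phone, s.city,
  p.payload.amount,
  p.event_timestamp
FROM transactions_pubsub p
JOIN spanner_card_info s
  ON p.payload.card_number = s.card_number;
```

Note that joining an unbounded Pub/Sub stream with a bounded lookup table is subject to Beam SQL's join restrictions, so the exact join form here may need adjustment in practice.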