"CREATE TABLE" can be used to indicate if a table does not exist, BeamSQL will help create it in storage systems if allowed, while "CREATE EXTERNAL TABLE" can be used only for registering a table, no matter if the table exists or not. BeamSQL provides a finer-grained way to distinct different behaviours.
On Thu, Mar 5, 2020 at 10:55 AM Andrew Pilloud <apill...@google.com> wrote:

I'm not following the "CREATE TABLE" vs "CREATE EXTERNAL TABLE" distinction. We added the "EXTERNAL" to make it clear that Beam wasn't storing the table. Most of our current table providers will create the underlying table as needed.

Andrew

On Thu, Mar 5, 2020 at 10:47 AM Rui Wang <ruw...@google.com> wrote:

There are two pieces of news in the proposal:
1. A Spanner source in SQL. (You are welcome to contribute it.)
2. A CREATE TABLE statement rather than CREATE EXTERNAL TABLE (the difference is whether the table is assumed to exist).

There is already a table property in the statement that you can reuse to save your options.

-Rui

On Thu, Mar 5, 2020 at 2:30 AM Taher Koitawala <taher...@gmail.com> wrote:

Also, auto-creation is not there.

On Thu, Mar 5, 2020 at 3:59 PM Taher Koitawala <taher...@gmail.com> wrote:

The proposal is to add more sources, and also to build event-time and processing-time enhancements on top of them.

On Thu, Mar 5, 2020 at 3:50 PM Andrew Pilloud <apill...@google.com> wrote:

I believe we already have this functionality:
https://beam.apache.org/documentation/dsls/sql/extensions/create-external-table/

Existing GCP tables can also be loaded through the GCP Data Catalog metastore. What are you proposing that is new?

Andrew

On Thu, Mar 5, 2020, 12:29 AM Taher Koitawala <taher...@gmail.com> wrote:

Hi All,

We have been using Apache Beam extensively to process huge amounts of data. While Beam is really powerful and can solve a huge number of use cases, a Beam job's development and testing time is significantly high.

This gap can be filled with Beam SQL: a complete SQL-based interface can reduce development and testing time to a matter of minutes, and it also makes Apache Beam more user friendly, since a wide audience with different analytical skill sets can interact with it.

The current Beam SQL still needs to be used programmatically, so I propose the following additions/improvements.

Note: Whilst the examples below are GCP-biased, they apply to other sources in a generic manner.

For example: imagine a user who wants to write a stream processing job on Google Cloud Dataflow. The user wants to process credit card transaction streams from Google Cloud Pub/Sub (something like Kafka) and enrich each record of the stream with data stored in Google Cloud Spanner; after enrichment, the user wishes to write the resulting data to Google Cloud BigQuery.
Given below are the queries which the user should be able to fire on Beam; the rest should be handled automatically by the framework.

//Infer schema from the Spanner table upon table creation
CREATE TABLE SPANNER_CARD_INFO
OPTIONS (
  ProjectId: "gcp-project",
  InstanceId: "spanner-instance-id",
  Database: "some-database",
  Table: "card_info",
  CloudResource: "SPANNER",
  CreateTableIfNotExists: "FALSE"
)

//Apply a schema to each record read from Pub/Sub, then apply SQL
CREATE TABLE TRANSACTIONS_PUBSUB_TOPIC
OPTIONS (
  ProjectId: "gcp-project",
  Topic: "card-transactions",
  CloudResource: "PUBSUB",
  SubscriptionId: "subscriptionId-1",
  CreateTopicIfNotExists: "FALSE",
  CreateSubscriptionIfNotExist: "TRUE",
  RecordType: "JSON", //Possible values: Avro, JSON, TSV, etc.
  JsonRecordSchema: "{
    "CardNumber": "INT",
    "Amount": "DOUBLE",
    "eventTimeStamp": "EVENT_TIME"
  }"
)

//Create the table in BigQuery if it does not exist, then insert
CREATE TABLE TRANSACTION_HISTORY
OPTIONS (
  ProjectId: "gcp-project",
  CloudResource: "BIGQUERY",
  Dataset: "dataset1",
  Table: "table1",
  CreateTableIfNotExists: "TRUE",
  TableSchema: "{
    "card_number": "INT",
    "first_name": "STRING",
    "last_name": "STRING",
    "phone": "INT",
    "city": "STRING",
    "amount": "FLOAT",
    "eventtimestamp": "INT"
  }"
)

//The actual query, which should get stretched into a Beam DAG
INSERT INTO TRANSACTION_HISTORY
SELECT pubsub.card_number,
       spanner.first_name,
       spanner.last_name,
       spanner.phone,
       spanner.city,
       pubsub.amount,
       pubsub.eventTimeStamp
FROM TRANSACTIONS_PUBSUB_TOPIC pubsub
JOIN SPANNER_CARD_INFO spanner
  ON (pubsub.card_number = spanner.card_number);

Also consider that if any of the sources or sinks change, we only change the SQL and we are done!

Please let me know your thoughts about this.

Regards,
Taher Koitawala
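For comparison, here is a sketch of how the Pub/Sub source and BigQuery sink from the proposal could be registered with the existing CREATE EXTERNAL TABLE DDL that Andrew linked, combined with Rui's suggestion of reusing the table property to carry the auto-creation flag. The "createTableIfNotExists" key is hypothetical, not an existing option; the column layout follows the extension docs; and a Spanner provider does not exist yet (point 1 of Rui's summary).

-- Existing DDL: a Pub/Sub source with the payload schema declared inline.
CREATE EXTERNAL TABLE TRANSACTIONS_PUBSUB_TOPIC (
  event_timestamp TIMESTAMP,
  attributes MAP<VARCHAR, VARCHAR>,
  payload ROW<card_number BIGINT, amount DOUBLE>
)
TYPE pubsub
LOCATION 'projects/gcp-project/topics/card-transactions'

-- Existing DDL plus Rui's suggestion: the hypothetical
-- "createTableIfNotExists" key rides in TBLPROPERTIES instead of new syntax.
CREATE EXTERNAL TABLE TRANSACTION_HISTORY (
  card_number BIGINT,
  first_name VARCHAR,
  last_name VARCHAR,
  phone BIGINT,
  city VARCHAR,
  amount DOUBLE,
  eventtimestamp TIMESTAMP
)
TYPE bigquery
LOCATION 'gcp-project:dataset1.table1'
TBLPROPERTIES '{"createTableIfNotExists": true}'

Expressed this way, the proposal arguably reduces to new table providers (such as Spanner) plus agreed-upon property keys, rather than new DDL.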