I like to avoid magic too. I might not have been entirely clear in what I
was asking. Here is an example of what I had in mind, replacing the
TBLPROPERTIES
with a more generic TIMESTAMP option:
CREATE TABLE table_name (
publishTimestamp TIMESTAMP,
attributes MAP(VARCHAR, VARCHAR),
payload ROW (
name VARCHAR,
age INTEGER,
isSWE BOOLEAN,
tags ARRAY(VARCHAR)))
TIMESTAMP attributes["createTime"];
Andrew
On Thu, May 3, 2018 at 12:47 PM Anton Kedin <[email protected]> wrote:
> I think it makes sense for the case when timestamp is provided in the
> payload (including pubsub message attributes). We can mark the field as an
> event timestamp. But if the timestamp is internally defined by the source
> (pubsub message publish time) and not exposed in the event body, then we
> need a source-specific mechanism to extract and map the event timestamp to
> the schema. This is, of course, if we don't automatically add a magic
> timestamp field which Beam SQL can populate behind the scenes and add to
> the schema. I want to avoid this magic path for now.
>
> On Thu, May 3, 2018 at 11:10 AM Andrew Pilloud <[email protected]>
> wrote:
>
>> This sounds awesome!
>>
>> Is event timestamp something that we need to specify for every source? If
>> so, I would suggest we add this as a first class option on CREATE TABLE
>> rather then something hidden in TBLPROPERTIES.
>>
>> Andrew
>>
>> On Wed, May 2, 2018 at 10:30 AM Anton Kedin <[email protected]> wrote:
>>
>>> Hi
>>>
>>> I am working on adding functionality to support querying Pubsub messages
>>> directly from Beam SQL.
>>>
>>> *Goal*
>>> Provide Beam users a pure SQL solution to create the pipelines with
>>> Pubsub as a data source, without the need to set up the pipelines in
>>> Java before applying the query.
>>>
>>> *High level approach*
>>>
>>> -
>>> - Build on top of PubsubIO;
>>> - Pubsub source will be declared using CREATE TABLE DDL statement:
>>> - Beam SQL already supports declaring sources like Kafka and Text
>>> using CREATE TABLE DDL;
>>> - it supports additional configuration using TBLPROPERTIES
>>> clause. Currently it takes a text blob, where we can put a JSON
>>> configuration;
>>> - wrapping PubsubIO into a similar source looks feasible;
>>> - The plan is to initially support messages only with JSON payload:
>>> -
>>> - more payload formats can be added later;
>>> - Messages will be fully described in the CREATE TABLE statements:
>>> - event timestamps. Source of the timestamp is configurable. It
>>> is required by Beam SQL to have an explicit timestamp column for
>>> windowing
>>> support;
>>> - messages attributes map;
>>> - JSON payload schema;
>>> - Event timestamps will be taken either from publish time or
>>> user-specified message attribute (configurable);
>>>
>>> Thoughts, ideas, comments?
>>>
>>> More details are in the doc here:
>>> https://docs.google.com/document/d/1wIXTxh-nQ3u694XbF0iEZX_7-b3yi4ad0ML2pcAxYfE
>>>
>>>
>>> Thank you,
>>> Anton
>>>
>>