A SQL-specific wrapper plus custom transforms for PubsubIO should suffice. We will probably need a way to expose the message publish timestamp if we want to use it as the event timestamp, but that will be consumed by the same wrapper/transform without adding anything schema-specific or SQL-specific to PubsubIO itself.
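
To make the timestamp part concrete, here is a rough sketch in plain Python rather than Beam code. Everything in it is hypothetical: `Message` is only a stand-in for a Pubsub message (not the PubsubIO type), `event_timestamp`, `to_row` and `timestamp_attribute` are illustrative names, and the attribute value is assumed to hold epoch milliseconds. It only shows the wrapper choosing between publish time and a user-specified attribute, then flattening the message into the three parts the table would describe.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass
class Message:
    """Hypothetical stand-in for a Pubsub message; not the PubsubIO type."""
    payload: bytes
    publish_time: datetime
    attributes: Dict[str, str] = field(default_factory=dict)


def event_timestamp(msg: Message,
                    timestamp_attribute: Optional[str] = None) -> datetime:
    """Pick the event timestamp from publish time or a named attribute."""
    if timestamp_attribute is None:
        # Default: use the time the message was published.
        return msg.publish_time
    # Assumption: the attribute value is epoch milliseconds as a string.
    millis = int(msg.attributes[timestamp_attribute])
    return datetime.fromtimestamp(millis / 1000.0, tz=timezone.utc)


def to_row(msg: Message,
           timestamp_attribute: Optional[str] = None) -> Dict[str, Any]:
    """Flatten a message into timestamp, attributes map and JSON payload."""
    return {
        "event_timestamp": event_timestamp(msg, timestamp_attribute),
        "attributes": dict(msg.attributes),
        "payload": json.loads(msg.payload.decode("utf-8")),
    }
```

The point of the sketch is that the choice of timestamp source lives entirely in the wrapper, so the underlying IO only has to expose the publish time alongside the attributes and payload.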

On Thu, May 3, 2018 at 11:44 AM Reuven Lax <re...@google.com> wrote:

> Are you planning on integrating this directly into PubSubIO, or add a
> follow-on transform?
>
> On Wed, May 2, 2018 at 10:30 AM Anton Kedin <ke...@google.com> wrote:
>
>> Hi
>>
>> I am working on adding functionality to support querying Pubsub messages
>> directly from Beam SQL.
>>
>> *Goal*
>> Provide Beam users a pure SQL solution to create the pipelines with
>> Pubsub as a data source, without the need to set up the pipelines in
>> Java before applying the query.
>>
>> *High level approach*
>>
>> - Build on top of PubsubIO;
>> - Pubsub source will be declared using CREATE TABLE DDL statement:
>>   - Beam SQL already supports declaring sources like Kafka and Text
>>     using CREATE TABLE DDL;
>>   - it supports additional configuration using TBLPROPERTIES clause.
>>     Currently it takes a text blob, where we can put a JSON configuration;
>>   - wrapping PubsubIO into a similar source looks feasible;
>> - The plan is to initially support messages only with JSON payload:
>>   - more payload formats can be added later;
>> - Messages will be fully described in the CREATE TABLE statements:
>>   - event timestamps. Source of the timestamp is configurable. It is
>>     required by Beam SQL to have an explicit timestamp column for windowing
>>     support;
>>   - messages attributes map;
>>   - JSON payload schema;
>> - Event timestamps will be taken either from publish time or
>>   user-specified message attribute (configurable);
>>
>> Thoughts, ideas, comments?
>>
>> More details are in the doc here:
>> https://docs.google.com/document/d/1wIXTxh-nQ3u694XbF0iEZX_7-b3yi4ad0ML2pcAxYfE
>>
>> Thank you,
>> Anton