Hi
I am working on adding functionality to support querying Pubsub messages
directly from Beam SQL.
*Goal*
Provide Beam users a pure SQL solution to create the pipelines with
Pubsub as a data source, without the need to set up the pipelines in Java
before applying the query.
*High level approach*
-
- Build on top of PubsubIO;
- Pubsub source will be declared using CREATE TABLE DDL statement:
- Beam SQL already supports declaring sources like Kafka and Text
using CREATE TABLE DDL;
- it supports additional configuration using TBLPROPERTIES clause.
Currently it takes a text blob, where we can put a JSON configuration;
- wrapping PubsubIO into a similar source looks feasible;
- The plan is to initially support messages only with JSON payload:
-
- more payload formats can be added later;
- Messages will be fully described in the CREATE TABLE statements:
- event timestamps. Source of the timestamp is configurable. It is
required by Beam SQL to have an explicit timestamp column for windowing
support;
- messages attributes map;
- JSON payload schema;
- Event timestamps will be taken either from publish time or
user-specified message attribute (configurable);
Thoughts, ideas, comments?
More details are in the doc here:
https://docs.google.com/document/d/1wIXTxh-nQ3u694XbF0iEZX_7-b3yi4ad0ML2pcAxYfE
Thank you,
Anton