Thanks for the resources! I'll try to follow the BQ approach. I'd also add 
something like flattened schema so the user can use simple types only. It would 
be limited to BINARY values though. Something like:
CREATE EXTERNAL TABLE(
  key VARCHAR NOT NULL,
  family VARCHAR NOT NULL,
  column VARCHAR NOT NULL,
  value BINARY NOT NULL,
  timestampMicros BIGINT NOT NULL
)
 The cells array would be flattened (denormalized) and easy to use. In case of 
single-valued cells it would also be quite efficient.

On 2020/11/05 00:22:44, Brian Hulette <[email protected]> wrote: 
> I think we should take a look at how BigTable is integrated with other SQL
> systems. For example we could get some inspiration from BigQuery's support
> for querying BigTable data [1]. It looks like by default it uses something
> like (1), but they recognize this is difficult to process with SQL, so they
> have an option you can set to elevate certain columns as sub-fields (more
> like (2)), and you can also indicate you only want to get the latest value
> for each column.
> 
> In any case this may be a good candidate for not requiring the user to
> actually specify a schema, and instead letting the table be fully
> determined by options.
> 
> [1] https://cloud.google.com/bigquery/external-data-bigtable
> 
> On Tue, Nov 3, 2020 at 11:41 PM Piotr Szuberski <[email protected]>
> wrote:
> 
> > I've dug the topic a bit and I think the 2nd approach will fit better. The
> > schema in Bigtable is not supposed to change that often and specifying our
> > own schema is more SQL-like and will cause less potential trouble.
> >
> > On 2020/11/03 11:01:57, Piotr Szuberski <[email protected]>
> > wrote:
> > > I'm going to write Bigtable table for BeamSQL and I have a question
> > about the schema design, which one would be preferrable.
> > >
> > > Bigtable stores its data in a table with rows that contain a key and
> > 3-dimensional array where the 1st dimension is families with a names, 2nd
> > dimension is columns with qualifiers and the 3rd cells containing timestamp
> > and value.
> > >
> > > Two design solutions come to mind:
> > > 1) Fix schema to be a generic Bigtable row:
> > >
> > > Row(key, Array(Row(family, Array(Row(qualifier, Array(Row(value,
> > timestamp)))))))
> > >
> > > Then the table creation definition would always be in form:
> > >
> > > CREATE TABLE bigtableexample1()
> > > TYPE 'bigtable'
> > > LOCATION '
> > https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId
> > '
> > >
> > > 2) Let the user design his schema by providing the desired families and
> > columns it sth like:
> > > CREATE TABLE bigtableexample2(
> > >   key VARCHAR,
> > >   family1 ROW<
> > >     column1 ROW<
> > >       cells ARRAY<ROW<
> > >         value VARCHAR,
> > >         timestamp BIGINT
> > >       >>
> > >     >,
> > >     column2 ROW<
> > >       cells ARRAY<ROW<
> > >         value VARCHAR,
> > >         timestamp BIGINT
> > >       >>
> > >     >
> > >   >
> > > )
> > > TYPE 'bigtable'
> > > LOCATION '
> > https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId
> > '
> > >
> > > For me the 1st approach is more user friendly (typing schema from the
> > 2nd would be troublesome) and more elastic especially when the row's schema
> > (families and columns) changes and a user wants to perform 'SELECT * from
> > bigtableexampleX'.
> > >
> > > WDYT? I'd welcome any feedback. Maybe there is some 3rd option that will
> > be a better one?
> > >
> >
> 

Reply via email to