I think we should take a look at how BigTable is integrated with other SQL
systems. For example we could get some inspiration from BigQuery's support
for querying BigTable data [1]. It looks like by default it uses something
like (1), but they recognize this is difficult to process with SQL, so they
have an option you can set to elevate certain columns as sub-fields (more
like (2)), and you can also indicate you only want to get the latest value
for each column.

In any case this may be a good candidate for not requiring the user to
actually specify a schema, and instead letting the table be fully
determined by options.

[1] https://cloud.google.com/bigquery/external-data-bigtable

On Tue, Nov 3, 2020 at 11:41 PM Piotr Szuberski <piotr.szuber...@polidea.com>
wrote:

> I've dug the topic a bit and I think the 2nd approach will fit better. The
> schema in Bigtable is not supposed to change that often and specifying our
> own schema is more SQL-like and will cause less potential trouble.
>
> On 2020/11/03 11:01:57, Piotr Szuberski <piotr.szuber...@polidea.com>
> wrote:
> > I'm going to write Bigtable table for BeamSQL and I have a question
> about the schema design, which one would be preferrable.
> >
> > Bigtable stores its data in a table with rows that contain a key and
> 3-dimensional array where the 1st dimension is families with a names, 2nd
> dimension is columns with qualifiers and the 3rd cells containing timestamp
> and value.
> >
> > Two design solutions come to mind:
> > 1) Fix schema to be a generic Bigtable row:
> >
> > Row(key, Array(Row(family, Array(Row(qualifier, Array(Row(value,
> timestamp)))))))
> >
> > Then the table creation definition would always be in form:
> >
> > CREATE TABLE bigtableexample1()
> > TYPE 'bigtable'
> > LOCATION '
> https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId
> '
> >
> > 2) Let the user design his schema by providing the desired families and
> columns it sth like:
> > CREATE TABLE bigtableexample2(
> >   key VARCHAR,
> >   family1 ROW<
> >     column1 ROW<
> >       cells ARRAY<ROW<
> >         value VARCHAR,
> >         timestamp BIGINT
> >       >>
> >     >,
> >     column2 ROW<
> >       cells ARRAY<ROW<
> >         value VARCHAR,
> >         timestamp BIGINT
> >       >>
> >     >
> >   >
> > )
> > TYPE 'bigtable'
> > LOCATION '
> https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId
> '
> >
> > For me the 1st approach is more user friendly (typing schema from the
> 2nd would be troublesome) and more elastic especially when the row's schema
> (families and columns) changes and a user wants to perform 'SELECT * from
> bigtableexampleX'.
> >
> > WDYT? I'd welcome any feedback. Maybe there is some 3rd option that will
> be a better one?
> >
>

Reply via email to