Unfortunately, according to the documentation BeamSQL doesn't work well with 
ARRAY<COMPLEX_TYPE>, like ARRAY<ROW<...>>, which I confirmed empirically.

The only way I found to retrieve an array's values was to get them by index, e.g. 
SELECT t.complex_array[1].row_field FROM some_table t;

UNNEST and selecting the whole array don't work with ARRAY<ROW>.
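To make the behavior concrete, a minimal sketch of what I observed (the table and column names here are hypothetical, just for illustration):

```sql
-- Works: addressing a single array element by index.
SELECT t.complex_array[1].row_field FROM some_table t;

-- Fails with ARRAY<ROW> in my tests: selecting the whole array...
SELECT t.complex_array FROM some_table t;

-- ...and unnesting it.
SELECT x.row_field FROM some_table t, UNNEST(t.complex_array) AS x;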

I think that if a user wants to have a list of cells then it has to be 
truncated to the "value" only, or to just the most recent cell if it has to be 
a ROW with timestamp, labels etc.

This limitation doesn't apply to the flattened rows, but another one takes its 
place - the value has to be of BINARY type and then be parsed by the user.
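For example, with the flattened schema the user would have to decode each BINARY value according to how it was written - a sketch (table name hypothetical, and whether a plain CAST works for UTF-8-encoded cells is an assumption on my part, not something I've verified):

```sql
-- Cells written as UTF-8 strings: the user decodes the BINARY value,
-- e.g. via a CAST; binary-encoded numbers would need their own decoding.
SELECT key, family, CAST(value AS VARCHAR) AS str_value, timestampMicros
FROM flat_table
WHERE family = 'family1';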

I'll try to make it as flexible as possible - the ARRAY<ROW> limitation makes it 
much less straightforward.

Thanks for the references! As I understand it, Flink only cares about the most 
recent values of a column and ignores the timestamps and labels? I'm referring 
to this: family1 ROW<q1 INT>  -- family1.q1 is the most recent cell value of 
the q1 column?
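From my reading of the Flink docs, the mapping looks like the sketch below (connector options and the HBase version string are assumptions on my part):

```sql
-- Flink SQL HBase connector: each column family maps to a ROW, each
-- qualifier to a field. Only the latest cell version per column appears
-- to be exposed; timestamps and labels are not part of the schema.
CREATE TABLE hTable (
  rowkey INT,
  family1 ROW<q1 INT>,
  PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
  'connector' = 'hbase-2.2',
  'table-name' = 'mytable',
  'zookeeper.quorum' = 'localhost:2181'
);

-- family1.q1 resolves to a single INT, i.e. the most recent cell value.
SELECT rowkey, family1.q1 FROM hTable;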

On 2020/11/06 20:43:03, Ismaël Mejía <ieme...@gmail.com> wrote: 
> Thanks for the references Rui. I think it is worth considering how
> open source systems do it.
> The great thing about this is that we could 'easily' map Piotr's work
> for Bigtable to HBase too once it is done.
> 
> On Fri, Nov 6, 2020 at 8:22 PM Rui Wang <ruw...@google.com> wrote:
> >
> > Another two references are from how Flink and Spark uses HBase by SQL:
> >
> > https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/hbase.html
> > https://stackoverflow.com/questions/39530938/sparksql-on-hbase-tables
> >
> > -Rui
> >
> > On Thu, Nov 5, 2020 at 9:46 AM Piotr Szuberski 
> > <piotr.szuber...@polidea.com> wrote:
> >>
> >> Thanks for the resources! I'll try to follow the BQ approach. I'd also add 
> >> a flattened schema so the user can use simple types only. It would be 
> >> limited to BINARY values though. Something like:
> >> CREATE EXTERNAL TABLE(
> >>   key VARCHAR NOT NULL,
> >>   family VARCHAR NOT NULL,
> >>   column VARCHAR NOT NULL,
> >>   value BINARY NOT NULL,
> >>   timestampMicros BIGINT NOT NULL
> >> )
> >> The cells array would be flattened (denormalized) and easy to use. In the 
> >> case of single-valued cells it would also be quite efficient.
> >>
> >> On 2020/11/05 00:22:44, Brian Hulette <bhule...@google.com> wrote:
> >> > I think we should take a look at how BigTable is integrated with other
> >> > SQL systems. For example we could get some inspiration from BigQuery's
> >> > support for querying BigTable data [1]. It looks like by default it uses
> >> > something like (1), but they recognize this is difficult to process with
> >> > SQL, so they have an option you can set to elevate certain columns as
> >> > sub-fields (more like (2)), and you can also indicate you only want to
> >> > get the latest value for each column.
> >> >
> >> > In any case this may be a good candidate for not requiring the user to
> >> > actually specify a schema, and instead letting the table be fully
> >> > determined by options.
> >> >
> >> > [1] https://cloud.google.com/bigquery/external-data-bigtable
> >> >
> >> > On Tue, Nov 3, 2020 at 11:41 PM Piotr Szuberski 
> >> > <piotr.szuber...@polidea.com>
> >> > wrote:
> >> >
> >> > > I've dug into the topic a bit and I think the 2nd approach will fit
> >> > > better. The schema in Bigtable is not supposed to change that often and
> >> > > specifying our own schema is more SQL-like and will cause less potential
> >> > > trouble.
> >> > >
> >> > > On 2020/11/03 11:01:57, Piotr Szuberski <piotr.szuber...@polidea.com>
> >> > > wrote:
> >> > > > I'm going to write Bigtable table for BeamSQL and I have a question
> >> > > about the schema design, which one would be preferable.
> >> > > >
> >> > > > Bigtable stores its data in a table with rows that contain a key and a
> >> > > 3-dimensional array where the 1st dimension is families with names, the
> >> > > 2nd dimension is columns with qualifiers and the 3rd is cells containing
> >> > > a timestamp and value.
> >> > > >
> >> > > > Two design solutions come to mind:
> >> > > > 1) Fix schema to be a generic Bigtable row:
> >> > > >
> >> > > > Row(key, Array(Row(family, Array(Row(qualifier, Array(Row(value,
> >> > > timestamp)))))))
> >> > > >
> >> > > > Then the table creation definition would always be in form:
> >> > > >
> >> > > > CREATE TABLE bigtableexample1()
> >> > > > TYPE 'bigtable'
> >> > > > LOCATION '
> >> > > https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId
> >> > > '
> >> > > >
> >> > > > 2) Let the user design his schema by providing the desired families
> >> > > > and columns in sth like:
> >> > > > CREATE TABLE bigtableexample2(
> >> > > >   key VARCHAR,
> >> > > >   family1 ROW<
> >> > > >     column1 ROW<
> >> > > >       cells ARRAY<ROW<
> >> > > >         value VARCHAR,
> >> > > >         timestamp BIGINT
> >> > > >       >>
> >> > > >     >,
> >> > > >     column2 ROW<
> >> > > >       cells ARRAY<ROW<
> >> > > >         value VARCHAR,
> >> > > >         timestamp BIGINT
> >> > > >       >>
> >> > > >     >
> >> > > >   >
> >> > > > )
> >> > > > TYPE 'bigtable'
> >> > > > LOCATION '
> >> > > https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId
> >> > > '
> >> > > >
> >> > > > For me the 1st approach is more user friendly (typing the schema from
> >> > > the 2nd would be troublesome) and more flexible, especially when the
> >> > > row's schema (families and columns) changes and a user wants to perform
> >> > > 'SELECT * FROM bigtableexampleX'.
> >> > > >
> >> > > > WDYT? I'd welcome any feedback. Maybe there is some 3rd option that
> >> > > would be better?
> >> > > >
> >> > >
> >> >
> 
