Re: Bigtable for BeamSQL - question about the schema design

Brian Hulette Tue, 10 Nov 2020 10:25:13 -0800

On Tue, Nov 10, 2020 at 5:46 AM Piotr Szuberski <[email protected]>
wrote:


> Unfortunately according to the documentation, BeamSQL doesn't work well
> with ARRAY<COMPLEX_TYPE>, like ARRAY<ROW<>> which I confirmed empirically.
>
>
Is there a jira for this issue?


> The only way to retrieve array's values was to get it by index, e.g.
> SELECT t.complex_array[1].row_field from some_table t;
>
> Unnest and just taking an array doesn't work with ARRAY<ROW>.
>
> I think that if a user wants to have a list of cells then it has to be
> truncated to "value" only and just the recent cell if it has to be ROW with
> timestamp, labels etc.
>
> This limitation doesn't happen in the flattened rows, but another one
> takes place - the value has to be of BINARY type and then parsed by user.
>
> I'll try to make it possibly elastic - ARRAY<ROW> limitation makes it much
> less straightforward.
>
> Thanks for the references! As I understand Flink cares only for the recent
> values of the column and ignores the timestamps and labels? I refer to
> this:
> family1 ROW<q1 INT>  -- family1.q1 is the most recent cell value of the q1
> column?
>

I can't find a reference to confirm, but that seems like the most likely
explanation. Similar to how BigQuery behaves with the onlyReadLatest option.


>
> On 2020/11/06 20:43:03, Ismaël Mejía <[email protected]> wrote:
> > Thanks for the references Rui. I think it is worth to consider how
> > open source systems do it.
> > The great thing about this is that we could 'easily' map Piotr's work
> > for Bigtable to HBase too once it is done.

>
> > On Fri, Nov 6, 2020 at 8:22 PM Rui Wang <[email protected]> wrote:
> > >
> > > Another two references are from how Flink and Spark uses HBase by SQL:
>

Great point! I forgot that HBase is modeled after BigTable


> > >
> > >
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/hbase.html
> > > https://stackoverflow.com/questions/39530938/sparksql-on-hbase-tables
> > >
> > > -Rui
> > >
> > > On Thu, Nov 5, 2020 at 9:46 AM Piotr Szuberski <
> [email protected]> wrote:
> > >>
> > >> Thanks for the resources! I'll try to follow the BQ approach. I'd
> also add something like flattened schema so the user can use simple types
> only. It would be limited to BINARY values though. Something like:
> > >> CREATE EXTERNAL TABLE(
> > >>   key VARCHAR NOT NULL,
> > >>   family VARCHAR NOT NULL,
> > >>   column VARCHAR NOT NULL,
> > >>   value BINARY NOT NULL,
> > >>   timestampMicros BIGINT NOT NULL
> > >> )
> > >>  The cells array would be flattened (denormalized) and easy to use.
> In case of single-valued cells it would also be quite efficient.
> > >>
> > >> On 2020/11/05 00:22:44, Brian Hulette <[email protected]> wrote:
> > >> > I think we should take a look at how BigTable is integrated with
> other SQL
> > >> > systems. For example we could get some inspiration from BigQuery's
> support
> > >> > for querying BigTable data [1]. It looks like by default it uses
> something
> > >> > like (1), but they recognize this is difficult to process with SQL,
> so they
> > >> > have an option you can set to elevate certain columns as sub-fields
> (more
> > >> > like (2)), and you can also indicate you only want to get the
> latest value
> > >> > for each column.
> > >> >
> > >> > In any case this may be a good candidate for not requiring the user
> to
> > >> > actually specify a schema, and instead letting the table be fully
> > >> > determined by options.
> > >> >
> > >> > [1] https://cloud.google.com/bigquery/external-data-bigtable
> > >> >
> > >> > On Tue, Nov 3, 2020 at 11:41 PM Piotr Szuberski <
> [email protected]>
> > >> > wrote:
> > >> >
> > >> > > I've dug the topic a bit and I think the 2nd approach will fit
> better. The
> > >> > > schema in Bigtable is not supposed to change that often and
> specifying our
> > >> > > own schema is more SQL-like and will cause less potential trouble.
> > >> > >
> > >> > > On 2020/11/03 11:01:57, Piotr Szuberski <
> [email protected]>
> > >> > > wrote:
> > >> > > > I'm going to write Bigtable table for BeamSQL and I have a
> question
> > >> > > about the schema design, which one would be preferrable.
> > >> > > >
> > >> > > > Bigtable stores its data in a table with rows that contain a
> key and
> > >> > > 3-dimensional array where the 1st dimension is families with a
> names, 2nd
> > >> > > dimension is columns with qualifiers and the 3rd cells containing
> timestamp
> > >> > > and value.
> > >> > > >
> > >> > > > Two design solutions come to mind:
> > >> > > > 1) Fix schema to be a generic Bigtable row:
> > >> > > >
> > >> > > > Row(key, Array(Row(family, Array(Row(qualifier, Array(Row(value,
> > >> > > timestamp)))))))
> > >> > > >
> > >> > > > Then the table creation definition would always be in form:
> > >> > > >
> > >> > > > CREATE TABLE bigtableexample1()
> > >> > > > TYPE 'bigtable'
> > >> > > > LOCATION '
> > >> > >
> https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId
> > >> > > '
> > >> > > >
> > >> > > > 2) Let the user design his schema by providing the desired
> families and
> > >> > > columns it sth like:
> > >> > > > CREATE TABLE bigtableexample2(
> > >> > > >   key VARCHAR,
> > >> > > >   family1 ROW<
> > >> > > >     column1 ROW<
> > >> > > >       cells ARRAY<ROW<
> > >> > > >         value VARCHAR,
> > >> > > >         timestamp BIGINT
> > >> > > >       >>
> > >> > > >     >,
> > >> > > >     column2 ROW<
> > >> > > >       cells ARRAY<ROW<
> > >> > > >         value VARCHAR,
> > >> > > >         timestamp BIGINT
> > >> > > >       >>
> > >> > > >     >
> > >> > > >   >
> > >> > > > )
> > >> > > > TYPE 'bigtable'
> > >> > > > LOCATION '
> > >> > >
> https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId
> > >> > > '
> > >> > > >
> > >> > > > For me the 1st approach is more user friendly (typing schema
> from the
> > >> > > 2nd would be troublesome) and more elastic especially when the
> row's schema
> > >> > > (families and columns) changes and a user wants to perform
> 'SELECT * from
> > >> > > bigtableexampleX'.
> > >> > > >
> > >> > > > WDYT? I'd welcome any feedback. Maybe there is some 3rd option
> that will
> > >> > > be a better one?
> > >> > > >
> > >> > >
> > >> >
> >
>

Re: Bigtable for BeamSQL - question about the schema design

Reply via email to