Re: Bigtable for BeamSQL - question about the schema design

Rui Wang Mon, 16 Nov 2020 14:40:40 -0800

On Tue, Nov 10, 2020 at 10:25 AM Brian Hulette <[email protected]> wrote:


>
>
> On Tue, Nov 10, 2020 at 5:46 AM Piotr Szuberski <
> [email protected]> wrote:
>
>> Unfortunately according to the documentation, BeamSQL doesn't work well
>> with ARRAY<COMPLEX_TYPE>, like ARRAY<ROW<>> which I confirmed empirically.
>>
>>
> Is there a jira for this issue?
>

https://issues.apache.org/jira/browse/BEAM-10896 is the one that I am aware
of. Though it says to aim to improve UNNEST, I think it could improve
ARRAY<STRUCT> in general.  Also like Kenneth mentioned, it might depend on
vendored Calcite upgrade to at least 1.23.0


>
>
>> The only way to retrieve array's values was to get it by index, e.g.
>> SELECT t.complex_array[1].row_field from some_table t;
>>
>> Unnest and just taking an array doesn't work with ARRAY<ROW>.
>>
>> I think that if a user wants to have a list of cells then it has to be
>> truncated to "value" only and just the recent cell if it has to be ROW with
>> timestamp, labels etc.
>>
>> This limitation doesn't happen in the flattened rows, but another one
>> takes place - the value has to be of BINARY type and then parsed by user.
>>
>> I'll try to make it possibly elastic - ARRAY<ROW> limitation makes it
>> much less straightforward.
>>
>> Thanks for the references! As I understand Flink cares only for the
>> recent values of the column and ignores the timestamps and labels? I refer
>> to this:
>> family1 ROW<q1 INT>  -- family1.q1 is the most recent cell value of the
>> q1 column?
>>
>
> I can't find a reference to confirm, but that seems like the most likely
> explanation. Similar to how BigQuery behaves with the onlyReadLatest option.
>
>
>>
>> On 2020/11/06 20:43:03, Ismaël Mejía <[email protected]> wrote:
>> > Thanks for the references Rui. I think it is worth to consider how
>> > open source systems do it.
>> > The great thing about this is that we could 'easily' map Piotr's work
>> > for Bigtable to HBase too once it is done.
>
> >
>> > On Fri, Nov 6, 2020 at 8:22 PM Rui Wang <[email protected]> wrote:
>> > >
>> > > Another two references are from how Flink and Spark uses HBase by SQL:
>>
>
> Great point! I forgot that HBase is modeled after BigTable
>
>
>> > >
>> > >
>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/hbase.html
>> > > https://stackoverflow.com/questions/39530938/sparksql-on-hbase-tables
>> > >
>> > > -Rui
>> > >
>> > > On Thu, Nov 5, 2020 at 9:46 AM Piotr Szuberski <
>> [email protected]> wrote:
>> > >>
>> > >> Thanks for the resources! I'll try to follow the BQ approach. I'd
>> also add something like flattened schema so the user can use simple types
>> only. It would be limited to BINARY values though. Something like:
>> > >> CREATE EXTERNAL TABLE(
>> > >>   key VARCHAR NOT NULL,
>> > >>   family VARCHAR NOT NULL,
>> > >>   column VARCHAR NOT NULL,
>> > >>   value BINARY NOT NULL,
>> > >>   timestampMicros BIGINT NOT NULL
>> > >> )
>> > >>  The cells array would be flattened (denormalized) and easy to use.
>> In case of single-valued cells it would also be quite efficient.
>> > >>
>> > >> On 2020/11/05 00:22:44, Brian Hulette <[email protected]> wrote:
>> > >> > I think we should take a look at how BigTable is integrated with
>> other SQL
>> > >> > systems. For example we could get some inspiration from BigQuery's
>> support
>> > >> > for querying BigTable data [1]. It looks like by default it uses
>> something
>> > >> > like (1), but they recognize this is difficult to process with
>> SQL, so they
>> > >> > have an option you can set to elevate certain columns as
>> sub-fields (more
>> > >> > like (2)), and you can also indicate you only want to get the
>> latest value
>> > >> > for each column.
>> > >> >
>> > >> > In any case this may be a good candidate for not requiring the
>> user to
>> > >> > actually specify a schema, and instead letting the table be fully
>> > >> > determined by options.
>> > >> >
>> > >> > [1] https://cloud.google.com/bigquery/external-data-bigtable
>> > >> >
>> > >> > On Tue, Nov 3, 2020 at 11:41 PM Piotr Szuberski <
>> [email protected]>
>> > >> > wrote:
>> > >> >
>> > >> > > I've dug the topic a bit and I think the 2nd approach will fit
>> better. The
>> > >> > > schema in Bigtable is not supposed to change that often and
>> specifying our
>> > >> > > own schema is more SQL-like and will cause less potential
>> trouble.
>> > >> > >
>> > >> > > On 2020/11/03 11:01:57, Piotr Szuberski <
>> [email protected]>
>> > >> > > wrote:
>> > >> > > > I'm going to write Bigtable table for BeamSQL and I have a
>> question
>> > >> > > about the schema design, which one would be preferrable.
>> > >> > > >
>> > >> > > > Bigtable stores its data in a table with rows that contain a
>> key and
>> > >> > > 3-dimensional array where the 1st dimension is families with a
>> names, 2nd
>> > >> > > dimension is columns with qualifiers and the 3rd cells
>> containing timestamp
>> > >> > > and value.
>> > >> > > >
>> > >> > > > Two design solutions come to mind:
>> > >> > > > 1) Fix schema to be a generic Bigtable row:
>> > >> > > >
>> > >> > > > Row(key, Array(Row(family, Array(Row(qualifier,
>> Array(Row(value,
>> > >> > > timestamp)))))))
>> > >> > > >
>> > >> > > > Then the table creation definition would always be in form:
>> > >> > > >
>> > >> > > > CREATE TABLE bigtableexample1()
>> > >> > > > TYPE 'bigtable'
>> > >> > > > LOCATION '
>> > >> > >
>> https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId
>> > >> > > '
>> > >> > > >
>> > >> > > > 2) Let the user design his schema by providing the desired
>> families and
>> > >> > > columns it sth like:
>> > >> > > > CREATE TABLE bigtableexample2(
>> > >> > > >   key VARCHAR,
>> > >> > > >   family1 ROW<
>> > >> > > >     column1 ROW<
>> > >> > > >       cells ARRAY<ROW<
>> > >> > > >         value VARCHAR,
>> > >> > > >         timestamp BIGINT
>> > >> > > >       >>
>> > >> > > >     >,
>> > >> > > >     column2 ROW<
>> > >> > > >       cells ARRAY<ROW<
>> > >> > > >         value VARCHAR,
>> > >> > > >         timestamp BIGINT
>> > >> > > >       >>
>> > >> > > >     >
>> > >> > > >   >
>> > >> > > > )
>> > >> > > > TYPE 'bigtable'
>> > >> > > > LOCATION '
>> > >> > >
>> https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId
>> > >> > > '
>> > >> > > >
>> > >> > > > For me the 1st approach is more user friendly (typing schema
>> from the
>> > >> > > 2nd would be troublesome) and more elastic especially when the
>> row's schema
>> > >> > > (families and columns) changes and a user wants to perform
>> 'SELECT * from
>> > >> > > bigtableexampleX'.
>> > >> > > >
>> > >> > > > WDYT? I'd welcome any feedback. Maybe there is some 3rd option
>> that will
>> > >> > > be a better one?
>> > >> > > >
>> > >> > >
>> > >> >
>> >
>>
>

Re: Bigtable for BeamSQL - question about the schema design

Reply via email to