Thanks for the references, Rui. I think it is worth considering how
open source systems do it.
The great thing about this is that we could 'easily' map Piotr's work
for Bigtable to HBase too once it is done.

On Fri, Nov 6, 2020 at 8:22 PM Rui Wang <ruw...@google.com> wrote:
>
> Two more references show how Flink and Spark use HBase via SQL:
>
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/hbase.html
> https://stackoverflow.com/questions/39530938/sparksql-on-hbase-tables
>
> -Rui
>
> On Thu, Nov 5, 2020 at 9:46 AM Piotr Szuberski <piotr.szuber...@polidea.com> 
> wrote:
>>
>> Thanks for the resources! I'll try to follow the BQ approach. I'd also add 
>> something like a flattened schema so the user can use simple types only. It 
>> would be limited to BINARY values though. Something like:
>> CREATE EXTERNAL TABLE(
>>   key VARCHAR NOT NULL,
>>   family VARCHAR NOT NULL,
>>   column VARCHAR NOT NULL,
>>   value BINARY NOT NULL,
>>   timestampMicros BIGINT NOT NULL
>> )
>> The cells array would be flattened (denormalized) and easy to use. In the 
>> case of single-valued cells it would also be quite efficient.
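>>
>> A query over such a flattened table could then stay very simple, e.g. (just 
>> a sketch; the table name and filter values are made up, and the backticks 
>> assume Beam SQL's identifier quoting for the reserved words COLUMN and VALUE):
>>
>> SELECT key, `value`, timestampMicros
>> FROM bigtableFlat
>> WHERE family = 'stats' AND `column` = 'clicks'
>> ORDER BY timestampMicros DESC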
>>
>> On 2020/11/05 00:22:44, Brian Hulette <bhule...@google.com> wrote:
>> > I think we should take a look at how BigTable is integrated with other SQL
>> > systems. For example, we could get some inspiration from BigQuery's support
>> > for querying BigTable data [1]. It looks like by default it uses something
>> > like (1), but they recognize this is difficult to process with SQL, so they
>> > have an option you can set to elevate certain columns as sub-fields (more
>> > like (2)), and you can also indicate you only want to get the latest value
>> > for each column.
>> >
>> > In any case this may be a good candidate for not requiring the user to
>> > actually specify a schema, and instead letting the table be fully
>> > determined by options.
>> >
>> > [1] https://cloud.google.com/bigquery/external-data-bigtable
>> >
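>> > For illustration, an options-driven definition could look roughly like the
>> > following (only a sketch; the property names are hypothetical, not an
>> > existing Beam SQL API):
>> >
>> > CREATE EXTERNAL TABLE bigtableexample3()
>> > TYPE 'bigtable'
>> > LOCATION 'https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId'
>> > TBLPROPERTIES '{"columnFamilies": ["stats"], "onlyReadLatest": true}'
>> >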
>> > On Tue, Nov 3, 2020 at 11:41 PM Piotr Szuberski 
>> > <piotr.szuber...@polidea.com>
>> > wrote:
>> >
>> > > I've dug into the topic a bit and I think the 2nd approach will fit
>> > > better. The schema in Bigtable is not supposed to change that often, and
>> > > specifying our own schema is more SQL-like and should cause less trouble.
>> > >
>> > > On 2020/11/03 11:01:57, Piotr Szuberski <piotr.szuber...@polidea.com>
>> > > wrote:
>> > > > I'm going to write a Bigtable table for BeamSQL and I have a question
>> > > > about the schema design: which one would be preferable?
>> > > >
>> > > > Bigtable stores its data in a table with rows that contain a key and a
>> > > > 3-dimensional array, where the 1st dimension is families with names, the
>> > > > 2nd dimension is columns with qualifiers, and the 3rd is cells containing
>> > > > a timestamp and a value.
>> > > >
>> > > > Two design solutions come to mind:
>> > > > 1) Fix schema to be a generic Bigtable row:
>> > > >
>> > > > Row(key, Array(Row(family, Array(Row(qualifier, Array(Row(value, timestamp)))))))
>> > > >
>> > > > Then the table creation definition would always be in form:
>> > > >
>> > > > CREATE TABLE bigtableexample1()
>> > > > TYPE 'bigtable'
>> > > > LOCATION 'https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId'
>> > > >
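>> > > > Reading individual cells out of that generic row would then need nested
>> > > > unnesting, roughly (a sketch only; it assumes Calcite-style UNNEST is
>> > > > available in Beam SQL, and the array field names families/columns/cells
>> > > > are made up for illustration):
>> > > >
>> > > > SELECT t.key, c.`value`, c.`timestamp`
>> > > > FROM bigtableexample1 AS t
>> > > > CROSS JOIN UNNEST(t.families) AS f
>> > > > CROSS JOIN UNNEST(f.columns) AS q
>> > > > CROSS JOIN UNNEST(q.cells) AS c
>> > > >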
>> > > > 2) Let the user define their own schema by providing the desired families
>> > > > and columns, in something like:
>> > > > CREATE TABLE bigtableexample2(
>> > > >   key VARCHAR,
>> > > >   family1 ROW<
>> > > >     column1 ROW<
>> > > >       cells ARRAY<ROW<
>> > > >         value VARCHAR,
>> > > >         timestamp BIGINT
>> > > >       >>
>> > > >     >,
>> > > >     column2 ROW<
>> > > >       cells ARRAY<ROW<
>> > > >         value VARCHAR,
>> > > >         timestamp BIGINT
>> > > >       >>
>> > > >     >
>> > > >   >
>> > > > )
>> > > > TYPE 'bigtable'
>> > > > LOCATION 'https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId'
>> > > >
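>> > > > With the 2nd approach a query can address columns directly, e.g. (a
>> > > > sketch; it assumes Beam SQL's dot access on ROW fields and 1-based array
>> > > > indexing, and `value` is backtick-quoted as a reserved word):
>> > > >
>> > > > SELECT key, family1.column1.cells[1].`value` AS latestValue
>> > > > FROM bigtableexample2
>> > > >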
>> > > > To me the 1st approach is more user-friendly (typing the schema for the
>> > > > 2nd would be troublesome) and more flexible, especially when the row's
>> > > > schema (families and columns) changes and a user wants to perform
>> > > > 'SELECT * FROM bigtableexampleX'.
>> > > >
>> > > > WDYT? I'd welcome any feedback. Maybe there is some 3rd option that
>> > > > would be a better fit?
>> > > >
>> > >
>> >
