Two more references showing how Flink and Spark use HBase via SQL:

https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/hbase.html
https://stackoverflow.com/questions/39530938/sparksql-on-hbase-tables
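
From memory of those docs, Flink's HBase SQL connector maps each column
family to a ROW type, roughly like the sketch below (the table name,
family/qualifier names and connector option values are illustrative, not
copied from the docs):

CREATE TABLE hTable (
  rowkey INT,
  family1 ROW<q1 INT, q2 STRING>,
  PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
  'connector' = 'hbase-1.4',
  'table-name' = 'mytable',
  'zookeeper.quorum' = 'localhost:2181'
);

That is close to option (2) discussed below, with the schema supplied by
the user.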

-Rui

On Thu, Nov 5, 2020 at 9:46 AM Piotr Szuberski <piotr.szuber...@polidea.com>
wrote:

> Thanks for the resources! I'll try to follow the BQ approach. I'd also add
> something like a flattened schema so the user can use simple types only. It
> would be limited to BINARY values, though. Something like:
> CREATE EXTERNAL TABLE(
>   key VARCHAR NOT NULL,
>   family VARCHAR NOT NULL,
>   column VARCHAR NOT NULL,
>   value BINARY NOT NULL,
>   timestampMicros BIGINT NOT NULL
> )
> The cells array would be flattened (denormalized) and easy to use. In the
> case of single-valued cells it would also be quite efficient.
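>
> A query against that flattened layout could then stay plain SQL, e.g.
> (the table name and filter values here are just illustrative):
>
> SELECT key, value, timestampMicros
> FROM bigtableFlat
> WHERE family = 'stats' AND column = 'clicks';
>
> Since every cell becomes its own row, ordinary WHERE/GROUP BY clauses
> work without unnesting anything.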
>
> On 2020/11/05 00:22:44, Brian Hulette <bhule...@google.com> wrote:
> > I think we should take a look at how BigTable is integrated with other
> > SQL systems. For example we could get some inspiration from BigQuery's
> > support for querying BigTable data [1]. It looks like by default it uses
> > something like (1), but they recognize this is difficult to process with
> > SQL, so they have an option you can set to elevate certain columns as
> > sub-fields (more like (2)), and you can also indicate you only want to
> > get the latest value for each column.
> >
> > In any case this may be a good candidate for not requiring the user to
> > actually specify a schema, and instead letting the table be fully
> > determined by options.
> >
> > [1] https://cloud.google.com/bigquery/external-data-bigtable
> >
> > On Tue, Nov 3, 2020 at 11:41 PM Piotr Szuberski <piotr.szuber...@polidea.com>
> > wrote:
> >
> > > I've dug into the topic a bit and I think the 2nd approach will fit
> > > better. The schema in Bigtable is not supposed to change that often,
> > > and specifying our own schema is more SQL-like and will cause less
> > > potential trouble.
> > >
> > > On 2020/11/03 11:01:57, Piotr Szuberski <piotr.szuber...@polidea.com>
> > > wrote:
> > > > I'm going to write a Bigtable table for BeamSQL and I have a
> > > > question about the schema design: which one would be preferable.
> > > >
> > > > Bigtable stores its data in a table whose rows contain a key and a
> > > > 3-dimensional array, where the 1st dimension is families with names,
> > > > the 2nd is columns with qualifiers, and the 3rd is cells containing a
> > > > timestamp and a value.
> > > >
> > > > Two design solutions come to mind:
> > > > 1) Fix schema to be a generic Bigtable row:
> > > >
> > > > Row(key, Array(Row(family, Array(Row(qualifier, Array(Row(value, timestamp)))))))
> > > >
> > > > Then the table creation definition would always be of the form:
> > > >
> > > > CREATE TABLE bigtableexample1()
> > > > TYPE 'bigtable'
> > > > LOCATION 'https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId'
> > > >
> > > > 2) Let the user design their schema by providing the desired
> > > > families and columns, something like:
> > > > CREATE TABLE bigtableexample2(
> > > >   key VARCHAR,
> > > >   family1 ROW<
> > > >     column1 ROW<
> > > >       cells ARRAY<ROW<
> > > >         value VARCHAR,
> > > >         timestamp BIGINT
> > > >       >>
> > > >     >,
> > > >     column2 ROW<
> > > >       cells ARRAY<ROW<
> > > >         value VARCHAR,
> > > >         timestamp BIGINT
> > > >       >>
> > > >     >
> > > >   >
> > > > )
> > > > TYPE 'bigtable'
> > > > LOCATION 'https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId'
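> > > >
> > > > For comparison, reading cells out of the nested layout in (2) would
> > > > need the array unnested first, roughly like below (the exact UNNEST
> > > > syntax depends on the SQL dialect):
> > > >
> > > > SELECT t.key, c.value, c.timestamp
> > > > FROM bigtableexample2 t
> > > > CROSS JOIN UNNEST(t.family1.column1.cells) AS c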
> > > >
> > > > For me the 1st approach is more user-friendly (typing the schema
> > > > from the 2nd would be troublesome) and more flexible, especially when
> > > > the row's schema (families and columns) changes and a user wants to
> > > > perform 'SELECT * FROM bigtableexampleX'.
> > > >
> > > > WDYT? I'd welcome any feedback. Maybe there is some 3rd option that
> > > > would be better?
> > > >
> > >
> >
>
