Two more references showing how Flink and Spark use HBase via SQL:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/hbase.html
https://stackoverflow.com/questions/39530938/sparksql-on-hbase-tables
-Rui

On Thu, Nov 5, 2020 at 9:46 AM Piotr Szuberski <piotr.szuber...@polidea.com> wrote:
> Thanks for the resources! I'll try to follow the BQ approach. I'd also add
> something like a flattened schema so the user can use simple types only. It
> would be limited to BINARY values, though. Something like:
>
> CREATE EXTERNAL TABLE(
>   key VARCHAR NOT NULL,
>   family VARCHAR NOT NULL,
>   column VARCHAR NOT NULL,
>   value BINARY NOT NULL,
>   timestampMicros BIGINT NOT NULL
> )
>
> The cells array would be flattened (denormalized) and easy to use. In the
> case of single-valued cells it would also be quite efficient.
>
> On 2020/11/05 00:22:44, Brian Hulette <bhule...@google.com> wrote:
> > I think we should take a look at how BigTable is integrated with other SQL
> > systems. For example, we could get some inspiration from BigQuery's support
> > for querying BigTable data [1]. It looks like by default it uses something
> > like (1), but they recognize this is difficult to process with SQL, so they
> > have an option you can set to elevate certain columns as sub-fields (more
> > like (2)), and you can also indicate you only want to get the latest value
> > for each column.
> >
> > In any case this may be a good candidate for not requiring the user to
> > actually specify a schema, and instead letting the table be fully
> > determined by options.
> >
> > [1] https://cloud.google.com/bigquery/external-data-bigtable
> >
> > On Tue, Nov 3, 2020 at 11:41 PM Piotr Szuberski <piotr.szuber...@polidea.com> wrote:
> > > I've dug into the topic a bit and I think the 2nd approach will fit
> > > better. The schema in Bigtable is not supposed to change that often, and
> > > specifying our own schema is more SQL-like and will cause less potential
> > > trouble.
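To illustrate the flattened-schema proposal, a query that keeps only the latest cell per column could look like the sketch below. This is only an illustration: the table name `bigtableFlattened` is hypothetical, and the correlated-subquery form is an assumption about what the dialect would accept, not a confirmed Beam SQL feature.

```sql
-- Sketch only: 'bigtableFlattened' is a hypothetical table using the
-- flattened schema above. Each (key, family, column, value,
-- timestampMicros) cell is its own row, so "latest value per column"
-- needs no UNNEST, just a filter on the maximum timestamp.
SELECT t.key, t.family, t.`column`, t.value
FROM bigtableFlattened t
WHERE t.timestampMicros = (
  SELECT MAX(s.timestampMicros)
  FROM bigtableFlattened s
  WHERE s.key = t.key
    AND s.family = t.family
    AND s.`column` = t.`column`
);
```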
> > > On 2020/11/03 11:01:57, Piotr Szuberski <piotr.szuber...@polidea.com> wrote:
> > > > I'm going to write a Bigtable table for BeamSQL and I have a question
> > > > about the schema design: which one would be preferable.
> > > >
> > > > Bigtable stores its data in a table with rows that contain a key and a
> > > > 3-dimensional array, where the 1st dimension is families with names, the
> > > > 2nd dimension is columns with qualifiers, and the 3rd is cells containing
> > > > a timestamp and a value.
> > > >
> > > > Two design solutions come to mind:
> > > >
> > > > 1) Fix the schema to be a generic Bigtable row:
> > > >
> > > > Row(key, Array(Row(family, Array(Row(qualifier, Array(Row(value, timestamp)))))))
> > > >
> > > > Then the table creation definition would always be of the form:
> > > >
> > > > CREATE TABLE bigtableexample1()
> > > > TYPE 'bigtable'
> > > > LOCATION 'https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId'
> > > >
> > > > 2) Let the user design his schema by providing the desired families and
> > > > columns, in something like:
> > > >
> > > > CREATE TABLE bigtableexample2(
> > > >   key VARCHAR,
> > > >   family1 ROW<
> > > >     column1 ROW<
> > > >       cells ARRAY<ROW<
> > > >         value VARCHAR,
> > > >         timestamp BIGINT
> > > >       >>
> > > >     >,
> > > >     column2 ROW<
> > > >       cells ARRAY<ROW<
> > > >         value VARCHAR,
> > > >         timestamp BIGINT
> > > >       >>
> > > >     >
> > > >   >
> > > > )
> > > > TYPE 'bigtable'
> > > > LOCATION 'https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId'
> > > >
> > > > For me the 1st approach is more user friendly (typing the schema from
> > > > the 2nd would be troublesome) and more elastic, especially when the row's
> > > > schema (families and columns) changes and a user wants to perform
> > > > 'SELECT * from bigtableexampleX'.
> > > >
> > > > WDYT? I'd welcome any feedback. Maybe there is some 3rd option that
> > > > would be better?
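For comparison, reading individual cells out of the generic nested schema from option (1) would mean unnesting every level of the 3-dimensional array. A rough sketch, assuming Calcite-style UNNEST is available and using illustrative field names (`families`, `columns`, `cells` are not confirmed names from the proposal):

```sql
-- Sketch only: field names are illustrative, following the shape
-- Row(key, Array(Row(family, Array(Row(qualifier, Array(Row(value, timestamp))))))).
-- Each array level must be unnested explicitly to reach the cell values.
SELECT t.key, f.family, q.qualifier, c.value, c.timestamp
FROM bigtableexample1 t,
     UNNEST(t.families) AS f,
     UNNEST(f.columns) AS q,
     UNNEST(q.cells) AS c;
```

This verbosity on the read path is the flip side of the generic schema's convenience at table-creation time, which is part of the trade-off being discussed in the thread.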