On Tue, Nov 10, 2020 at 10:25 AM Brian Hulette <bhule...@google.com> wrote:
> > > On Tue, Nov 10, 2020 at 5:46 AM Piotr Szuberski < > piotr.szuber...@polidea.com> wrote: > >> Unfortunately according to the documentation, BeamSQL doesn't work well >> with ARRAY<COMPLEX_TYPE>, like ARRAY<ROW<>> which I confirmed empirically. >> >> > Is there a jira for this issue? > https://issues.apache.org/jira/browse/BEAM-10896 is the one that I am aware of. Though it says to aim to improve UNNEST, I think it could improve ARRAY<STRUCT> in general. Also like Kenneth mentioned, it might depend on vendored Calcite upgrade to at least 1.23.0 > > >> The only way to retrieve array's values was to get it by index, e.g. >> SELECT t.complex_array[1].row_field from some_table t; >> >> Unnest and just taking an array doesn't work with ARRAY<ROW>. >> >> I think that if a user wants to have a list of cells then it has to be >> truncated to "value" only and just the recent cell if it has to be ROW with >> timestamp, labels etc. >> >> This limitation doesn't happen in the flattened rows, but another one >> takes place - the value has to be of BINARY type and then parsed by user. >> >> I'll try to make it possibly elastic - ARRAY<ROW> limitation makes it >> much less straightforward. >> >> Thanks for the references! As I understand Flink cares only for the >> recent values of the column and ignores the timestamps and labels? I refer >> to this: >> family1 ROW<q1 INT> -- family1.q1 is the most recent cell value of the >> q1 column? >> > > I can't find a reference to confirm, but that seems like the most likely > explanation. Similar to how BigQuery behaves with the onlyReadLatest option. > > >> >> On 2020/11/06 20:43:03, Ismaël Mejía <ieme...@gmail.com> wrote: >> > Thanks for the references Rui. I think it is worth to consider how >> > open source systems do it. >> > The great thing about this is that we could 'easily' map Piotr's work >> > for Bigtable to HBase too once it is done. > > > >> > On Fri, Nov 6, 2020 at 8:22 PM Rui Wang <ruw...@google.com> wrote: >> > > >> > > Another two references are from how Flink and Spark uses HBase by SQL: >> > > Great point! I forgot that HBase is modeled after BigTable > > >> > > >> > > >> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/hbase.html >> > > https://stackoverflow.com/questions/39530938/sparksql-on-hbase-tables >> > > >> > > -Rui >> > > >> > > On Thu, Nov 5, 2020 at 9:46 AM Piotr Szuberski < >> piotr.szuber...@polidea.com> wrote: >> > >> >> > >> Thanks for the resources! I'll try to follow the BQ approach. I'd >> also add something like flattened schema so the user can use simple types >> only. It would be limited to BINARY values though. Something like: >> > >> CREATE EXTERNAL TABLE( >> > >> key VARCHAR NOT NULL, >> > >> family VARCHAR NOT NULL, >> > >> column VARCHAR NOT NULL, >> > >> value BINARY NOT NULL, >> > >> timestampMicros BIGINT NOT NULL >> > >> ) >> > >> The cells array would be flattened (denormalized) and easy to use. >> In case of single-valued cells it would also be quite efficient. >> > >> >> > >> On 2020/11/05 00:22:44, Brian Hulette <bhule...@google.com> wrote: >> > >> > I think we should take a look at how BigTable is integrated with >> other SQL >> > >> > systems. For example we could get some inspiration from BigQuery's >> support >> > >> > for querying BigTable data [1]. It looks like by default it uses >> something >> > >> > like (1), but they recognize this is difficult to process with >> SQL, so they >> > >> > have an option you can set to elevate certain columns as >> sub-fields (more >> > >> > like (2)), and you can also indicate you only want to get the >> latest value >> > >> > for each column. >> > >> > >> > >> > In any case this may be a good candidate for not requiring the >> user to >> > >> > actually specify a schema, and instead letting the table be fully >> > >> > determined by options. >> > >> > >> > >> > [1] https://cloud.google.com/bigquery/external-data-bigtable >> > >> > >> > >> > On Tue, Nov 3, 2020 at 11:41 PM Piotr Szuberski < >> piotr.szuber...@polidea.com> >> > >> > wrote: >> > >> > >> > >> > > I've dug the topic a bit and I think the 2nd approach will fit >> better. The >> > >> > > schema in Bigtable is not supposed to change that often and >> specifying our >> > >> > > own schema is more SQL-like and will cause less potential >> trouble. >> > >> > > >> > >> > > On 2020/11/03 11:01:57, Piotr Szuberski < >> piotr.szuber...@polidea.com> >> > >> > > wrote: >> > >> > > > I'm going to write Bigtable table for BeamSQL and I have a >> question >> > >> > > about the schema design, which one would be preferrable. >> > >> > > > >> > >> > > > Bigtable stores its data in a table with rows that contain a >> key and >> > >> > > 3-dimensional array where the 1st dimension is families with a >> names, 2nd >> > >> > > dimension is columns with qualifiers and the 3rd cells >> containing timestamp >> > >> > > and value. >> > >> > > > >> > >> > > > Two design solutions come to mind: >> > >> > > > 1) Fix schema to be a generic Bigtable row: >> > >> > > > >> > >> > > > Row(key, Array(Row(family, Array(Row(qualifier, >> Array(Row(value, >> > >> > > timestamp))))))) >> > >> > > > >> > >> > > > Then the table creation definition would always be in form: >> > >> > > > >> > >> > > > CREATE TABLE bigtableexample1() >> > >> > > > TYPE 'bigtable' >> > >> > > > LOCATION ' >> > >> > > >> https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId >> > >> > > ' >> > >> > > > >> > >> > > > 2) Let the user design his schema by providing the desired >> families and >> > >> > > columns it sth like: >> > >> > > > CREATE TABLE bigtableexample2( >> > >> > > > key VARCHAR, >> > >> > > > family1 ROW< >> > >> > > > column1 ROW< >> > >> > > > cells ARRAY<ROW< >> > >> > > > value VARCHAR, >> > >> > > > timestamp BIGINT >> > >> > > > >> >> > >> > > > >, >> > >> > > > column2 ROW< >> > >> > > > cells ARRAY<ROW< >> > >> > > > value VARCHAR, >> > >> > > > timestamp BIGINT >> > >> > > > >> >> > >> > > > > >> > >> > > > > >> > >> > > > ) >> > >> > > > TYPE 'bigtable' >> > >> > > > LOCATION ' >> > >> > > >> https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId >> > >> > > ' >> > >> > > > >> > >> > > > For me the 1st approach is more user friendly (typing schema >> from the >> > >> > > 2nd would be troublesome) and more elastic especially when the >> row's schema >> > >> > > (families and columns) changes and a user wants to perform >> 'SELECT * from >> > >> > > bigtableexampleX'. >> > >> > > > >> > >> > > > WDYT? I'd welcome any feedback. Maybe there is some 3rd option >> that will >> > >> > > be a better one? >> > >> > > > >> > >> > > >> > >> > >> > >> >