On Tue, Nov 10, 2020 at 5:46 AM Piotr Szuberski <[email protected]> wrote:
> Unfortunately according to the documentation, BeamSQL doesn't work well > with ARRAY<COMPLEX_TYPE>, like ARRAY<ROW<>> which I confirmed empirically. > > Is there a jira for this issue? > The only way to retrieve array's values was to get it by index, e.g. > SELECT t.complex_array[1].row_field from some_table t; > > Unnest and just taking an array doesn't work with ARRAY<ROW>. > > I think that if a user wants to have a list of cells then it has to be > truncated to "value" only and just the recent cell if it has to be ROW with > timestamp, labels etc. > > This limitation doesn't happen in the flattened rows, but another one > takes place - the value has to be of BINARY type and then parsed by user. > > I'll try to make it possibly elastic - ARRAY<ROW> limitation makes it much > less straightforward. > > Thanks for the references! As I understand Flink cares only for the recent > values of the column and ignores the timestamps and labels? I refer to > this: > family1 ROW<q1 INT> -- family1.q1 is the most recent cell value of the q1 > column? > I can't find a reference to confirm, but that seems like the most likely explanation. Similar to how BigQuery behaves with the onlyReadLatest option. > > On 2020/11/06 20:43:03, Ismaël Mejía <[email protected]> wrote: > > Thanks for the references Rui. I think it is worth to consider how > > open source systems do it. > > The great thing about this is that we could 'easily' map Piotr's work > > for Bigtable to HBase too once it is done. > > > On Fri, Nov 6, 2020 at 8:22 PM Rui Wang <[email protected]> wrote: > > > > > > Another two references are from how Flink and Spark uses HBase by SQL: > Great point! I forgot that HBase is modeled after BigTable > > > > > > > https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/hbase.html > > > https://stackoverflow.com/questions/39530938/sparksql-on-hbase-tables > > > > > > -Rui > > > > > > On Thu, Nov 5, 2020 at 9:46 AM Piotr Szuberski < > [email protected]> wrote: > > >> > > >> Thanks for the resources! I'll try to follow the BQ approach. I'd > also add something like flattened schema so the user can use simple types > only. It would be limited to BINARY values though. Something like: > > >> CREATE EXTERNAL TABLE( > > >> key VARCHAR NOT NULL, > > >> family VARCHAR NOT NULL, > > >> column VARCHAR NOT NULL, > > >> value BINARY NOT NULL, > > >> timestampMicros BIGINT NOT NULL > > >> ) > > >> The cells array would be flattened (denormalized) and easy to use. > In case of single-valued cells it would also be quite efficient. > > >> > > >> On 2020/11/05 00:22:44, Brian Hulette <[email protected]> wrote: > > >> > I think we should take a look at how BigTable is integrated with > other SQL > > >> > systems. For example we could get some inspiration from BigQuery's > support > > >> > for querying BigTable data [1]. It looks like by default it uses > something > > >> > like (1), but they recognize this is difficult to process with SQL, > so they > > >> > have an option you can set to elevate certain columns as sub-fields > (more > > >> > like (2)), and you can also indicate you only want to get the > latest value > > >> > for each column. > > >> > > > >> > In any case this may be a good candidate for not requiring the user > to > > >> > actually specify a schema, and instead letting the table be fully > > >> > determined by options. > > >> > > > >> > [1] https://cloud.google.com/bigquery/external-data-bigtable > > >> > > > >> > On Tue, Nov 3, 2020 at 11:41 PM Piotr Szuberski < > [email protected]> > > >> > wrote: > > >> > > > >> > > I've dug the topic a bit and I think the 2nd approach will fit > better. The > > >> > > schema in Bigtable is not supposed to change that often and > specifying our > > >> > > own schema is more SQL-like and will cause less potential trouble. > > >> > > > > >> > > On 2020/11/03 11:01:57, Piotr Szuberski < > [email protected]> > > >> > > wrote: > > >> > > > I'm going to write Bigtable table for BeamSQL and I have a > question > > >> > > about the schema design, which one would be preferrable. > > >> > > > > > >> > > > Bigtable stores its data in a table with rows that contain a > key and > > >> > > 3-dimensional array where the 1st dimension is families with a > names, 2nd > > >> > > dimension is columns with qualifiers and the 3rd cells containing > timestamp > > >> > > and value. > > >> > > > > > >> > > > Two design solutions come to mind: > > >> > > > 1) Fix schema to be a generic Bigtable row: > > >> > > > > > >> > > > Row(key, Array(Row(family, Array(Row(qualifier, Array(Row(value, > > >> > > timestamp))))))) > > >> > > > > > >> > > > Then the table creation definition would always be in form: > > >> > > > > > >> > > > CREATE TABLE bigtableexample1() > > >> > > > TYPE 'bigtable' > > >> > > > LOCATION ' > > >> > > > https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId > > >> > > ' > > >> > > > > > >> > > > 2) Let the user design his schema by providing the desired > families and > > >> > > columns it sth like: > > >> > > > CREATE TABLE bigtableexample2( > > >> > > > key VARCHAR, > > >> > > > family1 ROW< > > >> > > > column1 ROW< > > >> > > > cells ARRAY<ROW< > > >> > > > value VARCHAR, > > >> > > > timestamp BIGINT > > >> > > > >> > > >> > > > >, > > >> > > > column2 ROW< > > >> > > > cells ARRAY<ROW< > > >> > > > value VARCHAR, > > >> > > > timestamp BIGINT > > >> > > > >> > > >> > > > > > > >> > > > > > > >> > > > ) > > >> > > > TYPE 'bigtable' > > >> > > > LOCATION ' > > >> > > > https://googleapis.com/bigtable/projects/projectId/instances/instanceId/tables/tableId > > >> > > ' > > >> > > > > > >> > > > For me the 1st approach is more user friendly (typing schema > from the > > >> > > 2nd would be troublesome) and more elastic especially when the > row's schema > > >> > > (families and columns) changes and a user wants to perform > 'SELECT * from > > >> > > bigtableexampleX'. > > >> > > > > > >> > > > WDYT? I'd welcome any feedback. Maybe there is some 3rd option > that will > > >> > > be a better one? > > >> > > > > > >> > > > > >> > > > >
