Hi Ben,
Drill is a SQL relational engine that was enhanced to support extended JSON
types, including arrays. The need to use only SQL to work with array imposes
some major constraints.
As you noted, there is no way to project specific array items while leaving
them in an array. When you project array elements, as in your example, you
create new, top-level scalar columns. So:
SELECT columns[0] AS `name`, columns[1] AS `address` ...
Is a common use case: it not only projects certain array entries, it projects
them as new, named, top-level columns. There is no way to say: "give me a new
array with just elements 0 and 1 from the old array." Note that this problem
also occurs for maps.
Now, is this a problem? Not really. The primary clients for Drill are JDBC and
OBDC, neither of which can handle arrays. (The native Drill client, which
almost no one uses, can handle arrays and maps.) So, somewhere in the query,
maps and arrays must be "flattened" to scalar columns. So, it is actually handy
that the projection shown above both identifies the array items you want, AND
puts them into a form that ODBC/JDBC can use.
Your example, by the way, is specific to CSV (really, any delimited text file)
that does not have headers. It is natural to want to project those columns into
something named. SQL is much easier to read with "`name` = 'Bob'" than
"columns[0] = 'Bob'". As you noted, Parquet also supports arrays as does JSON.
As Charles pointed out, there are a number of array-aware functions available.
If you can't find what you need, recall that Drill is open: you can write your
own UDF. Then, you can do as Charles and others have done: contribute them to
Drill to enrich the set of operations available to others. We Drill core
developers are good at the internals, but you, Charles and others better
understand the kind of operations you want to apply to your data. We need your
help to build out the set of functions to solve your needs.
All that said, what are you trying to accomplish? Perhaps we can help you solve
your specific challenge.
Thanks,
- Paul
On Wednesday, January 9, 2019, 7:07:05 AM PST, [email protected]
<[email protected]> wrote:
Hi,
Although it's possible to work with ARRAY in DRILL :
SELECT columns[0], columns[1]... FROM ...;
I don't find any array function to know length or dimension of array or any
other useful fct
I don't find any possibility to "build" my array with different data like :
SELECT ARRAY[data1, data2, data3] FROM ...;
Aggregate function that build array doesn't exists
Example : SELECT mycolumn1, array_agg(mycolumn2) FROM mytable GROUP BY
mycolomn1;
Or would I miss something ?
I know that array type doesn't exists for Parquet files but these capacities
will be very useful
and the limitation of Parquet should not limit the possibilities of DRILL.
Appreciate any info or return on this subject ?
Regards,