Hi, Congrat for the 1.11 release, we are happy to have our suggestion implemented in the new release (automatic HDFS block size for parquet files).
It seems like we are pushing the limit of Drill with new type query...(I am learning new SQL trick in the process) We are trying to aggregate a json document based on a nested value. Document looks like this: { "field1" : { "f1_a" : "infoa", "f1_b" : "infob" }, "field2" : "very long string", "field3" : { "f3_a" : "infoc", "f3_b" : "infod", "f4_c" : { .... } }, "field4" : { "key_data" : "String to aggregate on", "f4_b" : "a string2", "f4_c" : { .... complex structure... } } } We want a first, or last (or any) occurrence of field1, field2, field3 and field4 group by field4.key_data; Unfortunately min, max function does not support json complex column (MapHolder). Therefor group by type of queries do not work. We tried a window function like this create table .... as ( select first_value(tb1.field1) over (partition by tb1.field4.key_data) as field1, first_value(tb1.field2) over (partition by tb1.field4.key_data) as field2, first_value(tb1.field3) over (partition by tb1.field4.key_data) as field3, first_value(tb1.field4) over (partition by tb1.field4.key_data) as field4 from dfs.`doc.json` tb1; ) We get IndexOutOfBoundException. We got better success with: create table .... as ( select * from (select tb1.*, row_number() over (partition by tb1.field4.key_data) as row_num from dfs.`doc.json` tb1 ) t where t.row_num = 1 ) This works on single json file or with multiple file in a session configured with planner.width_max_per_node=1. As soon as we put more than 1 thread per query, We get IndexOutOfBoundException. This was tried on 1.10 and 1.11. It looks like a bug. Would you have other suggestion to bypass that issue? Is there an existing aggregation function (to work with group by) that would return the first,last, or random MapHolder column from json document? If not, I am thinking of implementing one, would there be an example on how to Clone a MapHolder within a function? (pretty sure I can't assign "in" param to output within a function) Thank you for your time reading this. any suggestions to try are welcome Francois