Dedupping json records based on nested value

François Méthot Wed, 30 Aug 2017 13:58:27 -0700

Hi,

Congrat for the 1.11 release, we are happy to have our suggestion
implemented in the new release (automatic HDFS block size for parquet
files).


It seems like we are pushing the limit of Drill with new type query...(I am
learning new SQL trick in the process)

We are trying to aggregate a json document based on a nested value.

Document looks like this:

{
 "field1" : {
     "f1_a" : "infoa",
     "f1_b" : "infob"
  },
 "field2" : "very long string",
 "field3" : {
     "f3_a" : "infoc",
     "f3_b" : "infod",
     "f4_c" : {
          ....
      }
  },
  "field4" : {
     "key_data" : "String to aggregate on",
     "f4_b" : "a string2",
     "f4_c" : {
          .... complex structure...
      }
  }
}


We want a first, or last (or any) occurrence of field1, field2, field3 and
field4 group by field4.key_data;


Unfortunately min, max function does not support json complex column
(MapHolder). Therefor group by type of queries do not work.

We tried a window function like this
create table .... as (
  select first_value(tb1.field1) over (partition by tb1.field4.key_data) as
field1,
       first_value(tb1.field2) over (partition by tb1.field4.key_data) as
field2,
       first_value(tb1.field3) over (partition by tb1.field4.key_data) as
field3,
       first_value(tb1.field4) over (partition by tb1.field4.key_data) as
field4
  from dfs.`doc.json` tb1;
)

We get IndexOutOfBoundException.

We got better success with:
create table .... as (
 select * from
  (select tb1.*,
          row_number() over (partition by tb1.field4.key_data) as row_num
   from  dfs.`doc.json` tb1
  ) t
 where t.row_num = 1
)

This works on single json file or with multiple file in a session
configured with planner.width_max_per_node=1.

As soon as we put more than 1 thread per query, We get
IndexOutOfBoundException.
This was tried on 1.10 and 1.11.
It looks like a bug.


Would you have other suggestion to bypass that issue?
Is there an existing aggregation function (to work with group by) that
would return the first,last, or random MapHolder column from json document?
If not, I am thinking of implementing one, would there be an example on how
to Clone a MapHolder within a function? (pretty sure I can't assign "in"
param to output within a function)


Thank you for your time reading this.
any suggestions to try are welcome

Francois

Dedupping json records based on nested value

Reply via email to