I managed to implement a single UDF that returns a copy of a MapHolder input
var; it allowed me to figure out how to use SingleMapReaderImpl as input and
ComplexWriter as output.
I tried to move that approach into an aggregation function that looks like
the snippet below.
I want to return the first MapHolder value encountered in a group by
operation.
select firstvalue(tb1.field1), firstvalue(tb1.field2),
firstvalue(tb1.field3), firstvalue(tb1.field4) from dfs.`doc.json` tb1
group by tb1.field4.key_data
I get:
Error: SYSTEM ERROR: UnsupportedOperationException: Unable to get new
vector for minor type [LATE] and mode [REQUIRED]
I am not sure if we can use the "out" variable within the add() method.
Any hint from the experts to put me on track would be appreciated.
Thanks
Francois
@FunctionTemplate(name = "firstvalue",
    scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE)
public static class FirstValueComplex implements DrillAggFunc
{
  @Param
  MapHolder map;
  @Workspace
  BitHolder firstSeen;
  @Output
  ComplexWriter out;

  @Override
  public void setup()
  {
    firstSeen.value = 0;
  }

  @Override
  public void add()
  {
    if (firstSeen.value == 0)
    {
      // Cast the holder to the concrete reader so we can copy the map value
      org.apache.drill.exec.vector.complex.impl.SingleMapReaderImpl reader =
          (org.apache.drill.exec.vector.complex.impl.SingleMapReaderImpl) (Object) map;
      reader.copyAsValue(out.rootAsMap());
      firstSeen.value = 1;
    }
  }

  @Override
  public void output()
  {
  }

  @Override
  public void reset()
  {
    out.clear();
    firstSeen.value = 0;
  }
}
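For context, the semantics I am after can be sketched in plain Java. This is only an illustration of the first-value-per-group behavior, not Drill API; the class and method names are made up for the example:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch (not Drill code): "firstvalue" keeps the first
// record seen for each group-by key, ignoring all later rows in the group.
public class FirstValueSketch {
    public static Map<String, Map<String, Object>> firstValuePerGroup(
            List<Map<String, Object>> rows, String keyField) {
        Map<String, Map<String, Object>> firstSeen = new LinkedHashMap<>();
        for (Map<String, Object> row : rows) {
            String key = String.valueOf(row.get(keyField));
            // putIfAbsent mirrors the firstSeen flag in the aggregate:
            // only the first row per key is retained.
            firstSeen.putIfAbsent(key, row);
        }
        return firstSeen;
    }
}
```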
On 30 August 2017 at 16:57, François Méthot <[email protected]> wrote:
>
> Hi,
>
> Congrats on the 1.11 release, we are happy to have our suggestion
> implemented in the new release (automatic HDFS block size for Parquet
> files).
>
> It seems like we are pushing the limits of Drill with a new type of
> query... (I am learning new SQL tricks in the process.)
>
> We are trying to aggregate a json document based on a nested value.
>
> Document looks like this:
>
> {
> "field1" : {
> "f1_a" : "infoa",
> "f1_b" : "infob"
> },
> "field2" : "very long string",
> "field3" : {
> "f3_a" : "infoc",
> "f3_b" : "infod",
> "f4_c" : {
> ....
> }
> },
> "field4" : {
> "key_data" : "String to aggregate on",
> "f4_b" : "a string2",
> "f4_c" : {
> .... complex structure...
> }
> }
> }
>
>
> We want the first, last (or any) occurrence of field1, field2, field3 and
> field4, grouped by field4.key_data.
>
>
> Unfortunately the min and max functions do not support JSON complex columns
> (MapHolder). Therefore group-by type queries do not work.
>
> We tried a window function like this:
> create table .... as (
>   select first_value(tb1.field1) over (partition by tb1.field4.key_data)
> as field1,
>   first_value(tb1.field2) over (partition by tb1.field4.key_data) as
> field2,
>   first_value(tb1.field3) over (partition by tb1.field4.key_data) as
> field3,
>   first_value(tb1.field4) over (partition by tb1.field4.key_data) as
> field4
>   from dfs.`doc.json` tb1
> );
>
> We get IndexOutOfBoundsException.
>
> We got better success with:
> create table .... as (
> select * from
> (select tb1.*,
> row_number() over (partition by tb1.field4.key_data) as row_num
> from dfs.`doc.json` tb1
> ) t
> where t.row_num = 1
> )
>
> This works on a single JSON file, or with multiple files in a session
> configured with planner.width_max_per_node=1.
>
> As soon as we use more than 1 thread per query, we get
> IndexOutOfBoundsException.
> This was tried on 1.10 and 1.11.
> It looks like a bug.
>
>
> Would you have other suggestions to bypass that issue?
> Is there an existing aggregation function (to work with group by) that
> would return the first, last, or a random MapHolder column from a JSON
> document? If not, I am thinking of implementing one; would there be an
> example of how to clone a MapHolder within a function? (I am pretty sure
> I can't assign the "in" param to the output within a function.)
>
>
> Thank you for your time reading this.
> Any suggestions to try are welcome.
>
> Francois
>