Re: Dedupping json records based on nested value

2017-08-31 Thread Aman Sinha
Hi Francois,
I haven't read the details of your use case but just want to make sure you
have looked at the nested data functions and ruled them out for your
requirement:  https://drill.apache.org/docs/nested-data-functions/
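
For instance, kvgen() and flatten() from that page can turn a map into one
row per key/value pair. A minimal sketch against your sample document
(assuming field1's values all share one type, as in your example):

select flatten(kvgen(tb1.field1)) as f1_entry
from dfs.`doc.json` tb1;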

-Aman



Re: Dedupping json records based on nested value

2017-08-31 Thread François Méthot
I managed to implement a simple UDF that returns a copy of a MapHolder input
parameter; it let me figure out how to use SingleMapReaderImpl as the input
and ComplexWriter as the output.

I tried to move that approach into an aggregate function that looks like
the snippet below.
I want to return the first MapHolder value encountered in a GROUP BY
operation.

select firstvalue(tb1.field1), firstvalue(tb1.field2),
       firstvalue(tb1.field3), firstvalue(tb1.field4)
from dfs.`doc.json` tb1
group by tb1.field4.key_data;

I get:
Error: SYSTEM ERROR: UnsupportedOperationException: Unable to get new
vector for minor type [LATE] and mode [REQUIRED]

I am not sure whether we can use the "out" variable within the add() method.
Any hint from the experts to put me on track would be appreciated.

Thanks
Francois


@FunctionTemplate(name = "firstvalue",
    scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE)
public static class FirstValueComplex implements DrillAggFunc
{
  @Param
  MapHolder map;

  @Workspace
  BitHolder firstSeen;  // 0 until the first value of the group has been copied

  @Output
  ComplexWriter out;

  @Override
  public void setup()   // DrillAggFunc declares setup() in lowercase
  {
    firstSeen.value = 0;
  }

  @Override
  public void add()
  {
    if (firstSeen.value == 0)
    {
      // Cast the holder to its concrete reader; a new name avoids
      // shadowing the @Param field.
      org.apache.drill.exec.vector.complex.impl.SingleMapReaderImpl mapReader =
          (org.apache.drill.exec.vector.complex.impl.SingleMapReaderImpl) (Object) map;
      // Copy the first map of the group into the output writer.
      mapReader.copyAsValue(out.rootAsMap());
      firstSeen.value = 1;
    }
  }

  @Override
  public void output()
  {
    // Nothing to do here; the value was already written during add().
  }

  @Override
  public void reset()
  {
    out.clear();
    firstSeen.value = 0;
  }
}



Dedupping json records based on nested value

2017-08-30 Thread François Méthot
Hi,

Congrats on the 1.11 release; we are happy to see our suggestion
implemented (automatic HDFS block size for Parquet files).

It seems like we are pushing the limits of Drill with a new type of query
(I am learning new SQL tricks in the process).

We are trying to aggregate JSON documents based on a nested value.

Each document looks like this:

{
  "field1" : {
    "f1_a" : "infoa",
    "f1_b" : "infob"
  },
  "field2" : "very long string",
  "field3" : {
    "f3_a" : "infoc",
    "f3_b" : "infod",
    "f4_c" : {
    }
  },
  "field4" : {
    "key_data" : "String to aggregate on",
    "f4_b" : "a string2",
    "f4_c" : {
      complex structure...
    }
  }
}


We want the first, last, or any occurrence of field1, field2, field3 and
field4, grouped by field4.key_data.


Unfortunately, the min and max functions do not support complex JSON
columns (MapHolder), so GROUP BY queries do not work.

We tried a window function like this:
create table  as (
  select first_value(tb1.field1) over (partition by tb1.field4.key_data) as field1,
     first_value(tb1.field2) over (partition by tb1.field4.key_data) as field2,
     first_value(tb1.field3) over (partition by tb1.field4.key_data) as field3,
     first_value(tb1.field4) over (partition by tb1.field4.key_data) as field4
  from dfs.`doc.json` tb1
);

We get an IndexOutOfBoundsException.

We had better success with:
create table  as (
  select * from
    (select tb1.*,
            row_number() over (partition by tb1.field4.key_data) as row_num
     from dfs.`doc.json` tb1
    ) t
  where t.row_num = 1
);

This works on a single JSON file, or with multiple files in a session
configured with planner.width_max_per_node=1.
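For reference, that session option is set with the standard Drill
statement:

alter session set `planner.width_max_per_node` = 1;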

As soon as we use more than one thread per query, we get an
IndexOutOfBoundsException.
This happens on both 1.10 and 1.11.
It looks like a bug.


Would you have other suggestions for working around that issue?
Is there an existing aggregate function (usable with GROUP BY) that would
return the first, last, or an arbitrary MapHolder column from a JSON
document?
If not, I am thinking of implementing one; is there an example of how to
clone a MapHolder within a function? (I am pretty sure I can't assign the
"in" param to the output within a function.)


Thank you for taking the time to read this; any suggestions to try are
welcome.

Francois