I managed to implement a single UDF that returns a copy of a MapHolder input
var; it allowed me to figure out how to use SingleMapReaderImpl as input and
ComplexWriter as output.
I tried to move that approach into an aggregation function that looks like
the snippet below.
I want to return the first MapHolder value encountered in a group by
operation.
select firstvalue(tb1.field1), firstvalue(tb1.field2),
firstvalue(tb1.field3), firstvalue(tb1.field4) from dfs.`doc.json` tb1
group by tb1.field4.key_data
I get:
Error: SYSTEM ERROR: UnsupportedOperationException: Unable to get new
vector for minor type [LATE] and mode [REQUIRED]
I am not sure if we can use the "out" variable within the add() method.
Any hint from the experts to put me on track would be appreciated.
Thanks
Francois
@FunctionTemplate(name = "firstvalue",
    scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE)
public static class FirstValueComplex implements DrillAggFunc
{
  @Param
  MapHolder map;
  @Workspace
  BitHolder firstSeen;
  @Output
  ComplexWriter out;

  @Override
  public void setup()
  {
    firstSeen.value = 0;
  }

  @Override
  public void add()
  {
    if (firstSeen.value == 0)
    {
      // Cast the holder to the concrete reader so we can copy the map value
      org.apache.drill.exec.vector.complex.impl.SingleMapReaderImpl reader =
          (org.apache.drill.exec.vector.complex.impl.SingleMapReaderImpl) (Object) map;
      reader.copyAsValue(out.rootAsMap());
      firstSeen.value = 1;
    }
  }

  @Override
  public void output()
  {
  }

  @Override
  public void reset()
  {
    out.clear();
    firstSeen.value = 0;
  }
}
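For context, the semantics I am after can be sketched in plain Java. This is only an illustration of the first-value-per-group behavior, not Drill API; the class and method names are made up for the example:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch (not Drill code): "firstvalue" keeps the first
// record seen for each group-by key, ignoring all later rows in the group.
public class FirstValueSketch {
    public static Map<String, Map<String, Object>> firstValuePerGroup(
            List<Map<String, Object>> rows, String keyField) {
        Map<String, Map<String, Object>> firstSeen = new LinkedHashMap<>();
        for (Map<String, Object> row : rows) {
            String key = String.valueOf(row.get(keyField));
            // putIfAbsent mirrors the firstSeen flag in the aggregate:
            // only the first row per key is retained.
            firstSeen.putIfAbsent(key, row);
        }
        return firstSeen;
    }
}
```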
On 30 August 2017 at 16:57, François Méthot <[email protected]> wrote:
>
> Hi,
>
> Congrats on the 1.11 release, we are happy to have our suggestion
> implemented in the new release (automatic HDFS block size for Parquet
> files).
>
> It seems like we are pushing the limits of Drill with a new type of
> query... (I am learning new SQL tricks in the process.)
>
> We are trying to aggregate a json document based on a nested value.
>
> Document looks like this:
>
> {
> "field1" : {
> "f1_a" : "infoa",
> "f1_b" : "infob"
> },
> "field2" : "very long string",
> "field3" : {
> "f3_a" : "infoc",
> "f3_b" : "infod",
> "f4_c" : {
> ....
> }
> },
> "field4" : {
> "key_data" : "String to aggregate on",
> "f4_b" : "a string2",
> "f4_c" : {
> .... complex structure...
> }
> }
> }
>
>
> We want the first, last (or any) occurrence of field1, field2, field3 and
> field4, grouped by field4.key_data.
>
>
> Unfortunately the min and max functions do not support JSON complex columns
> (MapHolder). Therefore group-by type queries do not work.
>
> We tried a window function like this:
> create table .... as (
>   select first_value(tb1.field1) over (partition by tb1.field4.key_data)
> as field1,
>   first_value(tb1.field2) over (partition by tb1.field4.key_data) as
> field2,
>   first_value(tb1.field3) over (partition by tb1.field4.key_data) as
> field3,
>   first_value(tb1.field4) over (partition by tb1.field4.key_data) as
> field4
>   from dfs.`doc.json` tb1
> );
>
> We get IndexOutOfBoundsException.
>
> We got better success with:
> create table .... as (
> select * from
> (select tb1.*,
> row_number() over (partition by tb1.field4.key_data) as row_num
> from dfs.`doc.json` tb1
> ) t
> where t.row_num = 1
> )
>
> This works on a single JSON file, or with multiple files in a session
> configured with planner.width_max_per_node=1.
>
> As soon as we use more than 1 thread per query, we get
> IndexOutOfBoundsException.
> This was tried on 1.10 and 1.11.
> It looks like a bug.
>
>
> Would you have other suggestions to bypass that issue?
> Is there an existing aggregation function (to work with group by) that
> would return the first, last, or a random MapHolder column from a JSON
> document? If not, I am thinking of implementing one; would there be an
> example of how to clone a MapHolder within a function? (I am pretty sure
> I can't assign the "in" param to the output within a function.)
>
>
> Thank you for your time reading this.
> Any suggestions to try are welcome.
>
> Francois
>