[ 
https://issues.apache.org/jira/browse/TAJO-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502318#comment-14502318
 ] 

Jihoon Son edited comment on TAJO-1562 at 4/22/15 6:04 AM:
-----------------------------------------------------------

Hi guys. This is the first proposal.
Honestly, I'm not very familiar with Python, so this proposal may look odd.
Any suggestions and comments are welcome.

I investigated several features of Python and concluded that Python classes
are an appropriate way to support UDAFs. That is, users can define a new UDAF
by defining a Python class that inherits from a pre-defined AbstractUdaf class.
Here is an example.

*AvgPy class*
{code}
from tajo_util import output_type


class AvgPy:
    sum = 0
    cnt = 0

    def __init__(self, sum=0, cnt=0):
        self.sum = sum
        self.cnt = cnt

    # eval at the first stage
    def eval(self, item):
        self.sum += item
        self.cnt += 1

    # get intermediate result
    @output_type('int4', 'int4')
    def get_partial_result(self):
        return [self.sum, self.cnt]

    # merge intermediate results
    def merge(self, sum, cnt):
        self.sum += sum
        self.cnt += cnt

    # get final result
    @output_type('float4')
    def get_final_result(self):
        # float() avoids integer division under Python 2
        return float(self.sum) / self.cnt
{code}
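
To make the two-stage flow concrete, here is a minimal sketch of how an engine might drive such a class. It is only an illustration under assumptions: the real tajo_util module and AbstractUdaf are not shown in this proposal, so output_type below is a stand-in decorator that merely records the declared types, AbstractUdaf is a hypothetical abstract base, and run_two_stage is a hypothetical driver rather than Tajo's actual execution path.

```python
from abc import ABC, abstractmethod


def output_type(*types):
    """Stand-in for tajo_util.output_type (assumed): records the declared types."""
    def decorate(func):
        func.output_types = types
        return func
    return decorate


class AbstractUdaf(ABC):
    """Hypothetical base class; the real AbstractUdaf is not shown in the proposal."""

    @abstractmethod
    def eval(self, item): ...

    @abstractmethod
    def get_partial_result(self): ...

    @abstractmethod
    def merge(self, *partial): ...

    @abstractmethod
    def get_final_result(self): ...


class AvgPy(AbstractUdaf):
    def __init__(self, sum=0, cnt=0):
        self.sum = sum
        self.cnt = cnt

    # first stage: consume one input value
    def eval(self, item):
        self.sum += item
        self.cnt += 1

    @output_type('int4', 'int4')
    def get_partial_result(self):
        return [self.sum, self.cnt]

    # second stage: fold in one partial result
    def merge(self, sum, cnt):
        self.sum += sum
        self.cnt += cnt

    @output_type('float4')
    def get_final_result(self):
        return float(self.sum) / self.cnt


def run_two_stage(udaf_class, partitions):
    """Hypothetical driver: first stage per partition, then one merge stage."""
    partials = []
    for rows in partitions:
        first = udaf_class()
        for row in rows:
            first.eval(row)
        partials.append(first.get_partial_result())

    final = udaf_class()
    for partial in partials:
        final.merge(*partial)
    return final.get_final_result()


result = run_two_stage(AvgPy, [[1, 2, 3], [4, 5]])
```

Note that merge() takes the partial sum and count rather than a single item, so the final count reflects the number of input rows and not the number of partitions.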

To support this form of UDAF, we need a general way to maintain the
aggregated values, e.g., the sum and cnt fields of the AvgPy example, between
different stages. I think this can be solved by serializing/deserializing them
as a tuple, as follows.

{code}
message NamedDatum {
  required string name = 1;
  required Datum val = 2;
}

message NamedTuple {
  repeated NamedDatum datums = 1;
}
{code}
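
As a rough illustration of the idea, the state of a UDAF instance could be flattened into (name, value) pairs and restored at the next stage. The sketch below uses plain Python dicts and lists as stand-ins for the NamedDatum and NamedTuple messages; AvgState, to_named_tuple and from_named_tuple are hypothetical names, and the actual Datum encoding is left out.

```python
class AvgState:
    """Minimal aggregation state, mirroring the sum/cnt fields of AvgPy."""

    def __init__(self, sum=0, cnt=0):
        self.sum = sum
        self.cnt = cnt


def to_named_tuple(obj):
    """Flatten instance attributes into a list of {'name', 'val'} entries."""
    return [{'name': name, 'val': val} for name, val in sorted(vars(obj).items())]


def from_named_tuple(cls, named_tuple):
    """Rebuild an instance by passing each named datum as a keyword argument."""
    return cls(**{datum['name']: datum['val'] for datum in named_tuple})


state = AvgState(sum=15, cnt=5)
wire = to_named_tuple(state)                 # what would cross the stage boundary
restored = from_named_tuple(AvgState, wire)  # state rebuilt on the next stage
```

This approach assumes the constructor accepts keyword arguments named after the instance attributes, as the __init__(self, sum=0, cnt=0) signature in the example does.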


was (Author: jihoonson):
Hi guys. This is the first proposal.
Honestly, I'm not very familiar with Python, so this proposal may look odd.
Any suggestions and comments are welcome.

I investigated several features of Python and concluded that Python classes
are an appropriate way to support UDAFs. That is, users can define a new UDAF
by defining a Python class that inherits from a pre-defined AbstractUdaf class.
Here is an example.

*AvgPy class*
{code}
from tajo_util import output_type


class AvgPy:
    sum = 0
    cnt = 0

    def __init__(self, sum=0, cnt=0):
        self.sum = sum
        self.cnt = cnt

    # eval at the first stage
    def eval(self, item):
        self.sum += item
        self.cnt += 1

    # get intermediate result
    @output_type('int4', 'int4')
    def get_interm_result(self):
        return [self.sum, self.cnt]

    # merge intermediate results
    def merge(self, item):
        self.sum += item
        self.cnt += 1

    # get final result
    @output_type('float4')
    def get_final_result(self):
        return self.sum / self.cnt
{code}

To support this form of UDAF, we need a general way to maintain the
aggregated values, e.g., the sum and cnt fields of the AvgPy example, between
different stages. I think this can be solved by serializing/deserializing them
as a tuple, as follows.

{code}
message NamedDatum {
  required string name = 1;
  required Datum val = 2;
}

message NamedTuple {
  repeated NamedDatum datums = 1;
}
{code}

> Python UDAF support
> -------------------
>
>                 Key: TAJO-1562
>                 URL: https://issues.apache.org/jira/browse/TAJO-1562
>             Project: Tajo
>          Issue Type: New Feature
>          Components: function/udf
>            Reporter: Jihoon Son
>            Assignee: Jihoon Son
>             Fix For: 0.11.0
>
>
> We need to support Python UDAF as well as UDF (TAJO-1344). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
