[ 
https://issues.apache.org/jira/browse/DATAFU-23?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883473#comment-13883473
 ] 

Russell Jurney commented on DATAFU-23:
--------------------------------------

Yeah, I'm the author of ISOToHour and friends ;)

In this case, PadZero is required to rebuild an ISO 8601 string any time you 
conduct a grouping operation with the datetime type in Pig. This is a bug in 
the API: GetMonth, GetDay and GetHour return ints (0, 8) instead of zero-padded 
strings ('00', '08'). PadZero patches that bug, so it will be useful any time 
you group by dates with a datetime object.
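
To make the padding step concrete, here is a minimal plain-Java sketch of the 
logic. It is not the code in DATAFU-23.patch and it does not use the Pig UDF 
API; the class name PadZeroSketch and the helper toIsoHour are hypothetical 
names chosen for illustration only.

```java
/** Hypothetical stand-in for the padding step; NOT the DATAFU-23 patch itself. */
final class PadZeroSketch {

    /** Left-pads a non-negative integer below 10 with a single '0'. */
    static String padZero(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("expected a non-negative date part: " + value);
        }
        return value < 10 ? "0" + value : Integer.toString(value);
    }

    /** Rebuilds an hour-granularity ISO 8601 key from integer date parts (assumed UTC). */
    static String toIsoHour(int year, int month, int day, int hour) {
        return year + "-" + padZero(month) + "-" + padZero(day)
                + "T" + padZero(hour) + ":00:00.000Z";
    }

    public static void main(String[] args) {
        // Without padding the same parts would concatenate to 2005-1-1T1:00:00.000Z.
        System.out.println(toIsoHour(2005, 1, 1, 1)); // prints 2005-01-01T01:00:00.000Z
    }
}
```

In a Pig script the equivalent would be to wrap month, day and hour in the 
padding function before concatenating them.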

Avoiding the datetime type altogether and using ISO strings with the Truncate 
functions is one alternative. Rewriting date_time and breaking the APIs, because 
the implementation is so problematic (in this and other ways), is another. 
Adding truncate builtins to Pig itself that work on Pig's datetime is a third.
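
For comparison, the truncate-on-string approach can be sketched in plain Java 
with java.time. This only illustrates the idea of truncating an ISO 8601 
string to the hour; it is not how any existing DataFu or Pig function is 
implemented, the class and method names are hypothetical, and it assumes the 
input is a UTC timestamp ending in 'Z'.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

/** Hypothetical illustration of truncating an ISO 8601 string to the hour. */
final class IsoTruncateSketch {

    // Fixed-width output with milliseconds, always rendered in UTC.
    private static final DateTimeFormatter ISO_MILLIS_UTC =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
                    .withZone(ZoneOffset.UTC);

    /** Parses a UTC ISO 8601 timestamp and zeroes everything below the hour. */
    static String truncateToHour(String iso) {
        Instant hour = Instant.parse(iso).truncatedTo(ChronoUnit.HOURS);
        return ISO_MILLIS_UTC.format(hour);
    }

    public static void main(String[] args) {
        // prints 2005-01-01T01:00:00.000Z
        System.out.println(truncateToHour("2005-01-01T01:23:45.678Z"));
    }
}
```

Because the key stays a string throughout, the grouping key is already a 
well-formed ISO 8601 value and no padding step is needed afterwards.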

I'm open to making this more generally useful. This is something I need now, to 
fix the bug in the datetime API, so I've submitted it. It may be that this will 
live in our branch alone. I'll look at writing the more useful date rounding 
UDF later. That is probably the middle path.

> Create datafu.pig.util.PadZero to pad integers < 10 with 0s
> -----------------------------------------------------------
>
>                 Key: DATAFU-23
>                 URL: https://issues.apache.org/jira/browse/DATAFU-23
>             Project: DataFu
>          Issue Type: Improvement
>            Reporter: Russell Jurney
>         Attachments: DATAFU-23.patch
>
>
> /* Now group by time down to the hour, our time series granularity */
> grouped_by_time = GROUP bytes_in_out BY (GetYear(date_time), GetMonth(date_time),
>                                          GetDay(date_time), GetHour(date_time));
> bytes_per_hour = FOREACH grouped_by_time GENERATE
>                      FLATTEN(group) AS (year, month, day, hour),
>                      SUM(bytes_in_out.sc_bytes) AS total_sc_bytes,
>                      SUM(bytes_in_out.cs_bytes) AS total_cs_bytes;
> /* Now convert time elements back into a key for HBase */
> bytes_per_hour = FOREACH bytes_per_hour GENERATE
>                      ToDate(StringConcat(year, '-', month, '-', day, 'T', hour, ':00:00.000Z')) AS date_time,
>                      total_sc_bytes,
>                      total_cs_bytes;
>
> The code above generates malformed ISO 8601 dates such as 
> "2005-1-1T1:00:00.000Z", because the date parts come back as unpadded ints.
> A PadZero utility is therefore needed to regenerate ISO 8601 keys after 
> grouping by date pieces.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
