[jira] [Commented] (PIG-1314) Add DateTime Support to Pig

Zhijie Shen (Commented) (JIRA) Sun, 25 Mar 2012 07:45:54 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237884#comment-13237884
 ]


Zhijie Shen commented on PIG-1314:
----------------------------------

Hi folks,

Below is my proposal draft. Any comments are welcome:-)

==

Proposal Title: Adding the Datetime Type as a Primitive for Pig


Student Name: Zhijie Shen 
Student E-mail: [email protected] 

Organization/Project: Apache Software Foundation - Pig 
Assigned Mentor: Daniel Dai /Russell Jurney


Proposal Abstract: 

Apache Pig is a platform for analyzing large data sets based on Hadoop.
Currently Pig does not support a primitive datetime type [1], which is a
desired feature. In this proposal, I explain my plan to implement the
primitive datetime type, including the details of my solution and schedule.
Additionally, I briefly introduce my background and my motivation for
applying to GSoC'12.

Detailed Description: 

1. Understanding of the Project

1.1 What is Apache Pig?

Apache Pig is a platform for analyzing large data sets. Notably, at Yahoo! 40%
of all Hadoop jobs are run with Pig [5]. Pig has its own dataflow language,
named Pig Latin, which encapsulates map/reduce jobs step by step and offers
relational primitives such as LOAD, FOREACH, GROUP, FILTER and JOIN. Pig
provides many built-in functions, and also allows users to write their own
user-defined functions (UDFs) for particular purposes. There are further
benefits: Pig can operate on plain files directly without any schema
information; it has a flexible, nested data model, which is more compatible
with that of major programming languages; and it provides a debugging
environment.

1.2 Why is a primitive datetime type required?

Datetime is a conventional data type in many database management systems as
well as programming languages. Within the Hadoop ecosystem, Hive, which is an
analog of Pig, also supports a primitive datetime type (timestamp, actually).
In contrast, Pig does not fully support this type. Currently, users can only
store datetime data as strings and rely on UDFs that take datetime strings.
However, Pig is primarily meant to parse log data, and most log data has
datetime attributes.

Consequently, it is desirable for Pig to support the datetime type as a
primitive. By doing so, we can expect the following benefits: a more compact
serialized format, support for the conventional operators (+/-/==/!=/</>), a
dedicated, faster comparator, sortability, fewer runtime conversions from
strings, and relieving users from deciding the input datetime string format.
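
As a rough illustration of the compact format and the dedicated comparator,
the sketch below contrasts an ISO-8601 string with an 8-byte encoding of
milliseconds since the epoch, and orders two encoded values numerically. It is
not Pig code: the class and method names are hypothetical, java.time stands in
for Joda-Time so that the sketch runs on a plain JDK, and the 8-byte layout is
only one possible serialized format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.time.Instant;

// Hypothetical sketch, not part of Pig.
class CompactDateTimeSketch {
    // Encode an instant as 8 bytes: milliseconds since the Unix epoch.
    static byte[] encode(Instant t) {
        return ByteBuffer.allocate(Long.BYTES).putLong(t.toEpochMilli()).array();
    }

    // Decode the 8-byte form back into an instant.
    static Instant decode(byte[] bytes) {
        return Instant.ofEpochMilli(ByteBuffer.wrap(bytes).getLong());
    }

    // Order two encoded values by their long payload; no string parsing needed.
    static int compareEncoded(byte[] a, byte[] b) {
        return Long.compare(ByteBuffer.wrap(a).getLong(),
                            ByteBuffer.wrap(b).getLong());
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2012-03-25T14:45:54Z");
        byte[] asText = t.toString().getBytes(StandardCharsets.UTF_8);
        byte[] asLong = encode(t);
        System.out.println(asText.length + " bytes as text, "
                + asLong.length + " bytes as a long");
    }
}
```

A string-based comparator would have to parse both operands, or depend on a
fixed, lexicographically sortable format, before it could order them; the
numeric form avoids both.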


2. Roadmap of Implementing the New Feature

2.1 To Do List

2.1.1  Adding Support in Antlr Parser

Pig Latin allows data types to be assigned explicitly, so the parser must
recognize the “datetime” keyword and some constants, such as “now()” and
“today()”. The related syntax needs to be added to five ANTLR grammar files:
AliasMasker.g, AstPrinter.g, AstValidator.g, LogicalPlanGenerator.g and
QueryParser.g.

2.1.2 Adding Datetime as a Primitive

The datetime type should be added to the DataType class, and the basic
conversions between it and the other data types need to be defined.
Previously, the internal data structure relied on the Joda DateTime type,
which is more powerful than java.util.Date and much easier to use than
java.util.Calendar. Hence it is wise to keep this convention. Moreover, note
that implicit type casts from/to the datetime type are not allowed.

I also need to change the LoadCaster and StoreCaster interfaces to include
bytesToDateTime/toBytes(DateTime) methods, and to implement them in the
classes that implement these two interfaces. In addition, I need to overload
the +/-/==/!=/</> operators for the datetime type, mapping them to some
built-in EvalFuncs. The TypeCheckingExpVisitor class needs to be modified as
well to support datetime type validation. One important issue is that,
according to my previous experience, the data type related code in Pig is
widely spread, so I need to make sure that all the related parts are touched.
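
To make the caster change more concrete, here is a minimal sketch of what a
text-based bytesToDateTime/toBytes pair might look like. Everything in it is
an assumption for illustration: the class name is hypothetical, the methods
are not tied to the real LoadCaster/StoreCaster signatures, java.time's
OffsetDateTime stands in for Joda's DateTime so that the sketch runs on a
plain JDK, ISO-8601 is assumed as the text format, and returning null for
malformed input is just one possible policy.

```java
import java.nio.charset.StandardCharsets;
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;

// Hypothetical sketch, not the actual Pig caster implementation.
class TextDateTimeCasterSketch {
    // Load side: interpret raw UTF-8 bytes as an ISO-8601 datetime with offset.
    // Assumption: malformed input yields null instead of failing the job.
    static OffsetDateTime bytesToDateTime(byte[] b) {
        if (b == null) {
            return null;
        }
        try {
            String text = new String(b, StandardCharsets.UTF_8).trim();
            return OffsetDateTime.parse(text);
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    // Store side: write the datetime back as ISO-8601 text in UTF-8.
    static byte[] toBytes(OffsetDateTime dt) {
        return dt.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = "2012-03-25T07:45:54-07:00".getBytes(StandardCharsets.UTF_8);
        OffsetDateTime dt = bytesToDateTime(raw);
        System.out.println(new String(toBytes(dt), StandardCharsets.UTF_8));
    }
}
```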

2.1.3 Refactoring of the Datetime Related UDFs

Thanks to Russell Jurney, a number of useful datetime related UDFs have
already been implemented, and they can be used with the primitive datetime
type as well. Some of the UDF classes located in the
“org.apache.pig.piggybank.evaluation.datetime” package under the “contrib”
folder need to be moved to the “org.apache.pig.builtin” package under the
“src” folder. Below are the related UDFs:

int DiffDate(DateTime d1, DateTime d2)
int YearsBetween(DateTime d1, DateTime d2)
int MonthsBetween(DateTime d1, DateTime d2)
int DaysBetween(DateTime d1, DateTime d2)
int HoursBetween(DateTime d1, DateTime d2)
int MinutesBetween(DateTime d1, DateTime d2)
int SecondsBetween(DateTime d1, DateTime d2)
int GetYear(DateTime d1)
int GetMonth(DateTime d1)
int GetDate(DateTime d1)
int GetHour(DateTime d1)
int GetMinute(DateTime d1)
int GetSecond(DateTime d1)
DateTime DateAdd(DateTime d1)
String ToString(DateTime d, String format) (probably to be renamed
DateTimeFormat)
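
The intended behaviour of a few of these functions can be sketched as plain
static methods. This is only an illustration under stated assumptions: the
class is hypothetical and not wrapped as a Pig EvalFunc, java.time's
LocalDateTime stands in for Joda's DateTime so that the sketch runs on a
plain JDK, and the sign convention (first argument minus second, truncated to
whole units) is my assumption rather than something fixed above.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch of some UDF semantics, not the piggybank code.
class DateTimeUdfSketch {
    // Whole days from d2 up to d1 (assumed convention: d1 minus d2).
    static int daysBetween(LocalDateTime d1, LocalDateTime d2) {
        return (int) ChronoUnit.DAYS.between(d2, d1);
    }

    // Whole hours from d2 up to d1; partial hours are truncated.
    static int hoursBetween(LocalDateTime d1, LocalDateTime d2) {
        return (int) ChronoUnit.HOURS.between(d2, d1);
    }

    static int getYear(LocalDateTime d) {
        return d.getYear();
    }

    static int getMonth(LocalDateTime d) {
        return d.getMonthValue(); // 1 = January ... 12 = December
    }

    public static void main(String[] args) {
        LocalDateTime start = LocalDateTime.of(2012, 5, 21, 0, 0);
        LocalDateTime end = LocalDateTime.of(2012, 8, 12, 0, 0);
        System.out.println(daysBetween(end, start)); // 83
    }
}
```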

The remaining UDFs can be eliminated, while their logic can be reused in the
primitive type conversion part, which was introduced in the previous section.
Below are the UDFs of this kind:

DateTime ToDate(String s)
DateTime ToDate(String s, String format)
DateTime ToDate(String s, String format, String timezone)
DateTime ToDate(long t)
String ToString(DateTime d)
long ToUnixTime(DateTime d)
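
The conversion logic behind these functions might look roughly like the
sketch below. The assumptions are mine: the class is hypothetical, java.time
stands in for Joda-Time so that the sketch runs on a plain JDK, the long
argument is taken to be milliseconds since the epoch (as Joda's
DateTime(long) constructor expects), and ToUnixTime is taken to return
seconds.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of the string/long conversions, not Pig code.
class DateTimeConversionSketch {
    // Parse a local datetime string with an explicit pattern and time zone.
    static ZonedDateTime toDate(String s, String format, String timezone) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern(format);
        return LocalDateTime.parse(s, fmt).atZone(ZoneId.of(timezone));
    }

    // Interpret a long as milliseconds since the Unix epoch (assumption), in UTC.
    static ZonedDateTime toDate(long millis) {
        return Instant.ofEpochMilli(millis).atZone(ZoneOffset.UTC);
    }

    // Seconds since the Unix epoch (assumption: seconds, not milliseconds).
    static long toUnixTime(ZonedDateTime d) {
        return d.toEpochSecond();
    }

    public static void main(String[] args) {
        ZonedDateTime d = toDate("2012-03-25 07:45:54",
                "yyyy-MM-dd HH:mm:ss", "America/Los_Angeles");
        System.out.println(d + " -> " + toUnixTime(d));
    }
}
```

If these become casts rather than UDFs, the unit of the long value (seconds
or milliseconds) is exactly the kind of conversion rule that needs to be
agreed on with the community.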

The following additional UDFs are probably also required; I need to discuss
these with the community:

DateTime Now()
DateTime Today()
bool IsDateTime(String s)

2.1.4 Test Cases

A large number of test cases are required to test the parser, the datetime
operations and conversions, and loading from / storing into secondary
storage.

2.1.5 Documentation

A user manual is required to describe how to use the datetime primitive,
including the input format and the supported built-in functions.

2.2 Project Schedule 

During the summer, I will not have much workload apart from writing my Ph.D.
thesis. Hence it is possible for me to spend around 40 hours per week on this
project. The concrete schedule is summarized as follows:

Present - May 20 (before the official start of Summer of Code): Reading the
related code in detail, and keeping in touch with the community to clarify
some issues, such as the necessary built-in UDFs and the rules of data
conversion.

May 21 - Jun 3 (two weeks): Adding datetime to the primitive type list, and
completing the functionality of parsing the datetime keyword and constants,
such that a string representing a datetime can be recognized in Pig Latin
scripts.

Jun 4 - Jun 24 (three weeks): Implementing type conversion (from/to string)
and the loading/storing cast functionality. After this step, data of the
datetime type can be correctly read from / stored into secondary storage.

Jun 25 - Jul 8 (two weeks until mid-term evaluation): Completing the remaining
part of the type conversion (e.g., between the datetime type and the long
type), dealing with any issues that have not been foreseen yet, and preparing
for the mid-term evaluation.

Jul 9 - Jul 29 (three weeks): Refactoring the datetime related UDFs, adding new 
required UDFs, and overloading the primitive operators, such that all the 
defined operations on datetime values are supported after this step.

Jul 30 - Aug 5 (one week): Writing the test cases to systematically verify the
code, and fixing any bugs found. After this step, the coding part is nearly
done.

Aug 6 - Aug 12 (one week until final evaluation): Writing the user manual to
show how to work with the datetime type, and preparing for the final
evaluation.

Additional Information: 

I am a Ph.D. student at the National University of Singapore. My research
topics are large scale multimedia systems, geo-referenced video systems and
P2P video streaming. In addition to research, I love programming and have
long-term experience in several languages, including Java. Moreover, I am
quite interested in distributed systems and big data, and have acquired solid
background knowledge. I took the course "Parallel and Distributed Databases",
drafted a survey of cloud storage systems (including Pig) [4] and obtained an
A+ grade.

Notably, I am an open source advocate, and have contributed to open source to
some extent. Last year, I participated in GSoC with a Pig project, and
successfully implemented the nested cross feature [2]. I also exceeded my
proposed task by contributing one more patch, which added the primitive
boolean type [3] and is somewhat similar to the task proposed for this year's
GSoC. Therefore, I am quite familiar with this task and confident of
completing it on time. Last but not least, I enjoy my long-term participation
in the Pig community, and am willing to keep contributing to it.


Reference:

[1] https://issues.apache.org/jira/browse/PIG-1314
[2] https://issues.apache.org/jira/browse/PIG-1916
[3] https://issues.apache.org/jira/browse/PIG-1429
[4] http://www.comp.nus.edu.sg/~z-shen/survey.pdf
[5] http://wiki.apache.org/pig/OldFrontPage
                
> Add DateTime Support to Pig
> ---------------------------
>
>                 Key: PIG-1314
>                 URL: https://issues.apache.org/jira/browse/PIG-1314
>             Project: Pig
>          Issue Type: Bug
>          Components: data
>    Affects Versions: 0.7.0
>            Reporter: Russell Jurney
>            Assignee: Russell Jurney
>              Labels: gsoc2012
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Hadoop/Pig are primarily used to parse log data, and most logs have a 
> timestamp component.  Therefore Pig should support dates as a primitive.
> Can someone familiar with adding types to pig comment on how hard this is?  
> We're looking at doing this, rather than use UDFs.  Is this a patch that 
> would be accepted?
> This is a candidate project for Google summer of code 2012. More information 
> about the program can be found at 
> https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
