[
https://issues.apache.org/jira/browse/PIG-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237884#comment-13237884
]
Zhijie Shen commented on PIG-1314:
----------------------------------
Hi folks,
Below is my proposal draft. Any comments are welcome:-)
==
Proposal Title: Adding the Datetime Type as a Primitive for Pig
Student Name: Zhijie Shen
Student E-mail: [email protected]
Organization/Project: Apache Software Foundation - Pig
Assigned Mentor: Daniel Dai / Russell Jurney
Proposal Abstract:
Apache Pig is a platform for analyzing large data sets based on Hadoop.
Currently Pig does not support a primitive datetime type [1], which is a
much-desired feature. In this proposal, I explain my plan to implement the
primitive datetime type, including the details of my solution and schedule.
Additionally, I briefly introduce my background and my motivation for
applying to GSoC'12.
Detailed Description:
1. Understanding of the Project
1.1 What is Apache Pig?
Apache Pig is a platform for analyzing large data sets. Notably, at Yahoo! 40%
of all Hadoop jobs are run with Pig [5]. Pig has its own dataflow language,
named Pig Latin, which encapsulates map/reduce jobs step by step and offers
relational primitives such as LOAD, FOREACH, GROUP, FILTER and JOIN. Pig
provides many built-in functions, but also allows users to write their own
user-defined functions (UDFs) for particular purposes. There are further
benefits: Pig can operate on plain files directly without any schema
information; it has a flexible, nested data model, which is more compatible
with that of major programming languages; and it provides a debugging
environment.
1.2 Why is a primitive datetime type required?
Datetime is a conventional data type in many database management systems as
well as programming languages. Within the Hadoop ecosystem, Hive, which is an
analog of Pig, also supports a primitive datetime type (actually a timestamp).
In contrast, Pig does not fully support this type. Currently, users can only
use the string type for datetime data and rely on UDFs that take datetime
strings. However, Pig is primarily used to parse log data, and most log data
has datetime attributes.
Consequently, it is desirable for Pig to support the datetime type as a
primitive. By doing so, we can expect the following benefits: a more compact
serialized format, support for the conventional operators (+/-/==/!=/</>), a
dedicated, faster comparator, sortability, fewer runtime conversions from
strings, and no need for users to decide on the input datetime string format.
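To make the "compact serialized format" and "faster comparator" benefits
concrete, here is a minimal Java sketch. It is only an illustration under
stated assumptions: it uses java.time.Instant from the JDK as a stand-in for
the Joda type discussed later, and the 8-byte epoch-millisecond encoding and
the class and method names are hypothetical, not Pig's actual format or API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.time.Instant;

public class DatetimeEncodingSketch {
    // Hypothetical compact format: milliseconds since the Unix epoch, 8 bytes.
    public static byte[] encode(Instant t) {
        return ByteBuffer.allocate(Long.BYTES).putLong(t.toEpochMilli()).array();
    }

    // Inverse of encode(): rebuild the instant from the 8-byte representation.
    public static Instant decode(byte[] b) {
        return Instant.ofEpochMilli(ByteBuffer.wrap(b).getLong());
    }

    // Ordering two values is one long comparison; no string parsing is needed.
    public static int compare(Instant a, Instant b) {
        return Long.compare(a.toEpochMilli(), b.toEpochMilli());
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2012-03-25T10:15:30Z");
        byte[] asString = t.toString().getBytes(StandardCharsets.UTF_8);
        byte[] asLong = encode(t);
        System.out.println("string form: " + asString.length + " bytes");
        System.out.println("long form:   " + asLong.length + " bytes");
        System.out.println("round trip ok: " + decode(asLong).equals(t));
        System.out.println("sorts before t+1s: " + (compare(t, t.plusSeconds(1)) < 0));
    }
}
```

With a string representation, every comparison or sort would first have to
parse the text in some agreed format; with a numeric representation the order
of the values is the order of the numbers.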
2. Roadmap of Implementing the New Feature
2.1 To Do List
2.1.1 Adding Support in Antlr Parser
Pig Latin supports assigning data types explicitly, so the parser must be able
to recognize the “datetime” keyword and some constants, such as “now()” and
“today()”. The related syntax needs to be added to 5 ANTLR grammar files:
AliasMasker.g, AstPrinter.g, AstValidator.g, LogicalPlanGenerator.g and
QueryParser.g.
2.1.2 Adding Datetime as a Primitive
The datetime type should be added to the DataType class, and the basic
conversions between it and the other data types need to be defined.
Previously, the internal data structure relied on the Joda DateTime type, which
is more powerful than java.util.Date, but much easier to use than
java.util.Calendar. Hence it is wise to keep this convention.
Moreover, note that implicit type casts from/to the datetime type are not
allowed.
I also need to change the LoadCaster and StoreCaster interfaces to include
bytesToDateTime/toBytes(DateTime) methods, and add the implementations to the
classes that implement these two interfaces. In addition, I need to overload
the +/-/==/!=/</> operators for the datetime type, mapping them to some builtin
EvalFuncs. The TypeCheckingExpVisitor class needs to be modified as well to
support datetime type validation. One important issue is that, according to my
previous experience, the data type related code in Pig is widely spread, so I
need to be careful that all the related parts are touched.
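As a rough illustration of the bytesToDateTime/toBytes(DateTime) pair
mentioned above, the following Java sketch converts between text bytes and a
datetime value. Everything here is an assumption made for the example: the
class name is hypothetical, java.time.OffsetDateTime stands in for the Joda
DateTime type, ISO-8601 is assumed as the text format, and returning null for
malformed input is only one possible policy. It is not the actual LoadCaster or
StoreCaster code.

```java
import java.nio.charset.StandardCharsets;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class DatetimeCasterSketch {
    // Load side: interpret raw bytes as an ISO-8601 datetime with an offset.
    // Returns null when the input is missing or cannot be parsed (assumed policy).
    public static OffsetDateTime bytesToDateTime(byte[] b) {
        if (b == null) {
            return null;
        }
        try {
            return OffsetDateTime.parse(new String(b, StandardCharsets.UTF_8).trim());
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    // Store side: write the datetime back out in the same ISO-8601 form.
    public static byte[] toBytes(OffsetDateTime dt) {
        String text = DateTimeFormatter.ISO_OFFSET_DATE_TIME.format(dt);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = "2012-03-25T10:15:30+08:00".getBytes(StandardCharsets.UTF_8);
        OffsetDateTime dt = bytesToDateTime(raw);
        System.out.println(dt);
        System.out.println(new String(toBytes(dt), StandardCharsets.UTF_8));
    }
}
```

Keeping the two directions symmetric means a value that is loaded and then
stored again comes back as the same instant with the same offset.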
2.1.3 Refactoring of the Datetime Related UDFs
Thanks to Russell Jurney for having implemented a number of useful datetime
related UDFs, which can be utilized for the primitive datetime type as well.
Some of the UDF classes located in the
“org.apache.pig.piggybank.evaluation.datetime” package under the “contrib”
folder need to be moved to the “org.apache.pig.builtin” package under the “src”
folder. Below are the related UDFs:
int DiffDate(DateTime d1, DateTime d2)
int YearsBetween(DateTime d1, DateTime d2)
int MonthsBetween(DateTime d1, DateTime d2)
int DaysBetween(DateTime d1, DateTime d2)
int HoursBetween(DateTime d1, DateTime d2)
int MinutesBetween(DateTime d1, DateTime d2)
int SecondsBetween(DateTime d1, DateTime d2)
int GetYear(DateTime d1)
int GetMonth(DateTime d1)
int GetDate(DateTime d1)
int GetHour(DateTime d1)
int GetMinute(DateTime d1)
int GetSecond(DateTime d1)
DateTime DateAdd(DateTime d1)
String ToString(DateTime d, String format)
(Probably rename it DateTimeFormat)
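To give a feel for what a few of the functions listed above compute, here is a
small Java sketch of three of them. It is hedged in several ways: the methods
are plain static helpers rather than Pig EvalFunc classes,
java.time.LocalDateTime stands in for the Joda DateTime type, and the argument
order of daysBetween (d1 minus d2, counting complete days) is an assumption
that still has to be agreed on.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

public class DatetimeUdfSketch {
    // Complete days from d2 to d1; negative when d1 is earlier (assumed order).
    public static int daysBetween(LocalDateTime d1, LocalDateTime d2) {
        return Math.toIntExact(ChronoUnit.DAYS.between(d2, d1));
    }

    // The calendar year of the value.
    public static int getYear(LocalDateTime d) {
        return d.getYear();
    }

    // Formats the value with a caller-supplied pattern such as "yyyy-MM-dd".
    public static String dateTimeFormat(LocalDateTime d, String format) {
        return DateTimeFormatter.ofPattern(format).format(d);
    }

    public static void main(String[] args) {
        LocalDateTime later = LocalDateTime.of(2012, 3, 25, 12, 0);
        LocalDateTime earlier = LocalDateTime.of(2012, 3, 1, 0, 0);
        System.out.println(daysBetween(later, earlier));
        System.out.println(getYear(later));
        System.out.println(dateTimeFormat(later, "yyyy-MM-dd"));
    }
}
```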
The remaining UDFs can be eliminated, while their logic can be reused in the
primitive type conversion part, which was introduced in the previous section.
Below are the UDFs of this kind:
DateTime ToDate(String s)
DateTime ToDate(String s, String format)
DateTime ToDate(String s, String format, String timezone)
DateTime ToDate(long t)
String ToString(DateTime d)
long ToUnixTime(DateTime d)
The following additional UDFs are probably also required; I need to discuss
these with the community:
DateTime Now()
DateTime Today()
bool IsDateTime(String s)
2.1.4 Test Cases
A large number of test cases are required to test the parser, the datetime
operations and conversions, and loading from / storing into secondary
storage.
2.1.5 Documentation
A user manual is required to describe how to use the datetime primitive,
including the input format and the supported built-in functions.
2.2 Project Schedule
During the summer, I will not have much workload apart from writing my Ph.D.
thesis. Hence it is possible for me to spend around 40 hours per week on this
project. The concrete schedule is summarized as follows:
Present - May 20 (before the official start of Summer of Code): Reading the
related code in detail, and keeping in touch with the community to clarify some
issues, such as the necessary built-in UDFs and the rules of data conversion.
May 21 - Jun 3 (two weeks): Adding datetime to the primitive type list, and
completing the functionality of parsing the datetime keyword and constants,
such that a string representing a datetime can be recognized in Pig Latin
scripts.
Jun 4 - Jun 24 (three weeks): Implementing type conversion (from/to string) and
the loading/storing cast functionality. After this step, data of the datetime
type can be correctly read from / stored into secondary storage.
Jun 25 - Jul 8 (two weeks until mid-term evaluation): Completing the remaining
part of the type conversion (e.g., between the datetime type and the long
type), dealing with any issues that have not been foreseen yet, and preparing
for the mid-term evaluation.
Jul 9 - Jul 29 (three weeks): Refactoring the datetime related UDFs, adding new
required UDFs, and overloading the primitive operators, such that all the
defined operations on datetime values are supported after this step.
Jul 30 - Aug 5 (one week): Writing the test cases to systematically verify the
code, and fixing any bugs they reveal. After this step, the coding part is
nearly done.
Aug 6 - Aug 12 (one week until final evaluation): Writing the user manual to
show how to work with the datetime type, and preparing for the final
evaluation.
Additional Information:
I am a Ph.D. student at the National University of Singapore. My research
topics are large scale multimedia systems, geo-referenced video systems and P2P
video streaming. In addition to research, I love programming and have long-term
experience in several languages, including Java. Moreover, I am quite
interested in distributed systems and big data, and have acquired solid
background knowledge. I took the course "Parallel and Distributed Databases",
for which I drafted a survey of cloud storage systems (including Pig) [4] and
obtained an A+.
Notably, I am an open source advocate, and have contributed to it to some
extent. Last year, I participated in GSoC with a Pig project. I successfully
implemented the nested cross feature [2]. I also went beyond my proposed task
by contributing one more patch, which added the primitive boolean type [3] and
is somewhat similar to the task proposed for this year's GSoC. Therefore, I am
quite familiar with this task and confident of completing it on time. Last but
not least, I enjoy long-term participation in the Pig community, and am willing
to keep contributing to it.
Reference:
[1] https://issues.apache.org/jira/browse/PIG-1314
[2] https://issues.apache.org/jira/browse/PIG-1916
[3] https://issues.apache.org/jira/browse/PIG-1429
[4] http://www.comp.nus.edu.sg/~z-shen/survey.pdf
[5] http://wiki.apache.org/pig/OldFrontPage
> Add DateTime Support to Pig
> ---------------------------
>
> Key: PIG-1314
> URL: https://issues.apache.org/jira/browse/PIG-1314
> Project: Pig
> Issue Type: Bug
> Components: data
> Affects Versions: 0.7.0
> Reporter: Russell Jurney
> Assignee: Russell Jurney
> Labels: gsoc2012
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> Hadoop/Pig are primarily used to parse log data, and most logs have a
> timestamp component. Therefore Pig should support dates as a primitive.
> Can someone familiar with adding types to pig comment on how hard this is?
> We're looking at doing this, rather than use UDFs. Is this a patch that
> would be accepted?
> This is a candidate project for Google summer of code 2012. More information
> about the program can be found at
> https://cwiki.apache.org/confluence/display/PIG/GSoc2012