In conversation with Matt, an issue has come up that merits broader
discussion on this list.

Background:

Pig and DataFu follow very different release schedules:

1) Pig is released about twice a year (14 releases in 6 years). Getting UDF
code into Pig (builtin) or Piggybank is a major undertaking. What is more,
the popular Hadoop distributions (Cloudera, Hortonworks, MapR, Pivotal) lag
behind the current Apache version of Pig by a year or more. In other words,
a simple UDF added to Pig can take a year and a half to actually reach
users.

2) DataFu releases every month or two, as new features are added. Using
DataFu is as simple as grabbing a jar file, so it isn't tied to a
distribution (although several include it). One needn't upgrade Pig to use
new features of DataFu.

This leads to an interesting situation... take PIG-3939
<https://issues.apache.org/jira/browse/PIG-3939>, which added SPRINTF
<http://pig.apache.org/docs/r0.14.0/api/org/apache/pig/builtin/SPRINTF.html>
as a Pig builtin in Pig 0.14, released in November 2014. In practice, Pig
users wanting SPRINTF must wait for the distributions to include Pig 0.14,
which could take a year or more. When you factor in the six months
between the patch's submission (June 2014) and release (November 2014),
it could take two or more years for most users to get the SPRINTF feature.

Issue:

For me, this raises the question: why don't we add SPRINTF to DataFu, so
that older versions of Pig (before 0.14) can have this feature? I happen to
be in a situation where we're using CDH 5.2/Pig 0.12, and we need SPRINTF.
I think this is a common situation.

So the question I'm raising is: *Is it appropriate to implement UDF/builtin
features of Pig in DataFu, to enable older versions of Pig to use them and
dramatically decrease the delay until users can start using them?*

In the case of SPRINTF, I believe we should add it to DataFu. I've created
DATAFU-85 <https://issues.apache.org/jira/browse/DATAFU-85> to track this
issue. The Hadoop distributions won't ship 0.14 for some time. The majority
of Hadoop users will be using Pig 0.12 for several years. Adding this kind
of feature will benefit users in the meantime.
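
To make the proposal concrete, here is a rough sketch of what the core of
such a function might look like. It is plain Java with no Pig dependency;
the class and method names (SprintfSketch, sprintf) are hypothetical, and
returning null when any argument is null is an assumption on my part, not
something taken from the PIG-3939 patch. A real UDF would wrap logic like
this in a class extending org.apache.pig.EvalFunc<String> and read its
arguments from the input Tuple.

```java
import java.util.Locale;

// Hypothetical sketch; not the code from PIG-3939 or DATAFU-85.
public class SprintfSketch {

    // Formats args according to a java.util.Formatter-style format string.
    // Assumption: a null format or any null argument yields null, so that
    // missing fields do not show up as the literal text "null".
    public static String sprintf(String format, Object... args) {
        if (format == null || args == null) {
            return null;
        }
        for (Object arg : args) {
            if (arg == null) {
                return null;
            }
        }
        // Locale.ROOT keeps number formatting stable across cluster nodes.
        return String.format(Locale.ROOT, format, args);
    }

    public static void main(String[] argv) {
        // Prints: events has 00042 rows
        System.out.println(sprintf("%s has %05d rows", "events", 42));
    }
}
```

Since this is little more than a thin layer over String.format, nothing
in it should depend on a particular Pig version.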

-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com