Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Pig Wiki" for
change notification.
The following page has been changed by AlanGates:
http://wiki.apache.org/pig/ProposedProjects
New page:
= Proposed Pig Projects =
This page describes projects that we (the committers) would like to see
added to Pig. The scale of these projects varies, but they are larger
projects, usually on the scale of weeks or months. We have not yet filed
[https://issues.apache.org/jira/browse/PIG JIRAs] for some of these
because they are still in the vague idea stage. As they become more
concrete, [https://issues.apache.org/jira/browse/PIG JIRAs] will be
filed for them.
We welcome contributors to take on one of these projects. If you would
like to do so, please file a JIRA (if one does not already exist for the
project) with a proposed solution. Pig's committers will work with you
from there to help refine your solution. Once a solution is agreed upon,
you can begin implementation.
If you see a project here that you would like to see Pig implement but
you are not in a position to implement the solution right now, feel free
to vote for the project by adding your name to the list of supporters.
This will help contributors looking for a project to select one that
will benefit many users.
If you would like to propose a project for Pig, feel free to add to this
list. If it is a smaller project, or something you plan to begin work on
immediately, filing a [https://issues.apache.org/jira/browse/PIG JIRA]
is a better route.
|| Category || Project || JIRA || Proposed By || Votes For ||
|| Execution || Pig currently executes scripts by building a pipeline of pre-built operators and running data through those operators in map reduce jobs. We need to investigate instead having Pig generate Java code specific to a job, then compiling that code and using it to run the map reduce jobs. || || Many conference attendees || gates ||
|| Language || Currently only DISTINCT, ORDER BY, and FILTER are allowed inside FOREACH. All operators should be allowed in FOREACH. (LIMIT is being worked on in [https://issues.apache.org/jira/browse/PIG-741 741].) || || gates || ||
|| Optimization || Speed up comparison of tuples during shuffle for ORDER BY. || [https://issues.apache.org/jira/browse/PIG-659 659] || olgan || ||
|| Optimization || Order by should be changed to not use POPackage to put all of the tuples in a bag on the reduce side, as the bag is just immediately flattened. It can instead work like join does for the last input in the join. || || gates || ||
|| Optimization || Often in a Pig script that produces a chain of MR jobs, the map phases of the 2nd and subsequent jobs do very little. What little they do should be pushed into the preceding reduce and the map replaced by the identity mapper. Initial tests showed that the identity mapper was 50% faster than using a Pig mapper (because Pig uses the loader to parse out tuples even if the map itself is empty). || [https://issues.apache.org/jira/browse/PIG-480 480] || olgan || gates ||
|| Optimization || Use hand-crafted calls to do string to integer or float conversions. Initial tests showed these could be done about 8x faster than the standard Java library conversions (for example Integer.parseInt() and Float.parseFloat()). || [https://issues.apache.org/jira/browse/PIG-482 482] || olgan || gates ||
|| Optimization || Currently Pig always samples for an ORDER BY to determine how to partition, and then runs another job to do the sort. For small enough inputs, it should just sort with a single reducer. || [https://issues.apache.org/jira/browse/PIG-483 483] || olgan || ||
|| Optimization || In many cases data to be joined is already sorted and partitioned on the same key. Pig needs to be able to take advantage of this and do these joins in the map. The join could be done by sampling one input to determine the value of the join key at the beginning of every HDFS block. This would form an index. Then a second MR job can be run with the other input. Based on the key seen in the second input, the appropriate blocks of the first input can also be loaded into the map and the join done. || || gates || ||
|| Optimization || The combiner is not currently used if FILTER is in the FOREACH. In some cases it could still be used. || [https://issues.apache.org/jira/browse/PIG-479 479] || olgan || ||
|| Optimization || Currently, when types of data are declared, Pig inserts a FOREACH immediately after the LOAD that does the conversions. These conversions should be delayed until the field is actually used. || [https://issues.apache.org/jira/browse/PIG-410 410] || olgan || gates ||
|| Optimization || When an order by is not the only operation in a Pig script, it is done in two additional MR jobs. The first job samples using a sampling loader, the second does the sort. The sample is used to construct a partitioner that equally balances the data in the sort. The sampler needs to be changed to be an !EvalFunc instead of a loader. This way a split can be put in the preceding MR job, with the main data being written out and the other part flowing to the sampler func, which can then write out the sample. The final MR job can then be the sort. || || gates || ||
|| Optimization || When an order by is the only operation in a Pig script, it is currently done in 3 MR jobs. The first converts the data to BinStorage format (because the sample loader reads that format), the second samples, and the third sorts. Once the changes mentioned above to make the sampler an !EvalFunc are done, this should be changed to be done in 2 MR jobs instead of 3. || [https://issues.apache.org/jira/browse/PIG-460 460] || gates || ||
|| Optimization || The Pig optimizer should be used to determine when fields in a record are no longer needed and put in FOREACH statements to project out the unnecessary data as early as possible. || [https://issues.apache.org/jira/browse/PIG-466 466] || olgan || ||
|| Optimization || The Pig optimizer needs to call fieldsToRead so that load functions that can do column skipping do it. || || gates || ||
|| Scalability || Pig's default join (symmetric hash) currently depends on being able to fit all of the values for a given join key for one of the inputs into memory. (It does try to spill to disk in the case where it cannot fit them all into memory. In practice this often fails, as it is not good at understanding when memory is low enough that it should spill. Even in the case where it does not fail, spilling to disk and rereading from disk is very slow.) This could be avoided if instances of keys with a large number of values were broken up so that each row set could fit in memory and then shipped to multiple reducers. A sampling pass would need to be done first to determine which keys to break up. || || chris olston || gates ||
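
The sampling pass described in the Scalability row can be sketched roughly as follows. This is only an illustration of the idea, not Pig code: the class name SkewedKeyPlanner, the method planSplits, and the parameters sampleRate and maxRowsInMemory are all hypothetical, and the sketch assumes that the sample is uniform and that memory can be approximated by a row count per key.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: pick the join keys that need to be split across reducers. */
public class SkewedKeyPlanner {

    /**
     * @param sampledKeys     join keys observed in a uniform sample of one input
     * @param sampleRate      fraction of the input that was sampled (0 < rate <= 1)
     * @param maxRowsInMemory assumed number of rows for one key that a reducer can hold
     * @return keys whose estimated row count exceeds the limit, mapped to the
     *         number of reducers their rows should be spread over
     */
    public static Map<String, Integer> planSplits(List<String> sampledKeys,
                                                  double sampleRate,
                                                  long maxRowsInMemory) {
        if (sampleRate <= 0.0 || sampleRate > 1.0 || maxRowsInMemory <= 0) {
            throw new IllegalArgumentException("invalid sampleRate or maxRowsInMemory");
        }

        // Count how often each key appears in the sample.
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String key : sampledKeys) {
            counts.merge(key, 1L, Long::sum);
        }

        Map<String, Integer> splits = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            // Scale the sample count up to an estimate for the full input.
            long estimatedRows = Math.round(entry.getValue() / sampleRate);
            if (estimatedRows > maxRowsInMemory) {
                // Ceiling division: enough reducers that each part fits in memory.
                long parts = (estimatedRows + maxRowsInMemory - 1) / maxRowsInMemory;
                splits.put(entry.getKey(), (int) Math.min(parts, Integer.MAX_VALUE));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("a", "a", "a", "b", "a", "c", "a");
        // With a 1% sample and a 200-row limit, only key "a" needs splitting.
        System.out.println(planSplits(sample, 0.01, 200));
    }
}
```

Keys that are absent from the returned map would go to a single reducer as they do today. How the rows of the other input reach every reducer that holds part of a split key is left out of this sketch.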