Hadoop Streaming should (optionally) sort on secondary key
----------------------------------------------------------
Key: HADOOP-765
URL: http://issues.apache.org/jira/browse/HADOOP-765
Project: Hadoop
Issue Type: Improvement
Reporter: arkady borkovsky
This is related to HADOOP-485.
As described in HADOOP-485 and HADOOP-686, many algorithms need the values to
arrive in a specific order.
(The most prominent example is JOIN: in a MapReduce implementation of JOIN, the
value has to indicate which "table" the record comes from, and it is very
useful to have the records from the smaller "table" come first.)
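To illustrate why the ordering matters for JOIN, here is a minimal sketch of the reduce side, assuming each record carries a table tag and arrives sorted by key and then by tag. The function name `join_sorted` and the tag values "0" (smaller table) and "1" (larger table) are hypothetical, chosen only for this example.

```python
from itertools import groupby

def join_sorted(records):
    """Join records given as (key, tag, payload) tuples.

    Assumption: records are sorted by (key, tag), and tag "0" marks the
    smaller table, so its records arrive first within each key.
    """
    for key, group in groupby(records, key=lambda r: r[0]):
        small = []  # buffer only the smaller table's records for this key
        for _, tag, payload in group:
            if tag == "0":
                small.append(payload)
            else:
                # Records from the larger table are streamed, never buffered.
                for s in small:
                    yield (key, s, payload)
```

Because the smaller table's records come first, only they need to be held in memory; without the ordering, the reducer would have to buffer every value for a key before emitting anything.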
(a) Once HADOOP-485 is implemented, it should be propagated to Streaming so
that sorting by a secondary key is done without writing any code, just by
specifying a parameter.
(b) Alternatively, since Hadoop Streaming records are lines of text with the
key(s) separated from the value by a tab, a simple hack of running a sort on
the MERGED input of reduce will work fine. This may be a quite efficient and
easy way to implement this important feature without relying on HADOOP-485.
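Option (b) could look roughly like the sketch below: a filter placed in front of the reducer that sorts each run of equal keys by the text after the first tab. This is only an illustration under stated assumptions, not an existing Streaming feature; the names `split_key`, `value_part`, and `sort_within_keys` are hypothetical, and the input is assumed to be already grouped by key.

```python
from itertools import groupby

def split_key(line):
    """Return the text before the first tab (the Streaming key)."""
    return line.rstrip("\n").split("\t", 1)[0]

def value_part(line):
    """Return the text after the first tab (empty if there is no tab)."""
    return line.rstrip("\n").partition("\t")[2]

def sort_within_keys(lines):
    """Yield lines with each run of equal keys sorted by its value part.

    Assumption: the input is already grouped by key, as the merged reduce
    input is, so only one key group is held in memory at a time.
    """
    for _, group in groupby(lines, key=split_key):
        yield from sorted(group, key=value_part)
```

In a Streaming job this would be applied to the reducer's standard input, for example `sys.stdout.writelines(sort_within_keys(sys.stdin))`. The sort is a plain string comparison, so numeric secondary keys would need padding or a different sort key.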