Hadoop Streaming should (optionally) sort on secondary key
----------------------------------------------------------

                 Key: HADOOP-765
                 URL: http://issues.apache.org/jira/browse/HADOOP-765
             Project: Hadoop
          Issue Type: Improvement
            Reporter: arkady borkovsky


This is related to HADOOP-485.

As described in HADOOP-485 and HADOOP-686, many algorithms need the values to 
arrive in a specific order.  
(The most prominent example is JOIN: in a MapReduce implementation of JOIN, the 
value has to indicate which "table" the record comes from, and it is very useful 
to have the records from the smaller "table" come first.)
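To illustrate why the ordering matters for JOIN, here is a minimal Python 
sketch of the reduce side. It is not code from Hadoop or from this issue. The 
tags "A" (smaller table) and "B" (larger table) and the function names are 
hypothetical. The sketch assumes the records reach the reducer sorted by key 
and, within each key, with all "A" records ahead of the "B" records.

```python
from itertools import groupby


def join_group(tagged_values):
    """Join one key's values.

    Assumes every record tagged "A" (the smaller table, a hypothetical tag)
    arrives before any record tagged "B", which is what a secondary sort
    on the tag would guarantee.
    """
    small = []  # only the smaller table's records are buffered
    for tag, payload in tagged_values:
        if tag == "A":
            small.append(payload)
        else:
            # Records from the larger table are streamed, never buffered.
            for left in small:
                yield left, payload


def reduce_join(records):
    """records: iterable of (key, tag, payload), sorted by (key, tag)."""
    for key, group in groupby(records, key=lambda r: r[0]):
        tagged = ((tag, payload) for _, tag, payload in group)
        for left, right in join_group(tagged):
            yield key, left, right
```

With this ordering the reducer holds only the smaller table's records for a 
key in memory. Without it, both sides would have to be buffered.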

(a) Once HADOOP-485 is implemented, it should be propagated to Streaming so 
that sorting by a secondary key is done without writing any code, just by 
specifying a parameter.

(b) Alternatively, as Hadoop Streaming records are lines of text with the key(s) 
separated from the value by a tab, a simple hack of running a sort on the 
MERGED input of reduce will work fine. This may be a quite efficient and easy 
way to implement this important feature without relying on HADOOP-485.
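Option (b) can be sketched in Python as a filter placed in front of the 
reducer. This is only an illustration under stated assumptions, not the 
proposed implementation. It assumes the merged input is already grouped by 
key, treats everything after the first tab as the value, and sorts the values 
of one key in memory. The names split_line and secondary_sort are 
hypothetical.

```python
from itertools import groupby


def split_line(line):
    """Split a Streaming record into (key, value) at the first tab."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


def secondary_sort(lines):
    """Yield records with values sorted within each key.

    Assumes `lines` is the merged reduce input, i.e. already grouped by key.
    Each key's values are buffered and sorted as plain strings.
    """
    parsed = (split_line(line) for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        for value in sorted(value for _, value in group):
            yield f"{key}\t{value}"
```

For example, passing the lines "a<TAB>2", "a<TAB>1", "b<TAB>z" through 
secondary_sort yields "a<TAB>1", "a<TAB>2", "b<TAB>z". Keys with very many 
values would need an external sort instead of the in-memory sorted() call.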


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
