[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side

Daniel Templeton (JIRA) Fri, 10 Jun 2016 10:06:06 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15324821#comment-15324821
 ]


Daniel Templeton commented on MAPREDUCE-6712:
---------------------------------------------

[~He Tianyi], after chewing on this a bit more, I think I see a way that it 
could be done that wouldn't be too disruptive.  What if the first value comes 
through with the key, and subsequent values come through with a null key, i.e.:

{noformat}
key1\tvalue1
\tvalue2
\tvalue3
key2\tvalue4
\tvalue5
{noformat}

That approach breaks secondary sort and all legacy Streaming reducers, so it 
would have to be controlled by a config param that is off by default.  It's not 
an unreasonable approach, though.  Would that meet your needs?  I haven't 
looked at the Streaming code yet to see whether it's feasible, but I suspect it 
is.
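
For illustration only, here is a rough sketch of how a reducer might consume that format. It is written in Python because CPython is the runtime named in the issue. {{read_groups}} is a hypothetical helper, not part of Streaming, and the sketch assumes a tab separator and that an empty key field always means "same key as the previous record".

```python
from typing import Iterable, Iterator, List, Tuple


def read_groups(lines: Iterable[str], sep: str = "\t") -> Iterator[Tuple[str, List[str]]]:
    """Yield (key, values) from records where only the first record of
    each group carries the key and the rest have an empty key field.

    Hypothetical helper: the wire format is the one proposed in this
    comment, not something Streaming emits today.
    """
    key = None
    values: List[str] = []
    for line in lines:
        k, _, v = line.rstrip("\n").partition(sep)
        if k:
            # A non-empty key field starts a new group; flush the old one.
            if key is not None:
                yield key, values
            key, values = k, [v]
        elif key is None:
            # Continuation record with no group to attach it to.
            raise ValueError("stream starts with an empty key field")
        else:
            # Empty key field: another value for the current key.
            values.append(v)
    if key is not None:
        yield key, values
```

A reducer would call {{read_groups(sys.stdin)}} and never compare keys itself. One limitation of this sketch: a record whose real key is the empty string cannot be told apart from a continuation record.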

> Support grouping values for reducer on java-side
> ------------------------------------------------
>
>                 Key: MAPREDUCE-6712
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: He Tianyi
>            Priority: Minor
>
> In Hadoop Streaming with TextInputWriter, the reducer program receives one 
> line per (k, v) tuple on {{stdin}}; values with identical keys are not 
> grouped.
> This is inefficient, especially for interpreter-based runtimes (e.g. 
> CPython), because:
> A. the user program has to compare each key with the previous one, even 
> though on the Java side records already reach the reducer in groups,
> B. the user program has to perform a {{read}}, then a {{find}} or {{split}}, 
> on each record, even when multiple values share the same key,
> C. if keys are long, this is also inefficient for caching.
> We would need another InputWriter, but that alone is not enough: the 
> {{InputWriter}} interface defines {{writeKey}} and {{writeValue}}, not 
> {{writeValues}}. A custom InputWriter could compare keys and group records 
> itself, but that is also inefficient. Some other changes are also needed.
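
For context, points A and B in the description correspond to reducer code along the following lines. This is a minimal Python sketch with hypothetical helper names ({{split_record}}, {{group_by_key}}), assuming tab-separated records that arrive sorted by key. It is not taken from the issue or from Streaming itself.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple


def split_record(line: str, sep: str = "\t") -> Tuple[str, str]:
    """Split one 'key<sep>value' line; this runs for every record (point B)."""
    key, _, value = line.rstrip("\n").partition(sep)
    return key, value


def group_by_key(lines: Iterable[str], sep: str = "\t") -> Iterator[Tuple[str, List[str]]]:
    """Regroup sorted records by comparing each key with the previous one
    (point A), which groupby does internally."""
    records = (split_record(line, sep) for line in lines)
    for key, group in groupby(records, key=lambda kv: kv[0]):
        yield key, [value for _, value in group]
```

Every record pays for a split and a key comparison here, even though the Java side already knows where each group begins and ends.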



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

