[ https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15324821#comment-15324821 ]
Daniel Templeton commented on MAPREDUCE-6712:
---------------------------------------------

[~He Tianyi], after chewing on this a bit more, I think I see a way that it could be done that wouldn't be too disruptive. What if the first value comes through with the key, and subsequent values come through with a null key, i.e.:

{noformat}
key1\tvalue1
\tvalue2
\tvalue3
key2\tvalue4
\tvalue5
{noformat}

That approach breaks secondary sort and all legacy Streaming reducers, so it would have to be controlled by a config param that is off by default. It's not an unreasonable approach, though. Would that meet your needs? I haven't looked at the Streaming code yet to see whether it's feasible, but I suspect it is.

> Support grouping values for reducer on java-side
> ------------------------------------------------
>
>                 Key: MAPREDUCE-6712
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: He Tianyi
>            Priority: Minor
>
> In hadoop streaming, with TextInputWriter, reducer program will receive each
> line representing a (k, v) tuple from {{stdin}}, in which values with
> identical key is not grouped.
> This brings some inefficiency, especially for runtimes based on interpreter
> (e.g. cpython), coming from:
> A. user program has to compare key with previous one (but on java side,
> records already come to reducer in groups),
> B. user program has to perform {{read}}, then {{find}} or {{split}} on each
> record. even if there are multiple values with identical key,
> C. if length of key is large, apparently this introduces inefficiency for
> caching,
> Suppose we need another InputWriter. But this is not enough, since the
> interface of {{InputWriter}} defined {{writeKey}} and {{writeValue}}, not
> {{writeValues}}. Though we can compare key in custom InputWriter and group
> them, but this is also inefficient. Some other changes are also needed.
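
For illustration only, here is a rough sketch of how a Python streaming reducer might consume the format proposed in the comment above, where only the first record of each group carries the key. Nothing here exists in Hadoop Streaming: the format is only a proposal, the helper name {{read_groups}} is hypothetical, and the sketch assumes a tab separator and that real keys are never empty (an empty key field is read as "same key as before").

```python
def read_groups(lines, sep="\t"):
    """Yield (key, values) pairs from the proposed grouped format.

    Assumption: a line whose key field is empty (it starts with the
    separator) continues the current group; any other line starts a
    new group. Real keys are assumed to be non-empty.
    """
    key = None
    values = []
    for line in lines:
        # Split once on the first separator; no key comparison is needed.
        k, _, v = line.rstrip("\n").partition(sep)
        if k:
            # A non-empty key field opens a new group; flush the old one.
            if key is not None:
                yield key, values
            key, values = k, [v]
        else:
            if key is None:
                raise ValueError("continuation record before any key")
            values.append(v)
    # Flush the final group, if any input was seen.
    if key is not None:
        yield key, values
```

A reducer would then loop with something like {{for key, values in read_groups(sys.stdin)}}. Compared with the current one-key-per-line format, the reducer only checks whether the key field is empty instead of comparing each key against the previous one, which is the cost described in point A of the issue.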
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org