[ https://issues.apache.org/jira/browse/HADOOP-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542991 ]
eric baldeschwieler commented on HADOOP-1722:
---------------------------------------------

I think arkady's point is much more to the point than this quoting proposal, which I think is going the wrong way!

There are two interfaces here - that between map & reduce and that into map and out of reduce. I think they deserve different handling.

1) map in & reduce out - Should by default just consume bytes and produce bytes. The framework should do no interpretation or quoting. It should not try to break the output into lines, keys & values, etc. It is just a byte stream. This will allow true binary output with zero hassle. Some thought on splits is clearly needed...

2) map out & reduce in - Here we clearly need keys and values. But I think quoting might be the wrong direction. It should certainly not be the default. I think we should consider just providing an option that specifies that a new binary format will be used here. Maybe a 4 byte length followed by a binary key followed by a 4 byte length and then a binary value? Maybe with a record terminator for sanity checking?

----
Two observations:

1) Adding quoting by default will break all kinds of programs that work with streaming today. This is undesirable. We should add an option, not change the default behavior.

2) Streaming should not use utf8 anywhere! It should assume that it is processing a stream of bytes that contains certain signal bytes '\n' and '\t'. I think we all agree on this. Treating the byte stream as a character stream just confuses things.

> Make streaming to handle non-utf8 byte array
> --------------------------------------------
>
>                 Key: HADOOP-1722
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1722
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Runping Qi
>            Assignee: Christopher Zimmerman
>
> Right now, the streaming framework expects the output of the stream process
> (mapper or reducer) to be line-oriented UTF-8 text.
> This limit makes it impossible to use those programs whose outputs may be non-UTF-8
> (international encoding, or maybe even binary data). Streaming can overcome this limit
> by introducing a simple encoding protocol. For example, it can allow the mapper/reducer
> to hex-encode its keys/values, and the framework decodes them on the Java side.
> This way, as long as the mapper/reducer executables follow this encoding protocol,
> they can output arbitrary byte arrays and the streaming framework can handle them.
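
For illustration only, here is a rough sketch of the length-prefixed record format floated in the comment for map output and reduce input: a 4 byte length, the key bytes, a 4 byte length, the value bytes, then a terminator for sanity checking. The comment does not pin down byte order, signedness, or the terminator byte, so the big-endian unsigned lengths and the '\n' terminator below are assumptions, and the function names are hypothetical. Nothing here is taken from the streaming code itself.

```python
import io
import struct

# Assumption: 4-byte big-endian unsigned lengths. The comment only says
# "a 4 byte length" and does not fix byte order or signedness.
LENGTH = struct.Struct(">I")
# Assumption: a single newline byte as the sanity-check terminator.
TERMINATOR = b"\n"


def write_record(out, key: bytes, value: bytes) -> None:
    """Write one pair as <key len><key><value len><value><terminator>."""
    out.write(LENGTH.pack(len(key)))
    out.write(key)
    out.write(LENGTH.pack(len(value)))
    out.write(value)
    out.write(TERMINATOR)


def _read_exact(src, n: int) -> bytes:
    """Read exactly n bytes or raise EOFError if the stream ends early."""
    data = src.read(n)
    if len(data) != n:
        raise EOFError("stream ended in the middle of a record")
    return data


def read_records(src):
    """Yield (key, value) byte pairs until a clean end of stream."""
    while True:
        header = src.read(LENGTH.size)
        if not header:
            return  # clean EOF on a record boundary
        if len(header) != LENGTH.size:
            raise EOFError("stream ended in the middle of a key length")
        (key_len,) = LENGTH.unpack(header)
        key = _read_exact(src, key_len)
        (value_len,) = LENGTH.unpack(_read_exact(src, LENGTH.size))
        value = _read_exact(src, value_len)
        # The terminator carries no data; it only catches framing errors.
        if _read_exact(src, len(TERMINATOR)) != TERMINATOR:
            raise ValueError("missing record terminator")
        yield key, value


if __name__ == "__main__":
    buf = io.BytesIO()
    # Keys and values may contain '\t', '\n', or non-UTF-8 bytes.
    write_record(buf, b"k\t1", b"\xff\x00\n")
    buf.seek(0)
    for key, value in read_records(buf):
        print(key, value)
```

Because the lengths delimit each field, keys and values can hold '\t', '\n', or arbitrary binary data without any quoting, which is the point of preferring this over a quoting scheme.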