[jira] Commented: (HADOOP-1722) Make streaming to handle non-utf8 byte array

eric baldeschwieler (JIRA) Fri, 16 Nov 2007 01:59:43 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542991
 ]


eric baldeschwieler commented on HADOOP-1722:
---------------------------------------------

I think arkady's point is much closer to the mark than this quoting proposal, 
which I think is going the wrong way!

There are two interfaces here - that between map & reduce and that into map and 
out of reduce.  I think they deserve different handling.

1) map in & reduce out - Should by default just consume bytes and produce 
bytes.  The framework should do no interpretation or quoting.  It should not 
try to break the output into lines, keys & values, etc.  It is just a byte 
stream.  This will allow true binary output with zero hassle.  Some thought on 
splits is clearly needed...

2) map out & reduce in - Here we clearly need keys and values.  But I think 
quoting might be the wrong direction.  It should certainly not be the default.  
I think we should consider just providing an option that specifies that a new 
binary format will be used here.  Maybe a 4 byte length followed by a binary 
key, followed by a 4 byte length and then a binary value?  Maybe with a record 
terminator for sanity checking?
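
A rough sketch of what such a length-prefixed record format could look like, 
purely for illustration.  The big-endian byte order, the choice of '\n' as the 
terminator byte, and the names write_record / read_record are all assumptions 
made here; none of this is an agreed format or existing streaming code.

```python
import io
import struct

# Assumption: a single newline byte as the sanity-check terminator.
TERMINATOR = b"\n"


def write_record(out, key: bytes, value: bytes) -> None:
    """Write 4-byte length + key, 4-byte length + value, then the terminator."""
    out.write(struct.pack(">I", len(key)))    # assumed big-endian unsigned length
    out.write(key)
    out.write(struct.pack(">I", len(value)))
    out.write(value)
    out.write(TERMINATOR)


def _read_exact(inp, n: int) -> bytes:
    """Read exactly n bytes or fail; a short read means a truncated record."""
    data = inp.read(n)
    if len(data) != n:
        raise EOFError("truncated record")
    return data


def read_record(inp):
    """Return (key, value), or None at a clean end of stream."""
    header = inp.read(4)
    if not header:
        return None
    if len(header) != 4:
        raise EOFError("truncated record")
    (key_len,) = struct.unpack(">I", header)
    key = _read_exact(inp, key_len)
    (value_len,) = struct.unpack(">I", _read_exact(inp, 4))
    value = _read_exact(inp, value_len)
    if _read_exact(inp, 1) != TERMINATOR:
        raise ValueError("missing record terminator")
    return key, value


# Round trip: tabs, newlines and non-UTF-8 bytes need no quoting at all.
buf = io.BytesIO()
write_record(buf, b"k\t1", b"\xff\x00\n")
buf.seek(0)
first = read_record(buf)
```

Because the lengths say where each field ends, the key and value may contain 
any byte, and the terminator is only checked, never searched for.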

----

Two observations:

1) Adding quoting by default will break all kinds of programs that work with 
streaming today.  This is undesirable.  We should add an option, not change the 
default behavior.

2) Streaming should not use utf8 anywhere!  It should assume that it is 
processing a stream of bytes that contains certain signal bytes, '\n' and '\t'.  
I think we all agree on this.  Treating the byte stream as a character stream 
just confuses things.
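
To make the second observation concrete, here is a small sketch of splitting 
records purely on the signal bytes, with no character decoding anywhere.  The 
function names are made up for this example, and treating a line without a tab 
as a key with an empty value is an assumption of the sketch.

```python
import io


def split_line(line: bytes):
    """Split one line into (key, value) at the first tab byte."""
    if line.endswith(b"\n"):
        line = line[:-1]                 # drop only the record separator
    key, _sep, value = line.partition(b"\t")
    return key, value                    # no tab: value is b"" (assumption)


def iter_records(stream):
    """Yield (key, value) byte pairs from a binary stream, one per line."""
    for line in stream:                  # binary streams split lines on b"\n"
        yield split_line(line)


# b"caf\xe9" is Latin-1 and not valid UTF-8, yet it passes through untouched.
# A real streaming process would read from sys.stdin.buffer instead.
sample = io.BytesIO(b"caf\xe9\t1\nno-tab\n")
records = list(iter_records(sample))
```

Only the first tab is treated as a separator, so a value may itself contain 
tab bytes.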



> Make streaming to handle non-utf8 byte array
> --------------------------------------------
>
>                 Key: HADOOP-1722
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1722
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Runping Qi
>            Assignee: Christopher Zimmerman
>
> Right now, the streaming framework expects the outputs of the stream process 
> (mapper or reducer) to be line-oriented UTF-8 text. This limit makes it 
> impossible to use those programs whose outputs may be non-UTF-8 
> (international encoding, or maybe even binary data). Streaming can overcome 
> this limit by introducing a simple encoding protocol. For example, it can 
> allow the mapper/reducer to hex-encode its keys/values, and the framework 
> decodes them on the Java side.
> This way, as long as the mapper/reducer executables follow this encoding 
> protocol, they can output arbitrary byte arrays and the streaming framework 
> can handle them.
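
For reference, the hex-encoding idea in the issue description could look 
roughly like the sketch below.  It is illustrative only: the helper names are 
hypothetical, and the decode step is written in Python here although the 
description places it on the Java side of the framework.

```python
def encode_record(key: bytes, value: bytes) -> bytes:
    """Hex-encode key and value so the line holds only ASCII, tab and newline."""
    return key.hex().encode("ascii") + b"\t" + value.hex().encode("ascii") + b"\n"


def decode_record(line: bytes):
    """Reverse encode_record: split on the tab, then hex-decode both halves."""
    if line.endswith(b"\n"):
        line = line[:-1]
    key_hex, _sep, value_hex = line.partition(b"\t")
    return (bytes.fromhex(key_hex.decode("ascii")),
            bytes.fromhex(value_hex.decode("ascii")))


# A key containing NUL, tab and newline survives the round trip.
encoded = encode_record(b"\x00\t\n", b"\xff")
decoded = decode_record(encoded)
```

The cost is that every payload byte becomes two bytes on the wire, and every 
mapper and reducer has to implement the encoding.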

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
