Re: [jira] Commented: (HADOOP-1162) Record IO: serializing a byte buffer to CSV fails if buffer contains bytes less than 16.

David Bowen Tue, 27 Mar 2007 16:38:20 -0800

>
> Oh, i misunderstood your question. I am switching buffer serialization to 
> just plain bytes except for 5 characters that are escaped (essentially 
> similar to string serialization as if the string were iso-8859-1.)
>   
I'm not sure I follow.  I think string serialization implies UTF-8
encoding?  That means bytes in the range 128-255 would take 2 bytes.  If
we assume that in a byte buffer, all byte values are equally probable,
then the average space for CSV serialization, per byte, would be 1.5
bytes, or 12 bits.  Right?  Actually a little more because you escape
those 5 characters too.
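
A quick check of the 1.5 figure, as a minimal Python sketch (this is not
the Record IO code; the function name is made up for illustration). It
assumes each byte is treated as an ISO-8859-1 character and then written
out as UTF-8, and it ignores the extra cost of the 5 escaped characters:

```python
def utf8_size_of_latin1_bytes(data: bytes) -> int:
    """Size in bytes after reading `data` as ISO-8859-1 and writing UTF-8."""
    return len(data.decode("iso-8859-1").encode("utf-8"))

# One of each byte value, i.e. all values equally probable.
all_bytes = bytes(range(256))

# Values 0-127 take 1 byte each, values 128-255 take 2 bytes each.
total = utf8_size_of_latin1_bytes(all_bytes)   # 128*1 + 128*2 = 384
average = total / len(all_bytes)               # 384 / 256 = 1.5

print(total, average)
```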


So why not use base64 encoding?  The expansion factor would be less
(about 1.33 rather than 1.5), since it essentially uses 8 bits to
represent 6.  Also, its output contains no control characters.  Those
would be a problem with what you're suggesting, I think: we need the CSV
files to be human readable, so you'd have to escape them too.

Or else just leave the encoding as two hex digits per byte.
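
For comparison, a rough sketch in Python of the expansion factors of
these two alternatives, base64 and two hex digits per byte. The input is
a buffer containing each byte value once (that is, assuming all values
are equally likely); the variable names are only illustrative, and the
standard-library encoders stand in for whatever Record IO would use:

```python
import base64
import binascii

# One of each byte value.
data = bytes(range(256))

b64 = base64.b64encode(data)     # 4 output chars per 3 input bytes, padded
hexed = binascii.hexlify(data)   # 2 output chars per input byte

b64_factor = len(b64) / len(data)    # 344 / 256 = 1.34375
hex_factor = len(hexed) / len(data)  # 512 / 256 = 2.0

# Every base64 output byte is printable ASCII (no control characters).
b64_printable = all(0x20 < c < 0x7F for c in b64)

print(b64_factor, hex_factor, b64_printable)
```

The base64 figure is slightly above 4/3 here only because of the final
padding; for long buffers it approaches 1.33.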
