[ 
https://issues.apache.org/jira/browse/HIVE-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017229#comment-13017229
 ] 

Krishna Kumar commented on HIVE-2065:
-------------------------------------

The minor version is needed so that we can still read 6.0 files correctly. To 
recap, 6.0 files store an incorrect record length, which the reader 
recalculates while reading, whereas files from 6.1 onwards have the correct 
record length stored on disk.
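
As a rough illustration of the fix-up, here is a minimal Java sketch of how a 
reader could branch on the minor version. The class, method, and constant 
names are hypothetical and are not taken from the patch. It also assumes that 
the overstatement in 6.0 files is exactly the difference between the 
uncompressed and compressed key sizes, which is what the snippet quoted in the 
issue description suggests.

```java
final class RecordLengthFixup {
    // Hypothetical constants; the real RCFile version fields may be named differently.
    static final int MINOR_VERSION_OVERSTATED = 0; // "6.0": record length overstated
    static final int MINOR_VERSION_CORRECTED = 1;  // "6.1": record length correct on disk

    /**
     * Returns the record length as it actually is on disk.
     * Assumption: a compressed 6.0 file stored (uncompressed key length + value length),
     * whereas the bytes on disk occupy (compressed key length + value length).
     */
    static int actualRecordLength(int minorVersion, boolean compressed,
                                  int storedRecordLength,
                                  int uncompressedKeyLength,
                                  int compressedKeyLength) {
        if (minorVersion >= MINOR_VERSION_CORRECTED || !compressed) {
            // 6.1 onwards, and uncompressed files, already store the right value.
            return storedRecordLength;
        }
        // Swap the uncompressed key size for the compressed one.
        return storedRecordLength - uncompressedKeyLength + compressedKeyLength;
    }

    public static void main(String[] args) {
        // 6.0 compressed file: stored 150 = 100 (uncompressed key) + 50 (value).
        System.out.println(actualRecordLength(MINOR_VERSION_OVERSTATED, true, 150, 100, 40)); // prints 90
        // 6.1 file: the stored value is used as-is.
        System.out.println(actualRecordLength(MINOR_VERSION_CORRECTED, true, 90, 100, 40));   // prints 90
    }
}
```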

[PS. I had suggested bumping up the sequence file version to 7 in a comment 
above, but I think a minor version is a better idea. The layout itself is still 
'kinda sorta' version-6-compatible. For all we know, there may be a sequence 
file version 7, and then sequence file version 7 and rc file version 7 would be 
divergent.]

[PPS. For the sake of completeness of documentation, here are the reasons why 
the layout, even after the current patch, is still short of complete version-6 
compatibility: [a] The KeyBuffer, denoted as the key class, cannot read or 
write itself from/to the disk stream, because reading/writing the 4-byte key 
contents length field and the compression/decompression are done by the 
reader/writer rather than by the KeyBuffer class, and [b] the ValueBuffer, the 
value class, must be compressed as a single unit to be compatible with the 
sequence file reader/writer, but it is actually compressed as several units.] 

> RCFile issues
> -------------
>
>                 Key: HIVE-2065
>                 URL: https://issues.apache.org/jira/browse/HIVE-2065
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Krishna Kumar
>            Assignee: Krishna Kumar
>            Priority: Minor
>         Attachments: HIVE.2065.patch.0.txt, HIVE.2065.patch.1.txt, 
> Slide1.png, proposal.png
>
>
> Some potential issues with RCFile
> 1. Remove unwanted synchronized modifiers on the methods of RCFile. As per 
> yongqiang he, the class is not meant to be thread-safe (and it is not). Might 
> as well get rid of the confusing and performance-impacting lock acquisitions.
> 2. Record Length overstated for compressed files. IIUC, the key compression 
> happens after we have written the record length.
> {code}
>       int keyLength = key.getSize();
>       if (keyLength < 0) {
>         throw new IOException("negative length keys not allowed: " + key);
>       }
>       out.writeInt(keyLength + valueLength); // total record length
>       out.writeInt(keyLength); // key portion length
>       if (!isCompressed()) {
>         out.writeInt(keyLength);
>         key.write(out); // key
>       } else {
>         keyCompressionBuffer.reset();
>         keyDeflateFilter.resetState();
>         key.write(keyDeflateOut);
>         keyDeflateOut.flush();
>         keyDeflateFilter.finish();
>         int compressedKeyLen = keyCompressionBuffer.getLength();
>         out.writeInt(compressedKeyLen);
>         out.write(keyCompressionBuffer.getData(), 0, compressedKeyLen);
>       }
> {code}
> 3. For sequence file compatibility, the field following the record length 
> should be the compressed key length, not the uncompressed key length.
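
Items 2 and 3 of the description above can be sketched together: compress the 
key first, then write a record length based on the compressed size, followed 
directly by the compressed key length. This is only a minimal Java sketch 
under stated assumptions, not the patch itself. The class and method names are 
hypothetical, java.util.zip stands in for whatever codec the file uses, and 
any further header fields (such as the uncompressed key length) are left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;

final class RecordHeaderSketch {
    /** Compresses the key bytes; java.util.zip is a stand-in for the real codec. */
    static byte[] compress(byte[] raw) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflate = new DeflaterOutputStream(buf)) {
            deflate.write(raw);
        } // closing the stream finishes the deflate output
        return buf.toByteArray();
    }

    /**
     * Hypothetical simplified layout:
     *   int recordLength (compressed key length + value length)
     *   int compressedKeyLength
     *   compressed key bytes
     */
    static byte[] encodeRecordHeader(byte[] rawKey, int valueLength) {
        try {
            // Compress before writing any length, so the lengths match what lands on disk.
            byte[] compressedKey = compress(rawKey);
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(compressedKey.length + valueLength); // total record length
            out.writeInt(compressedKey.length);               // compressed key length comes next
            out.write(compressedKey);                         // key
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            // In-memory streams should not fail; rethrow unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of the ordering is that both length fields are derived from the 
bytes actually written, so a reader that skips ahead by the record length 
should not overshoot the next record.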

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
