[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718825#comment-13718825 ]
Prasanth J commented on HIVE-4123: ---------------------------------- {quote}Comments: merge Utils into SerializationUtils. use the zigzag encode/decode in the the SerializationUtils.read/writeVslong move Utils.nextLong to the test code Utils.getTotalBytesRequired should just use long math. (n * numBits + 7) / 8 should work Rename IntegerCompressionReader/Writer to RunLengthIntegerReader/WriterV2 {quote} Done. {quote} Create an interface IntegerReader that has: seek next skip {quote} Added hasNext() to interface as well. {quote} Make RunLengthIntegerReader and RunLengthIntegerReaderV2 implement IntegerReader The TreeReaders should declare the fields as IntegerReader. Each of the startStripe should use the encoding to create the right implementation of IntegerReader. We should do the same with an IntegerWriter interface. Replace fixedBitSizes with static methods in SerializationUtils: static int encodeBitWidth(int n) static int decodeBitWidth(int n) {quote} Done. {quote} Finding the percentiles seems expensive, we should look at an alternative {quote} Done. {quote} Why is the delta blob zigzag encoded? The sign should always be positive or negative for the entire run. {quote} Made the delta base field mandatory, blob is now directly bit packed. {quote} Maybe we could create an enum in the Writer that is the version to write that would look like enum OrcVersion { V0_11, V0_12 } and the StreamFactory could provide the version to the TreeWriters. {quote} Not done (as per your last comment about passing factory object) {quote} I don't see why bitpack reader/writer are more than static methods that read/write to the underlying stream. So I would have expected a method like writeInts(long[] data, int offset, int length, int numBits, OutputStream stream) and the corresponding one for reading. {quote} Added as a separate static method. Can we reuse BitFieldReader/BitFieldWriter which essentially does the same thing (except it deals with ints)? {quote} Utils.bytesToLongBE should take an input stream rather than a byte[]. {quote} Done. {quote} In IntegerCompressionReader: I'd write a method to translate the int into an opcode rather than use ordinal. It is probably worth remembering that you are in a repeat, so that you don't need to copy the value N times in short repeat. {quote} Done. {quote} It may be easier to loop through the base values and then run through the patches. You might even do three loops: unpack the main values, unpack the patches, add the base to each value. {quote} My initial implementation was running through 3 loops. But later I refactored it to do in a single loop. I think this current patch removed some complexity (removed zigzag and changed bitpacking). {quote} For patched based only the base is zigzag encoded. The rest of the values are always positive. For delta only the base and base delta are zigzag encoded. {quote} Good catch. Updated the patch. {quote} In IntegerCompressionWriter: You should give more comments about the patched base encoding. Instead of sorting for the percentiles, you could keep a count of how many values use each number of bits. {quote} Done. Nice idea! {quote} Replace the commented out printlns with LOG.debug surrounded by LOG.ifDebugEnabled flush should use if/then/else to prevent writing the data twice the constructor should probably call clear rather than risk having the default values be different in write, just copy the data with system.arraycopy instead of cloning the array {quote} Done. {quote} write should track whether the values are monotonically increasing or decreasing so that we know if delta applies there is a lot of duplication of effort in determine encoding {quote} write primarily deals with cutting the runs (determining the scope). There was some redundancy that I removed in the current patch. Also tracking min/max was wrong with the earlier which is fixed in the new patch. Earlier as and when a value is buffered min/max are updated. But this lead to wrong output in some cases. For example: 2 3 4 5 6 1 1 1 sequence has min value of 1, but this 1 is part of short repeat sequence. This same min value was used for initial delta run as well. min/max/monotonicity/delta computation/percentile are determined while iterating through the buffered values. {quote} if the sequence is both increasing and decreasing, it is constant and we should either use short literal or delta depending on the length delta encoding should return before doing the percentile work {quote} Currently, delta encoding returns before percentile computation. Short repeats are determined when buffering values. All other encodings are determined in determineEncoding(). {quote} How much unit test coverage do you have of the new code? {quote} I have unit tests to cover the basic cases (runs that covers all the encodings) and unit tests for edges cases of each encodings. I also have a separate synthetic datagen utility that generates sorted/signed/unsigned/sparse/dense data. Should I add that datagen utility to unit tests as well? {quote} Have you run the encoder/decoder round trip over the github data to test it? {quote} Yes. I ran it over 237 columns of github data. Apart from github data, I ran it over aol querylog timestamp, Nasdaq stocks data, twitter api/search id dataset and few columns from httparchive dataset. I am still doing some more refactoring/code cleaning. I will post another patch in case of any changes. This patch covers most of the code review changes. > The RLE encoding for ORC can be improved > ---------------------------------------- > > Key: HIVE-4123 > URL: https://issues.apache.org/jira/browse/HIVE-4123 > Project: Hive > Issue Type: New Feature > Components: File Formats > Reporter: Owen O'Malley > Assignee: Prasanth J > Attachments: HIVE-4123.1.git.patch.txt, HIVE-4123.2.git.patch.txt, > ORC-Compression-Ratio-Comparison.xlsx > > > The run length encoding of integers can be improved: > * tighter bit packing > * allow delta encoding > * allow longer runs -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira