lidavidm commented on code in PR #14213: URL: https://github.com/apache/arrow/pull/14213#discussion_r979055429
##########
docs/source/java/vector.rst:
##########
@@ -268,6 +268,82 @@ For example, the code below shows how to build a :class:`ListVector` of int's us
 }
 }
+Dictionary Encoding
+===================
+
+A :class:`FieldVector` can be dictionary encoded for performance or improved memory efficiency. While this is most often done with :class:`VarCharVector`, nearly any type of vector might be encoded if there are many values, but few unique values.
+
+There are a few steps involved in the encoding process:
+
+1. Create a regular, un-encoded vector and populate it
+2. Create a dictionary vector of the same type as the un-encoded vector. This vector must have the same values, but each unique value in the un-encoded vector need appear here only once.
+3. Create a :class:`Dictionary`. It will contain the dictionary vector, plus a :class:`DictionaryEncoding` object that holds the encoding's metadata and settings values.
+4. Create a :class:`DictionaryEncoder`.
+5. Call the encode() method on the :class:`DictionaryEncoder` to produce an encoded version of the original vector.
+6. (Optional) Call the decode() method on the encoded vector to re-create the original values.
+
+The encoded values will be integers. Depending on how many unique values you have, you can use either TinyIntVector, SmallIntVector, or IntVector to hold them. You specify the type when you create your :class:`DictionaryEncoding` instance. You might wonder where those integers come from: the dictionary vector is a regular vector, so the value's index position in that vector is used as its encoded value.
Review Comment:
It's probably easiest to just use double backticks to put the class names in a monospace font. What I'm saying is that the `:class:` markup won't actually link them properly, since there's no integration between Javadoc and Sphinx.

Review Comment:
(You should probably have seen build warnings to this effect.)
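
For reference, the last paragraph of the quoted hunk says that a value's encoded form is its index position in the dictionary vector. A minimal sketch of that idea follows, using plain `java.util` lists rather than Arrow vectors. The class and method names (`DictionaryEncodingSketch`, `buildDictionary`, `encode`, `decode`) are hypothetical and are not part of the Arrow API; the sketch only illustrates the index-as-code mapping, not how Arrow's `DictionaryEncoder` is implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical illustration only: plain lists stand in for Arrow vectors.
public class DictionaryEncodingSketch {

    // Collect each unique value once, in first-seen order.
    static List<String> buildDictionary(List<String> values) {
        return new ArrayList<>(new LinkedHashSet<>(values));
    }

    // Replace every value with its index position in the dictionary.
    static List<Integer> encode(List<String> values, List<String> dictionary) {
        List<Integer> encoded = new ArrayList<>();
        for (String value : values) {
            int index = dictionary.indexOf(value);
            if (index < 0) {
                throw new IllegalArgumentException("Value not in dictionary: " + value);
            }
            encoded.add(index);
        }
        return encoded;
    }

    // Look each index up in the dictionary to recover the original values.
    static List<String> decode(List<Integer> encoded, List<String> dictionary) {
        List<String> decoded = new ArrayList<>();
        for (int index : encoded) {
            decoded.add(dictionary.get(index));
        }
        return decoded;
    }

    public static void main(String[] args) {
        List<String> unencoded = List.of("hello", "goodbye", "hello", "hello");
        List<String> dictionary = buildDictionary(unencoded);
        List<Integer> encoded = encode(unencoded, dictionary);

        System.out.println("dictionary: " + dictionary);
        System.out.println("encoded:    " + encoded);
        System.out.println("decoded:    " + decode(encoded, dictionary));
    }
}
```

With many values and few unique ones, the encoded list holds small integers instead of repeated strings, which is the memory saving the hunk describes.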