lwhite1 commented on code in PR #14213:
URL: https://github.com/apache/arrow/pull/14213#discussion_r979048882


##########
docs/source/java/vector.rst:
##########
@@ -268,6 +268,82 @@ For example, the code below shows how to build a 
:class:`ListVector` of int's us
      }
   }
 
+Dictionary Encoding
+===================
+
+A :class:`FieldVector` can be dictionary encoded for performance or improved 
memory efficiency. While this is most often done with :class:`VarCharVector`, 
nearly any type of vector might be encoded if there are many values, but few 
unique values.
+
+There are a few steps involved in the encoding process:
+
+1. Create a regular, un-encoded vector and populate it
+2. Create a dictionary vector of the same type as the un-encoded vector. This 
vector must have the same values, but each unique value in the un-encoded 
vector need appear here only once.
+3. Create a :class:`Dictionary`. It will contain the dictionary vector, plus a 
:class:`DictionaryEncoding` object that holds the encoding's metadata and 
settings values.
+4. Create a :class:`DictionaryEncoder`.
+5. Call the encode() method on the :class:`DictionaryEncoder` to produce an 
encoded version of the original vector.
+6. (Optional) Call the decode() method on the encoded vector to re-create the 
original values.
+
+The encoded values will be integers. Depending on how many unique values you 
have, you can use either TinyIntVector, SmallIntVector, or IntVector to hold 
them. You specify the type when you create your :class:`DictionaryEncoding` 
instance. You might wonder where those integers come from: the dictionary 
vector is a regular vector, so the value's index position in that vector is 
used as its encoded value.

Review Comment:
   I think it's nominally supported, but I'm not sure we fully support more 
than Integer.MAX_VALUE values in a single vector. 
   
   For example, this method seems to support BigInt, but that branch of the if 
statement can never execute: valueCount is an int, so the check against 
Integer.MAX_VALUE is always true.  
   
   ```
     /**
      * Get the indexType according to the dictionary vector valueCount.
      * @param valueCount dictionary vector valueCount.
      * @return index type.
      */
     public static ArrowType.Int getIndexType(int valueCount) {
       Preconditions.checkArgument(valueCount >= 0);
       if (valueCount <= Byte.MAX_VALUE) {
         return new ArrowType.Int(8, true);
       } else if (valueCount <= Character.MAX_VALUE) {
         return new ArrowType.Int(16, true);
       } else if (valueCount <= Integer.MAX_VALUE) {
         return new ArrowType.Int(32, true);
       } else {
         return new ArrowType.Int(64, true);
       }
     }
   ```
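   
   To make the point concrete, here is a minimal standalone sketch (not the 
Arrow code itself). `IndexWidthSketch`, `bitWidthForInt`, and `bitWidthForLong` 
are hypothetical names, and the methods return a plain bit width instead of an 
`ArrowType.Int`. The `long` overload is only an assumption about what a 
reachable 64-bit branch could look like, not a proposal for the actual API.
   
   ```java
   public class IndexWidthSketch {

     // Same branch structure as the quoted method, but returning only the bit width.
     static int bitWidthForInt(int valueCount) {
       if (valueCount < 0) {
         throw new IllegalArgumentException("valueCount must be non-negative");
       }
       if (valueCount <= Byte.MAX_VALUE) {
         return 8;
       } else if (valueCount <= Character.MAX_VALUE) {
         return 16;
       } else if (valueCount <= Integer.MAX_VALUE) {
         // Always true here: no int can exceed Integer.MAX_VALUE.
         return 32;
       } else {
         // Never reached while the parameter is an int.
         return 64;
       }
     }

     // Hypothetical variant: with a long parameter the last branch becomes reachable.
     static int bitWidthForLong(long valueCount) {
       if (valueCount < 0) {
         throw new IllegalArgumentException("valueCount must be non-negative");
       }
       if (valueCount <= Byte.MAX_VALUE) {
         return 8;
       } else if (valueCount <= Character.MAX_VALUE) {
         return 16;
       } else if (valueCount <= Integer.MAX_VALUE) {
         return 32;
       } else {
         return 64;
       }
     }

     public static void main(String[] args) {
       // Largest possible int argument still maps to a 32-bit index.
       System.out.println(bitWidthForInt(Integer.MAX_VALUE));
       // One past Integer.MAX_VALUE is only expressible as a long.
       System.out.println(bitWidthForLong(Integer.MAX_VALUE + 1L));
     }
   }
   ```
   
   With an `int` parameter the largest possible argument still selects the 
32-bit branch, so the 64-bit case only becomes reachable if the count is 
carried in a wider type.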
   


