[GitHub] [orc] autumnust commented on a change in pull request #651: ORC-757: HashTable dictionary

GitBox Wed, 14 Apr 2021 14:53:41 -0700


autumnust commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r613611182




##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+  public StringHashTableDictionary(int initialCapacity) {
+    this(initialCapacity, DEFAULT_LOAD_FACTOR);
+  }
+
+  public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+    this.capacity = initialCapacity;
+    this.loadFactor = loadFactor;
+    this.keyOffsets = new DynamicIntArray(initialCapacity);
+    this.hashArray = initHashArray(initialCapacity);
+    this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+  }
+
+  private DynamicIntArray[] initHashArray(int capacity) {
+    DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+    for (int i = 0; i < capacity; i++) {
+      bucket[i] = new DynamicIntArray();
+    }
+    return bucket;
+  }
+
+  @Override
+  public void visit(Visitor visitor)
+      throws IOException {
+    traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+  }
+
+  private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+    for (DynamicIntArray intArray : hashArray) {
+      for (int i = 0; i < intArray.size() ; i ++) {
+        context.setPosition(intArray.get(i));
+        visitor.visit(context);
+      }
+    }
+  }
+
+  @Override
+  public void clear() {
+    byteArray.clear();
+    keyOffsets.clear();
+    Arrays.fill(hashArray, null);
+  }
+
+  @Override
+  public void getText(Text result, int position) {
+    DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+  }
+
+  @Override
+  public int add(byte[] bytes, int offset, int length) {
+    resizeIfNeeded();
+    newKey.set(bytes, offset, length);
+    return add(newKey);
+  }
+
+  public int add(Text text) {
+    resizeIfNeeded();
+
+    int index = getIndex(text);
+    DynamicIntArray candidateArray = hashArray[index];
+
+    newKey.set(text);
+
+    Text tmpText = new Text();
+    for (int i = 0; i < candidateArray.size(); i++) {
+      getText(tmpText, candidateArray.get(i));
+      if (tmpText.equals(newKey)) {
+        return candidateArray.get(i);
+      }
+    }
+
+    // if making it here, it means no match.
+    int len = newKey.getLength();
+    int currIdx = keyOffsets.size();
+    keyOffsets.add(byteArray.add(newKey.getBytes(), 0, len));
+    candidateArray.add(currIdx);
+    return currIdx;
+  }
+
+  private void resizeIfNeeded() {
+    if (keyOffsets.size() >= threshold) {
+      int oldCapacity = keyOffsets.size();
+      int newCapacity = (oldCapacity << 1) + 1;
+      doResize(newCapacity);
+      this.threshold = (int)Math.min(newCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+    }
+  }
+
+  @Override
+  public int size() {
+    return keyOffsets.size();
+  }
+
+  /**
+   * Compute the hash value and find the corresponding index.
+   *
+   */
+  int getIndex(Text text) {
+    return (text.hashCode() & 0x7FFFFFFF) % capacity;

Review comment:
       @pgaref it was meant to clear the sign bit so that the dividend is always non-negative. But yeah, I will take Owen's suggestion and use the Math library instead for better readability.
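
For readers following the thread, here is a minimal sketch of the two ways to turn a possibly negative hash code into a bucket index. The thread does not say which Math method was suggested; `Math.floorMod` is an assumption here, and the class and method names (`IndexSketch`, `maskedIndex`, `floorModIndex`) are hypothetical and not part of the pull request.

```java
public class IndexSketch {

  // Variant quoted in the diff: clear the sign bit, then take the remainder.
  // The dividend is never negative, so the result is in [0, capacity).
  static int maskedIndex(int hash, int capacity) {
    return (hash & 0x7FFFFFFF) % capacity;
  }

  // Assumed alternative: Math.floorMod returns a result with the sign of the
  // divisor, so for a positive capacity the result is also in [0, capacity).
  static int floorModIndex(int hash, int capacity) {
    return Math.floorMod(hash, capacity);
  }

  public static void main(String[] args) {
    int capacity = 5;
    int[] hashes = {7, -7, Integer.MIN_VALUE};
    for (int h : hashes) {
      // The plain % operator can yield a negative value for a negative hash,
      // which is why neither variant uses it directly.
      System.out.println(h + ": plain % = " + (h % capacity)
          + ", masked = " + maskedIndex(h, capacity)
          + ", floorMod = " + floorModIndex(h, capacity));
    }
  }
}
```

Both variants always produce a valid index, but they can map the same negative hash to different buckets, so a table has to use one of them consistently.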




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
