[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680313#comment-16680313
 ] 

ASF GitHub Bot commented on ORC-420:
------------------------------------

xndai commented on a change in pull request #326: ORC-420: [C++] Implement 
string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r231735751
 
 

 ##########
 File path: c++/src/ColumnWriter.cc
 ##########
 @@ -831,6 +843,137 @@ namespace orc {
     dataStream->recordPosition(rowIndexPosition.get());
   }
 
+  /**
+   * Implementation of increasing sorted string dictionary
+   */
+  class StringDictionary {
+  public:
+    struct DictEntry {
+      DictEntry(const char * str, size_t len):data(str),length(len) {}
+      const char * data;
+      size_t length;
+    };
+
+    StringDictionary():totalLength(0) {}
+
+    // insert a new string into dictionary, return its insertion order
+    size_t insert(const char * data, size_t len);
+
+    // write dictionary data & length to output buffer
+    void flush(AppendOnlyBufferedStream * dataStream,
+               RleEncoder * lengthEncoder) const;
+
+    // reorder input index buffer from insertion order to dictionary order
+    void reorder(std::vector<int64_t>& idxBuffer) const;
+
+    // get dict entries in insertion order
+    void getEntriesInInsertionOrder(std::vector<const DictEntry *>&) const;
+
+    // return count of entries
+    size_t size() const;
+
+    // return total length of strings in the dictionary
+    uint64_t length() const;
+
+    void clear();
+
+  private:
+    struct LessThan {
+      bool operator()(const DictEntry& left, const DictEntry& right) const {
+        int ret = memcmp(left.data, right.data,
+                         std::min(left.length, right.length));
+        if (ret != 0) {
+          return ret < 0;
+        }
+        return left.length < right.length;
+      }
+    };
+
+    std::map<DictEntry, size_t, LessThan> dict;
+    std::vector<std::vector<char>> data;
+    uint64_t totalLength;
+
+    // use friend class here to avoid being bothered by const function calls
+    friend class StringColumnWriter;
+    friend class CharColumnWriter;
+    friend class VarCharColumnWriter;
+    // store indexes of insertion order in the dictionary for not-null rows
+    std::vector<int64_t> idxInDictBuffer;
 
 Review comment:
   Does this have to be int64, or is uint32 enough?
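
   For context on what this buffer holds, here is a minimal sketch of the insertion-order to dictionary-order remapping that `insert` and `reorder` in the diff describe. `SketchDictionary` and `demoReorder` are hypothetical names, and the sketch owns its keys as `std::string` rather than pointing into a side buffer as the patch does; it is an illustration under those assumptions, not the code in this pull request.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the StringDictionary in the diff.
class SketchDictionary {
 private:
  // Byte-wise ordering with the shorter string first on a common prefix,
  // mirroring the LessThan comparator shown in the diff.
  struct LessThan {
    bool operator()(const std::string& left, const std::string& right) const {
      int ret = std::memcmp(left.data(), right.data(),
                            std::min(left.size(), right.size()));
      if (ret != 0) {
        return ret < 0;
      }
      return left.size() < right.size();
    }
  };

  // Maps each distinct string to the order in which it was first inserted.
  std::map<std::string, size_t, LessThan> dict;

 public:
  // Insert a string and return its insertion order; a repeated string
  // returns the order assigned on its first insertion.
  size_t insert(const char* data, size_t len) {
    auto ret = dict.insert({std::string(data, len), dict.size()});
    return ret.first->second;
  }

  // Rewrite every index in idxBuffer from insertion order to sorted order.
  void reorder(std::vector<int64_t>& idxBuffer) const {
    std::vector<int64_t> mapping(dict.size());
    int64_t dictIdx = 0;
    for (const auto& entry : dict) {  // std::map iterates in sorted order
      mapping[entry.second] = dictIdx++;
    }
    for (auto& idx : idxBuffer) {
      idx = mapping[static_cast<size_t>(idx)];
    }
  }

  size_t size() const { return dict.size(); }
};

// Hypothetical helper: inserts "b", "a", "c", "a" and returns the per-row
// indexes after remapping them to dictionary order.
std::vector<int64_t> demoReorder() {
  SketchDictionary dictionary;
  std::vector<int64_t> idxBuffer;
  const char* rows[] = {"b", "a", "c", "a"};
  for (const char* row : rows) {
    idxBuffer.push_back(
        static_cast<int64_t>(dictionary.insert(row, std::strlen(row))));
  }
  dictionary.reorder(idxBuffer);
  return idxBuffer;
}
```

   The buffer holds one index per non-null row, so a uint32 element would halve its memory footprint while capping the dictionary at 2^32 - 1 entries. Whether the narrower type is practical also depends on what the downstream encoder accepts: if it consumes int64_t values, a uint32 buffer would need a widening copy before encoding.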

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> ---------------------------------------------------------
>
>                 Key: ORC-420
>                 URL: https://issues.apache.org/jira/browse/ORC-420
>             Project: ORC
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Gang Wu
>            Assignee: Gang Wu
>            Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to C++ 
> writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
