[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678850#comment-16678850
 ] 

ASF GitHub Bot commented on ORC-420:
------------------------------------

wgtmac commented on a change in pull request #326: ORC-420: [C++] Implement string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r228390402
 
 

 ##########
 File path: c++/src/ColumnWriter.cc
 ##########
 @@ -831,6 +843,121 @@ namespace orc {
     dataStream->recordPosition(rowIndexPosition.get());
   }
 
+  /**
+   * Implementation of increasing sorted string dictionary
+   */
+  class StringDictionary {
+  public:
+    struct DictEntry {
+      DictEntry(const char * str, size_t len):data(str),length(len) {}
+      const char * data;
+      size_t length;
+    };
+
+    StringDictionary():totalLength(0) {}
+
+    // insert a new string into dictionary, return its insertion order
+    size_t insert(const char * data, size_t len);
+
+    // write dictionary data & length to output buffer
+    void flush(AppendOnlyBufferedStream * dataStream,
+               RleEncoder * lengthEncoder) const;
+
+    // reorder input index buffer from insertion order to dictionary order
+    void reorder(std::vector<int64_t>& idxBuffer) const;
+
+    // get dict entries in insertion order
+    std::vector<const DictEntry *> entriesInInsertionOrder() const;
+
+    // return count of entries
+    size_t size() const;
+
+    // return total length of strings in the dictionary
+    uint64_t length() const;
+
+    void clear();
+
+  private:
+    struct LessThan {
+      bool operator()(const DictEntry& left, const DictEntry& right) const {
+        int ret = memcmp(left.data, right.data, std::min(left.length, right.length));
+        if (ret != 0) {
+          return ret < 0;
+        }
+        return left.length < right.length;
+      }
+    };
+
+    std::map<DictEntry, size_t, LessThan> dict;
+    std::vector<std::vector<char>> data;
 
 Review comment:
   Yes, I can change std::vector<char> to std::string once your review is done.
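
   For readers following the thread, here is a minimal sketch of how a sorted string dictionary might look once the storage is std::string. It is not the code from the pull request: the class name SketchDictionary is hypothetical, and keying the map on std::string directly (instead of on DictEntry pointers into a separate buffer) is an assumption made to keep the example short. std::string's default ordering compares bytes as unsigned char and then by length, which matches the memcmp-then-length rule in LessThan above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical illustration only; not the class from the pull request.
class SketchDictionary {
public:
  // Insert a string and return its insertion order (0-based).
  // A repeated string returns the order assigned on first insertion.
  std::size_t insert(const char* data, std::size_t len) {
    auto ret = dict.emplace(std::string(data, len), dict.size());
    if (ret.second) {
      totalLength += len;  // count each distinct string once
    }
    return ret.first->second;
  }

  // Rewrite indexes from insertion order to sorted dictionary order.
  void reorder(std::vector<int64_t>& idxBuffer) const {
    // mapping[insertionOrder] = position in sorted order
    std::vector<int64_t> mapping(dict.size());
    int64_t dictIdx = 0;
    for (const auto& entry : dict) {  // std::map iterates in key order
      mapping[entry.second] = dictIdx++;
    }
    for (auto& idx : idxBuffer) {
      assert(idx >= 0 && static_cast<std::size_t>(idx) < mapping.size());
      idx = mapping[static_cast<std::size_t>(idx)];
    }
  }

  // Number of distinct entries.
  std::size_t size() const { return dict.size(); }

  // Total length in bytes of all distinct strings.
  uint64_t length() const { return totalLength; }

  void clear() {
    dict.clear();
    totalLength = 0;
  }

private:
  std::map<std::string, std::size_t> dict;  // string -> insertion order
  uint64_t totalLength = 0;
};
```

   With this sketch, inserting "pear", "apple", "pear", "fig" yields insertion orders 0, 1, 0, 2, and reorder() turns the index buffer {0, 1, 0, 2} into {2, 0, 2, 1}. One thing to watch if the DictEntry pointers are kept alongside a std::vector<std::string>: short strings may live inside the string object itself, so their data() pointers can change when the vector reallocates, which is not the case for the heap buffers of std::vector<char>.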

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> ---------------------------------------------------------
>
>                 Key: ORC-420
>                 URL: https://issues.apache.org/jira/browse/ORC-420
>             Project: ORC
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Gang Wu
>            Assignee: Gang Wu
>            Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to the C++ writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
