felipecrv commented on code in PR #38394:
URL: https://github.com/apache/arrow/pull/38394#discussion_r1428388935


##########
cpp/src/arrow/compute/kernels/vector_hash.cc:
##########
@@ -457,14 +457,17 @@ class DictionaryHashKernel : public HashKernel {
       // A better approach may be to run the kernel over each individual chunk,
       // and then hash-aggregate all results (for example sum-group-by for
       // the "value_counts" kernel).
+      if (dictionary_unifier_ == nullptr) {
+        ARROW_ASSIGN_OR_RAISE(dictionary_unifier_,
+                              DictionaryUnifier::Make(dictionary_->type()));
+        RETURN_NOT_OK(dictionary_unifier_->Unify(*dictionary_));
+      }
       auto out_dict_type = dictionary_->type();
       std::shared_ptr<Buffer> transpose_map;
       std::shared_ptr<Array> out_dict;
-      ARROW_ASSIGN_OR_RAISE(auto unifier, 
DictionaryUnifier::Make(out_dict_type));
 
-      ARROW_CHECK_OK(unifier->Unify(*dictionary_));
-      ARROW_CHECK_OK(unifier->Unify(*arr_dict, &transpose_map));
-      ARROW_CHECK_OK(unifier->GetResult(&out_dict_type, &out_dict));
+      RETURN_NOT_OK(dictionary_unifier_->Unify(*arr_dict, &transpose_map));
+      RETURN_NOT_OK(dictionary_unifier_->GetResult(&out_dict_type, &out_dict));
 
       dictionary_ = out_dict;

Review Comment:
   Do we even need `dictionary_` to be a member variable now? Wouldn't it 
suffice to perform a single `DictionaryUnifier::GetResult` call at the end?



##########
cpp/src/arrow/compute/kernels/vector_hash.cc:
##########
@@ -457,14 +457,17 @@ class DictionaryHashKernel : public HashKernel {
       // A better approach may be to run the kernel over each individual chunk,
       // and then hash-aggregate all results (for example sum-group-by for
       // the "value_counts" kernel).
+      if (dictionary_unifier_ == nullptr) {
+        ARROW_ASSIGN_OR_RAISE(dictionary_unifier_,
+                              DictionaryUnifier::Make(dictionary_->type()));
+        RETURN_NOT_OK(dictionary_unifier_->Unify(*dictionary_));
+      }
       auto out_dict_type = dictionary_->type();
       std::shared_ptr<Buffer> transpose_map;
       std::shared_ptr<Array> out_dict;
-      ARROW_ASSIGN_OR_RAISE(auto unifier, 
DictionaryUnifier::Make(out_dict_type));
 
-      ARROW_CHECK_OK(unifier->Unify(*dictionary_));
-      ARROW_CHECK_OK(unifier->Unify(*arr_dict, &transpose_map));
-      ARROW_CHECK_OK(unifier->GetResult(&out_dict_type, &out_dict));
+      RETURN_NOT_OK(dictionary_unifier_->Unify(*arr_dict, &transpose_map));
+      RETURN_NOT_OK(dictionary_unifier_->GetResult(&out_dict_type, &out_dict));
 
       dictionary_ = out_dict;

Review Comment:
   Documentation for `DictionaryUnifier::GetResult` says the unifier can't be 
re-used after a call to `GetResult` [1].
   
   My suggestion (that *I think* will work well):
   
   Rename `dictionary_` to `first_dictionary_` and change `Append(arr)` to 
transition through this state machine on each call:
   
   ```cpp
   ```cpp
   // 
--------------------------------------------------------------------------------------------------------------
   //  Current State                                     Next State
   // 
--------------------------------------------------------------------------------------------------------------
   //  !first_dictionary_ && !dictionary_unifier_   -->  first_dictionary_ = 
arr_dict_
   //                                                    UNCHANGED 
dictionary_unifier_
   // 
--------------------------------------------------------------------------------------------------------------
   //   first_dictionary_ && !dictionary_unifier_   -->  if 
!first_dictionary_.Equals(arr_dict) then
   //                                                       dictionary_unifier_ 
= unify(first_dictionary_, arr_dict)
   //                                                       first_dictionary_ = 
nullptr
   //                                                    else
   //                                                       UNCHANGED 
first_dictionary_, dictionary_unifier_
   //                                                    end
   // 
--------------------------------------------------------------------------------------------------------------
                            dictionary_unifier_    -->  dictionary_unifier_ = 
unify(dictionary_unifier_, arr_dict)
   ```
   
   You will then have to re-think how `dictionary_value_type` and `dictionary` 
work below.
   
   [1] 
https://github.com/apache/arrow/blob/main/cpp/src/arrow/array/array_dict.h#L169



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to