[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

GitBox Tue, 02 Feb 2021 16:58:35 -0800


westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r568264655




##########
File path: cpp/src/arrow/pretty_print.cc
##########
@@ -670,4 +670,49 @@ Status PrettyPrint(const Schema& schema, const 
PrettyPrintOptions& options,
   return Status::OK();
 }
 
+void GdbPrintArray(const Array& arr, int indent) {

Review comment:
       Removed.

##########
File path: cpp/src/arrow/compute/api_vector.h
##########
@@ -63,6 +63,25 @@ enum class SortOrder {
   Descending,
 };
 
+struct DictionaryEncodeOptions : public FunctionOptions {
+  /// Configure how null values will be encoded
+  enum NullEncodingBehavior {
+    /// the null value will be added to the dictionary with a proper index
+    ENCODE,
+    /// the null value will be masked in the indices array
+    MASK,
+    /// the null value will not be included in the dictionary
+    SKIP

Review comment:
       I removed SKIP and added some comments similar to those in filter.

##########
File path: cpp/src/arrow/compute/api_vector.h
##########
@@ -296,7 +315,8 @@ Result<std::shared_ptr<StructArray>> ValueCounts(const 
Datum& value,
 /// \since 1.0.0
 /// \note API not yet finalized
 ARROW_EXPORT
-Result<Datum> DictionaryEncode(const Datum& data, ExecContext* ctx = NULLPTR);
+Result<Datum> DictionaryEncode(const Datum& data, const 
DictionaryEncodeOptions& options,

Review comment:
       Done, and cleaned up all the other call sites explicitly stating the 
default options.

##########
File path: cpp/src/arrow/compute/kernels/vector_hash.cc
##########
@@ -451,8 +487,12 @@ struct HashKernelTraits<Type, Action, 
enable_if_has_string_view<Type>> {
 template <typename Type, typename Action>
 std::unique_ptr<HashKernel> HashInitImpl(KernelContext* ctx, const 
KernelInitArgs& args) {
   using HashKernelType = typename HashKernelTraits<Type, Action>::HashKernel;
-  auto result = 
::arrow::internal::make_unique<HashKernelType>(args.inputs[0].type,
-                                                               
ctx->memory_pool());
+  DictionaryEncodeOptions options;
+  if (auto options_ptr = static_cast<const 
DictionaryEncodeOptions*>(args.options)) {

Review comment:
       I moved the options decoding into Action although the `HashKernel` still 
needed to be aware of it (to decide if `null` should go in the dictionary or 
not) so it changed the kernel->action interface slightly (added 
`ShouldEncodeNulls`)

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -625,9 +621,17 @@ class StructDictionary {
 
  private:
   Status AddOne(Datum column, std::shared_ptr<Int32Array>* fused_indices) {
-    if (column.type()->id() != Type::DICTIONARY) {
-      ARROW_ASSIGN_OR_RAISE(column, 
compute::DictionaryEncode(std::move(column)));
+    if (column.type()->id() == Type::DICTIONARY) {
+      // compute::DictionaryEncode doesn't support dictionary and, even if it 
did, it
+      // would be a null op and return a flat dictionary.  In order to group 
by dictionary
+      // we would need to be able to create a nested dictionary.
+      return Status::NotImplemented(
+          "Cannot use column of type dictionary as grouping criteria");

Review comment:
       Yes, I keep thinking of dictionary arrays as arrays of dictionaries 
[{"a": 1, "b": 2}, ...] and not simply a different encoding.  My brain was off 
on a tangent when I wrote this column.
   
   I added the logic to encode nulls by decoding (casting) the dictionary first 
and then re-encoding.  I also added a test case for this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Reply via email to