westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r568264655
##########
File path: cpp/src/arrow/pretty_print.cc
##########
@@ -670,4 +670,49 @@ Status PrettyPrint(const Schema& schema, const
PrettyPrintOptions& options,
return Status::OK();
}
+void GdbPrintArray(const Array& arr, int indent) {
Review comment:
Removed.
##########
File path: cpp/src/arrow/compute/api_vector.h
##########
@@ -63,6 +63,25 @@ enum class SortOrder {
Descending,
};
+struct DictionaryEncodeOptions : public FunctionOptions {
+ /// Configure how null values will be encoded
+ enum NullEncodingBehavior {
+ /// the null value will be added to the dictionary with a proper index
+ ENCODE,
+ /// the null value will be masked in the indices array
+ MASK,
+ /// the null value will not be included in the dictionary
+ SKIP
Review comment:
I removed SKIP and added some comments similar to those in filter.
##########
File path: cpp/src/arrow/compute/api_vector.h
##########
@@ -296,7 +315,8 @@ Result<std::shared_ptr<StructArray>> ValueCounts(const
Datum& value,
/// \since 1.0.0
/// \note API not yet finalized
ARROW_EXPORT
-Result<Datum> DictionaryEncode(const Datum& data, ExecContext* ctx = NULLPTR);
+Result<Datum> DictionaryEncode(const Datum& data, const
DictionaryEncodeOptions& options,
Review comment:
Done, and cleaned up all the other call sites explicitly stating the
default options.
##########
File path: cpp/src/arrow/compute/kernels/vector_hash.cc
##########
@@ -451,8 +487,12 @@ struct HashKernelTraits<Type, Action,
enable_if_has_string_view<Type>> {
template <typename Type, typename Action>
std::unique_ptr<HashKernel> HashInitImpl(KernelContext* ctx, const
KernelInitArgs& args) {
using HashKernelType = typename HashKernelTraits<Type, Action>::HashKernel;
- auto result =
::arrow::internal::make_unique<HashKernelType>(args.inputs[0].type,
-
ctx->memory_pool());
+ DictionaryEncodeOptions options;
+ if (auto options_ptr = static_cast<const
DictionaryEncodeOptions*>(args.options)) {
Review comment:
I moved the options decoding into Action although the `HashKernel` still
needed to be aware of it (to decide if `null` should go in the dictionary or
not) so it changed the kernel->action interface slightly (added
`ShouldEncodeNulls`)
##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -625,9 +621,17 @@ class StructDictionary {
private:
Status AddOne(Datum column, std::shared_ptr<Int32Array>* fused_indices) {
- if (column.type()->id() != Type::DICTIONARY) {
- ARROW_ASSIGN_OR_RAISE(column,
compute::DictionaryEncode(std::move(column)));
+ if (column.type()->id() == Type::DICTIONARY) {
+ // compute::DictionaryEncode doesn't support dictionary and, even if it
did, it
+ // would be a null op and return a flat dictionary. In order to group
by dictionary
+ // we would need to be able to create a nested dictionary.
+ return Status::NotImplemented(
+ "Cannot use column of type dictionary as grouping criteria");
Review comment:
Yes, I keep thinking of dictionary arrays as arrays of dictionaries
[{"a": 1, "b": 2}, ...] and not simply a different encoding. My brain was off
on a tangent when I wrote this column.
I added the logic to encode nulls by decoding (casting) the dictionary first
and then re-encoding. I also added a test case for this.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]