[ https://issues.apache.org/jira/browse/ARROW-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261996#comment-17261996 ]
Lance Dacey edited comment on ARROW-10247 at 1/10/21, 3:27 AM:
---------------------------------------------------------------

What is the best workaround for this issue right now? I was playing around with making a new partitioning schema if a dictionary type was found in my partition columns:

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

partitioning = None
part_schema = t.select(["project", "date"]).schema
fields = []
for part in part_schema:
    if pa.types.is_dictionary(part.type):
        fields.append(pa.field(part.name, part.type.value_type))
    else:
        fields.append(pa.field(part.name, part.type))
new_schema = pa.schema(fields)
partitioning = ds.partitioning(new_schema, flavor="hive")
{code}

This seems to work for me. My only issue arises when I have multiple partition columns with different types. This partitioning returns an error when I read the dataset with ds.dataset():

{code:python}
partitioning = ds.partitioning(
    pa.schema([
        ("date", pa.date32()),
        ("project", pa.dictionary(index_type=pa.int32(), value_type=pa.string())),
    ]),
    flavor="hive",
)
{code}

ArrowInvalid: No dictionary provided for dictionary field project: dictionary<values=string, indices=int32, ordered=0>

And this returns dictionaries for both partition columns (instead of date being pa.date32()), which is not ideal:

{code:python}
partitioning = ds.HivePartitioning.discover(infer_dictionary=True)
{code}

was (Author: ldacey):
What is the best workaround for this issue right now? If a column in the partition columns is_dictionary(), should I convert it to pa.string() to save the dataset, and then use ds.HivePartitioning.discover(infer_dictionary=True) to read the dataset later?
> [C++][Dataset] Cannot write dataset with dictionary column as partition field
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-10247
>                 URL: https://issues.apache.org/jira/browse/ARROW-10247
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Ben Kietzman
>            Priority: Major
>              Labels: dataset, pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When the column to use for partitioning is dictionary encoded, we get this error:
> {code}
> In [9]: import pyarrow.dataset as ds
>
> In [10]: part = ["xxx"] * 3 + ["yyy"] * 3
>     ...: table = pa.table([
>     ...:     pa.array(range(len(part))),
>     ...:     pa.array(part).dictionary_encode(),
>     ...: ], names=['col', 'part'])
>
> In [11]: part = ds.partitioning(table.select(["part"]).schema)
>
> In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet", partitioning=part)
> ---------------------------------------------------------------------------
> ArrowTypeError                            Traceback (most recent call last)
> <ipython-input-12-c7b81c9b0bda> in <module>
> ----> 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet", partitioning=part)
>
> ~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data, base_dir, basename_template, format, partitioning, schema, filesystem, file_options, use_threads)
>     773     _filesystemdataset_write(
>     774         data, base_dir, basename_template, schema,
> --> 775         filesystem, partitioning, file_options, use_threads,
>     776     )
>
> ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset._filesystemdataset_write()
>
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
>
> ArrowTypeError: scalar xxx (of type string) is invalid for part: dictionary<values=string, indices=int32, ordered=0>
> In ../src/arrow/dataset/filter.cc, line 1082, code: VisitConjunctionMembers(*and_.left_operand(), visitor)
> In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr, [&](const std::string& name, const std::shared_ptr<Scalar>& value) { auto&& _error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do { ::arrow::Status __s = ::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if ((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s); _st.AddContextLine("../src/arrow/dataset/partition.cc", 257, "(_error_or_value28).status()"); return _st; } } while (0); } while (false); auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const auto& field = schema_->field(match[0]); if (!value->type->Equals(field->type())) { return Status::TypeError("scalar ", value->ToString(), " (of type ", *value->type, ") is invalid for ", field->ToString()); } values[match[0]] = value.get(); } return Status::OK(); })
> In ../src/arrow/dataset/file_base.cc, line 321, code: (_error_or_value24).status()
> In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish()
> {code}
> While this seems a quite normal use case, as this column will typically be repeated many times (and we also support reading it as such with dictionary type, a roundtrip is currently not possible in that case).
> I tagged it for 2.0.0 for a moment in case it's possible today, but I didn't yet look into how easy it would be to fix.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)