[ https://issues.apache.org/jira/browse/ARROW-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261996#comment-17261996 ]
Lance Dacey edited comment on ARROW-10247 at 1/10/21, 3:27 AM:
---------------------------------------------------------------

What is the best workaround for this issue right now? I was playing around with making a new partitioning schema if a dictionary type was found in my partition columns:

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

partitioning = None
part_schema = t.select(["project", "date"]).schema
fields = []
for part in part_schema:
    if pa.types.is_dictionary(part.type):
        fields.append(pa.field(part.name, part.type.value_type))
    else:
        fields.append(pa.field(part.name, part.type))
new_schema = pa.schema(fields)
partitioning = ds.partitioning(new_schema, flavor="hive")
{code}

This seems to work for me. My only issue arises when I have multiple partition columns with different types. This partitioning returns an error when I read the dataset with ds.dataset():

{code:python}
partitioning = ds.partitioning(
    pa.schema([
        ("date", pa.date32()),
        ("project", pa.dictionary(index_type=pa.int32(), value_type=pa.string())),
    ]),
    flavor="hive",
)
{code}

ArrowInvalid: No dictionary provided for dictionary field project: dictionary<values=string, indices=int32, ordered=0>

And this returns dictionaries for both partition columns (instead of date being pa.date32()), which is not ideal:

{code:python}
partitioning = ds.HivePartitioning.discover(infer_dictionary=True)
{code}

was (Author: ldacey):
What is the best workaround for this issue right now? If a column in the partition columns is_dictionary(), should I convert it to pa.string() to save the dataset, and then use ds.HivePartitioning.discover(infer_dictionary=True) to read the dataset later?
> [C++][Dataset] Cannot write dataset with dictionary column as partition field
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-10247
>                 URL: https://issues.apache.org/jira/browse/ARROW-10247
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Ben Kietzman
>            Priority: Major
>              Labels: dataset, pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When the column to use for partitioning is dictionary encoded, we get this error:
> {code}
> In [9]: import pyarrow.dataset as ds
>
> In [10]: part = ["xxx"] * 3 + ["yyy"] * 3
>     ...: table = pa.table([
>     ...:     pa.array(range(len(part))),
>     ...:     pa.array(part).dictionary_encode(),
>     ...: ], names=['col', 'part'])
>
> In [11]: part = ds.partitioning(table.select(["part"]).schema)
>
> In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet", partitioning=part)
> ---------------------------------------------------------------------------
> ArrowTypeError                            Traceback (most recent call last)
> <ipython-input-12-c7b81c9b0bda> in <module>
> ----> 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet", partitioning=part)
>
> ~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data, base_dir, basename_template, format, partitioning, schema, filesystem, file_options, use_threads)
>     773     _filesystemdataset_write(
>     774         data, base_dir, basename_template, schema,
> --> 775         filesystem, partitioning, file_options, use_threads,
>     776     )
>
> ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset._filesystemdataset_write()
>
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
>
> ArrowTypeError: scalar xxx (of type string) is invalid for part: dictionary<values=string, indices=int32, ordered=0>
> In ../src/arrow/dataset/filter.cc, line 1082, code: VisitConjunctionMembers(*and_.left_operand(), visitor)
> In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr, [&](const std::string& name, const std::shared_ptr<Scalar>& value) { auto&& _error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do { ::arrow::Status __s = ::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if ((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s); _st.AddContextLine("../src/arrow/dataset/partition.cc", 257, "(_error_or_value28).status()"); return _st; } } while (0); } while (false); auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const auto& field = schema_->field(match[0]); if (!value->type->Equals(field->type())) { return Status::TypeError("scalar ", value->ToString(), " (of type ", *value->type, ") is invalid for ", field->ToString()); } values[match[0]] = value.get(); } return Status::OK(); })
> In ../src/arrow/dataset/file_base.cc, line 321, code: (_error_or_value24).status()
> In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish()
> {code}
> While this seems a quite normal use case, as this column will typically be repeated many times (and we also support reading it as such with dictionary type, a roundtrip is currently not possible in that case).
> I tagged it for 2.0.0 for a moment in case it's possible today, but I didn't yet look into how easy it would be to fix.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)