Todd West created ARROW-17391:
---------------------------------

             Summary: arrow::read_feather() cannot read DictionaryArray written 
from C#
                 Key: ARROW-17391
                 URL: https://issues.apache.org/jira/browse/ARROW-17391
             Project: Apache Arrow
          Issue Type: Bug
          Components: C#, R
    Affects Versions: 9.0.1
            Reporter: Todd West
             Fix For: 9.0.1


This applies to Arrow 9.0.0, both the C# NuGet package and the R package, but 
for some reason 9.0.0 isn't in the issue dropdowns' list of released versions. 
The [implementation status 
page|https://arrow.apache.org/docs/status.html#ipc-format] also appears to be 
stale: the C# source contains 
[DictionaryArray|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Arrays/DictionaryArray.cs],
 and stepping through in the debugger confirms that 
[ArrowStreamWriter|https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs]
 receives both the dictionary index and value arrays it's given, and that its 
flags and data structures update accordingly on the code paths which write a 
[dictionary batch|https://arrow.apache.org/docs/format/Columnar.html]. However, 
on the R side, read_feather() fails with

{{Error: Key error: Dictionary with id 1 not found}}

So it appears most likely that either C# isn't properly emitting the 
dictionary batch, despite seeming to have all the code to do so, or something's 
going wrong in the C++ layers under R on the reading side.

Setup on the C# side is simple

{{        public static DictionaryArray CreateStringTable(Memory<byte> indicies, IList<string> values)}}
{{        {}}
{{            StringArray.Builder valueArray = new();}}
{{            for (int valueIndex = 0; valueIndex < values.Count; ++valueIndex)}}
{{            {}}
{{                valueArray.Append(values[valueIndex]);}}
{{            }}}
{{            UInt8Array indexArray = new(ArrowArrayExtensions.WrapInArrayData(UInt8Type.Default, indicies, indicies.Length));}}
{{            return new DictionaryArray(new(UInt8Type.Default, StringType.Default, false), indexArray, valueArray.Build());}}
{{        }}}

as is the R

{{        library(arrow)}}
{{        foo = read_feather("test.feather")}}

If I drop the dictionary column, the two Arrow implementations interoperate 
without difficulty. The same holds if I write only the indices as a UInt8 
column. So the issue here is clearly specific to the use of DictionaryArray. 
I've also tried other index sizes, so it doesn't appear specific to the use of 
UInt8.

I'm therefore left with two questions:

1) Does DictionaryArray have working use cases in 9.0.0?

2) If what I'm doing isn't supposed to work yet, or I'm not getting the data 
structures set up correctly (there's no C# DictionaryArray example [on 
GitHub|https://github.com/apache/arrow/tree/master/csharp/examples]), is there 
an array level workaround?

There's only one string table in this schema and it's typically tiny (five 
values or less) so putting its values part in the schema metadata is a viable 
workaround, albeit an inelegant one.

I'm not seeing a Feather file viewer available but, if there is one, I'd be 
happy to take a closer look. I can also link the sources after they've been 
committed and pushed, which should be by the end of the day tomorrow.
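
In the absence of a viewer, one rough way to take a closer look is to list the 
message types stored in the file. The sketch below uses only the Python 
standard library and is based on a reading of the Arrow IPC file layout 
(ARROW1 magic, encapsulated messages, footer) and of the Message table in 
Message.fbs. The helper name list_ipc_messages is hypothetical, and the 
flatbuffer field positions (header_type as field 1, bodyLength as field 3) are 
assumptions that should be checked against the format sources before trusting 
the output.

```python
import struct

MAGIC = b"ARROW1"
# MessageHeader union values as declared in Arrow's Message.fbs (assumed order).
HEADER_NAMES = {0: "NONE", 1: "Schema", 2: "DictionaryBatch",
                3: "RecordBatch", 4: "Tensor", 5: "SparseTensor"}


def _field(buf, table, index, fmt, default):
    """Read one scalar field of a flatbuffer table through its vtable."""
    vtable = table - struct.unpack_from("<i", buf, table)[0]
    vtable_size = struct.unpack_from("<H", buf, vtable)[0]
    slot = 4 + 2 * index              # skip the vtable size and table size entries
    if slot + 2 > vtable_size:
        return default                # field not present in this vtable
    offset = struct.unpack_from("<H", buf, vtable + slot)[0]
    if offset == 0:
        return default                # field left at its default value
    return struct.unpack_from(fmt, buf, table + offset)[0]


def list_ipc_messages(data):
    """Return [(message type, body length)] for the messages before the footer."""
    if len(data) < 18 or data[:6] != MAGIC or data[-6:] != MAGIC:
        raise ValueError("not an Arrow IPC (Feather V2) file")
    footer_len = struct.unpack_from("<i", data, len(data) - 10)[0]
    end = len(data) - 10 - footer_len  # the message stream stops where the footer starts
    if footer_len < 0 or end < 8:
        raise ValueError("implausible footer length")
    pos = 8                            # magic plus two bytes of padding
    messages = []
    while pos + 4 <= end:
        (size,) = struct.unpack_from("<i", data, pos)
        pos += 4
        if size == -1:                 # 0xFFFFFFFF continuation marker
            if pos + 4 > end:
                break
            (size,) = struct.unpack_from("<i", data, pos)
            pos += 4
        if size <= 0:                  # end-of-stream marker
            break
        meta = data[pos:pos + size]
        table = struct.unpack_from("<I", meta, 0)[0]
        header_type = _field(meta, table, 1, "<B", 0)
        body_len = _field(meta, table, 3, "<q", 0)
        messages.append((HEADER_NAMES.get(header_type, "unknown"), body_len))
        pos += size + body_len         # metadata size already includes its padding
    return messages


# Possible usage against the file from this report:
#   with open("test.feather", "rb") as f:
#       for name, body_len in list_ipc_messages(f.read()):
#           print(name, body_len)
```

If no DictionaryBatch entry shows up ahead of the first RecordBatch, that would 
point at the writing side. Note that this only walks the message stream; the 
file footer keeps its own lists of dictionary and record batch blocks, which 
this sketch does not decode and which a file reader may rely on instead.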



--
This message was sent by Atlassian Jira
(v8.20.10#820010)