[ https://issues.apache.org/jira/browse/ARROW-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17187088#comment-17187088 ]

David Li commented on ARROW-8749:
---------------------------------

This was fixed by Antoine's patch in ARROW-9960.

A small reproducer in Python:
{code:python}
from io import BytesIO
import pyarrow as pa

s1 = pa.schema([pa.field("foo", pa.dictionary(pa.int16(), pa.string()))])
s2 = pa.schema([pa.field("foo", pa.dictionary(pa.int16(), pa.string()))])

t1 = pa.Table.from_arrays(
    [pa.DictionaryArray.from_arrays(pa.array([0, 1, 2, 0, 1], type=pa.int16()),
                                    pa.array(['a', 'b', 'c']))],
    schema=s1)
t2 = pa.Table.from_arrays(
    [pa.DictionaryArray.from_arrays(pa.array([0, 1, 2, 0, 1], type=pa.int16()),
                                    pa.array(['a', 'b', 'c']))],
    schema=s2)

sink = BytesIO()

writer = pa.RecordBatchStreamWriter(sink, s2)
writer.write(t1)
writer.write(t2)
writer.close()
print(pa.RecordBatchStreamReader(sink.getvalue()).read_all())
{code}
With Arrow 1.0.1, this gives:
{noformat}
Traceback (most recent call last):
  File "arrow8749.py", line 16, in <module>
    print(pa.RecordBatchStreamReader(sink.getvalue()).read_all())
  File "pyarrow/ipc.pxi", line 445, in pyarrow.lib._CRecordBatchReader.read_all
  File "pyarrow/error.pxi", line 103, in pyarrow.lib.check_status
pyarrow.lib.ArrowKeyError: No record of dictionary type with id 1
{noformat}
With the nightly, it passes.

> [C++] IpcFormatWriter writes dictionary batches with wrong ID
> -------------------------------------------------------------
>
>                 Key: ARROW-8749
>                 URL: https://issues.apache.org/jira/browse/ARROW-8749
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 0.16.0, 0.17.0
>            Reporter: David Li
>            Priority: Major
>             Fix For: 2.0.0
>
>
> IpcFormatWriter assigns dictionary IDs once when it writes the schema 
> message. Then, when it writes dictionary batches, it assigns dictionary IDs 
> again because it re-collects dictionaries from the given batch. So for 
> example, if you have 5 dictionaries, the first dictionary will end up with ID 
> 0 but be written with ID 5.
> For example, this will fail with "'_error_or_value11.status()' failed with 
> Key error: No record of dictionary type with id 9"
> {code:cpp}
> TEST_F(TestMetadata, DoPutDictionaries) {
>   ASSERT_OK_AND_ASSIGN(auto sink, arrow::io::BufferOutputStream::Create());
>   std::shared_ptr<Schema> schema = ExampleDictSchema();
>   BatchVector expected_batches;
>   ASSERT_OK(ExampleDictBatches(&expected_batches));
>   ASSERT_OK_AND_ASSIGN(auto writer,
>                        arrow::ipc::NewStreamWriter(sink.get(), schema));
>   for (auto& batch : expected_batches) {
>     ASSERT_OK(writer->WriteRecordBatch(*batch));
>   }
>   ASSERT_OK_AND_ASSIGN(auto buf, sink->Finish());
>   arrow::io::BufferReader source(buf);
>   ASSERT_OK_AND_ASSIGN(auto reader,
>                        arrow::ipc::RecordBatchStreamReader::Open(&source));
>   AssertSchemaEqual(schema, reader->schema());
>   for (auto& batch : expected_batches) {
>     ASSERT_OK_AND_ASSIGN(auto actual, reader->Next());
>     AssertBatchesEqual(*actual, *batch);
>   }
> }
> {code}
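The double assignment described above can be illustrated with a small, self-contained sketch. This is a hypothetical model of the bug mechanism, not Arrow's actual C++ code: the schema pass and the dictionary-batch pass each advance the same sequential ID counter instead of reusing the IDs recorded in the schema.

{code:python}
# Hypothetical sketch of the ID mismatch (not Arrow's implementation):
# both passes hand out sequential IDs, so the second pass starts where
# the first one stopped.
def assign_ids(field_names, start=0):
    # Assign sequential dictionary IDs beginning at `start`.
    return {name: start + i for i, name in enumerate(field_names)}

fields = ["d0", "d1", "d2", "d3", "d4"]  # five dictionary-encoded fields

schema_ids = assign_ids(fields)          # schema message records IDs 0..4
# Re-collecting dictionaries from the batch keeps counting from 5:
batch_ids = assign_ids(fields, start=len(schema_ids))

print(schema_ids["d0"], batch_ids["d0"])  # 0 5
{code}

The schema promises dictionary ID 0 for the first field, but the dictionary batch is written under ID 5, so the reader fails with "No record of dictionary type with id 5" — the same shape of error as in the reproducers above.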



--
This message was sent by Atlassian Jira
(v8.3.4#803005)