[ https://issues.apache.org/jira/browse/ARROW-14564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441223#comment-17441223 ]

Joris Van den Bossche commented on ARROW-14564:
-----------------------------------------------

This is a limitation of the Parquet format itself. Pyarrow defaults to writing 
version 1.0 Parquet files, which do not yet support the uint32 type. 
However, you can specify {{version="2.4"}} in the {{write_table}} call, and 
then both unsigned integer types should be preserved.

We plan to switch to the newer Parquet format version by default in the near 
future (see ARROW-12203 and the linked issues).

> [python] uint32 incorrectly saves to Parquet as int64
> -----------------------------------------------------
>
>                 Key: ARROW-14564
>                 URL: https://issues.apache.org/jira/browse/ARROW-14564
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 6.0.0
>         Environment: Ubuntu 20.10, Python 3.8.10
>            Reporter: Bruce Allen
>            Priority: Major
>         Attachments: test_u32.py
>
>
> Function pyarrow.parquet.write_table() incorrectly saves data of type 
> unsigned int32 as signed int64.  Code test_u32.py showing failure is attached.
> Output from running test_u32.py indicating faulty retyping:
> pyarrow version: 6.0.0
> numpy data:
> [(1, 2) (3, 4)]
> [('my_u2', '<u2'), ('my_u4', '<u4')]
> result:
>  my_u2 my_u4
> 0 1 2
> 1 3 4
> my_u2 uint16
> my_u4 int64
> dtype: object
>  
> We can also observe that the incorrect int64 type is in the Parquet file by 
> using the "parq" tool:
> $ parq _test_u32_pq --schema
> # Schema 
>  <pyarrow._parquet.ParquetSchema object at 0x7ff2e40b2a40>
> required group field_id=-1 schema {
>  optional int32 field_id=-1 my_u2 (Int(bitWidth=16, isSigned=false));
>  optional int64 field_id=-1 my_u4;
> }



--
This message was sent by Atlassian Jira
(v8.20.1#820001)