[ https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490156#comment-17490156 ]
Ravi Gummadi commented on ARROW-15645: -------------------------------------- The issue is seen only with pyarrow 5.0.0 and is not seen with pyarrow 3.0.0. _________________________ Some investigation details from my side while debugging validate(): The offsets are having opposite byte-order with pyarrow 5.0.0 (from validate.cc) (gdb) p data.buffers[1]->data() $8 = (const uint8_t *) 0x3fff9680040 "" (gdb) p data.buffers[1]->data()[4] $9 = 26 '\032' The same flight server is used for reading data. When I run the above sample code with pyarrow 3.0.0, I see the following correct offset values with the right byte-order. >From data.h GetValues() called from validate.cc: (gdb) p buffers[1]->data() $11 = (const uint8_t *) 0x2aa00a544b2 "" (gdb) p buffers[1]->data()[4] $12 = 0 '\000' (gdb) p buffers[1]->data()[5] $13 = 0 '\000' (gdb) p buffers[1]->data()[6] $14 = 0 '\000' (gdb) p buffers[1]->data()[7] $15 = 26 '\032' > Data read through Flight is having endianness issue on s390x > ------------------------------------------------------------ > > Key: ARROW-15645 > URL: https://issues.apache.org/jira/browse/ARROW-15645 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC, Python > Affects Versions: 5.0.0 > Environment: Linux s390x (big endian) > Reporter: Ravi Gummadi > Priority: Major > > Am facing an endianness issue on s390x(big endian) when converting the data > read through flight to pandas data frame. > (1) table.validate() fails with error > Traceback (most recent call last): > File "/tmp/2.py", line 51, in <module> > table.validate() > File "pyarrow/table.pxi", line 1232, in pyarrow.lib.Table.validate > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: Negative offsets in > binary array > (2) table.to_pandas() gives a segmentation fault > ____________ > Here is a sample code that I am using: > from pyarrow import flight > import os > import json > flight_endpoint = os.environ.get("flight_server_url", > "grpc+tls://...local:443") > print(flight_endpoint) > # > class TokenClientAuthHandler(flight.ClientAuthHandler): > """An example implementation of authentication via handshake. > With the default constructor, the user token is read from the > environment: TokenClientAuthHandler(). > You can also pass a user token as parameter to the constructor, > TokenClientAuthHandler(yourtoken). > """ > def {_}__{_}init{_}__{_}(self, token: str = None): > super().{_}__{_}init{_}__{_}() > if( token != None): > strToken = strToken = 'Bearer {}'.format(token) > else: > strToken = 'Bearer {}'.format(os.environ.get("some_auth_token")) > self.token = strToken.encode('utf-8') > #print(self.token) > def authenticate(self, outgoing, incoming): > outgoing.write(self.token) > self.token = incoming.read() > def get_token(self): > return self.token > > readClient = flight.FlightClient(flight_endpoint) > readClient.authenticate(TokenClientAuthHandler()) > cmd = json.dumps(\{...}) > descriptor = flight.FlightDescriptor.for_command(cmd) > flightInfo = readClient.get_flight_info(descriptor) > reader = readClient.do_get(flightInfo.endpoints[0].ticket) > table = reader.read_all() > print(table) > print(table.num_columns) > print(table.num_rows) > table.validate() > table.to_pandas() -- This message was sent by Atlassian Jira (v8.20.1#820001)