[jira] [Commented] (ARROW-15645) Data read through Flight is having endianness issue on s390x

2022-02-24 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497452#comment-17497452
 ] 

Antoine Pitrou commented on ARROW-15645:


Ok, so my guess is that both server (Java) and client (Python/C++) are on 
s390x, right?

On Arrow C++ 3.0.0, no conversion happens in either Java or C++, and it works 
since client and server have the same endianness (both big endian).

On Arrow C++ 4.0.0+, the Flight client reads the endianness information from 
the IPC stream. If the machine endianness doesn't match the stream endianness, 
endianness conversion is attempted by default.

Here is the problem: Arrow Java (and the Java Flight server) seems to always 
set the endianness information to "little" (even on a big endian machine). 
Arrow C++ interprets that information as meaning a conversion is needed, while 
the data is already in the right format.
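
To make the failure mode concrete, here is a minimal sketch in plain Python (it does not use Arrow). The helper names `write_offsets` and `read_offsets` are hypothetical and only mimic the behaviour described above: a writer on a big-endian machine emits native bytes, the stream is labelled "little", and a big-endian reader swaps bytes because the label differs from its own byte order. The offset values are made up for illustration.

```python
import struct

def write_offsets(offsets, machine_order):
    """Serialize int32 offsets in the writer's native byte order."""
    prefix = ">" if machine_order == "big" else "<"
    return struct.pack(f"{prefix}{len(offsets)}i", *offsets)

def read_offsets(payload, declared_order, machine_order):
    """Hypothetical reader: byte-swap only if the declared order differs from ours."""
    if declared_order != machine_order:
        # Reverse each 4-byte value to "convert" it to the native order.
        payload = b"".join(payload[i:i + 4][::-1] for i in range(0, len(payload), 4))
    prefix = ">" if machine_order == "big" else "<"
    return list(struct.unpack(f"{prefix}{len(payload) // 4}i", payload))

offsets = [0, 26, 128]
payload = write_offsets(offsets, machine_order="big")

# Stream correctly labelled "big": no swap, values survive.
correct = read_offsets(payload, declared_order="big", machine_order="big")
# Stream wrongly labelled "little": the reader swaps data that was already native.
mislabelled = read_offsets(payload, declared_order="little", machine_order="big")

print(correct)
print(mislabelled)
```

In this sketch the unnecessary swap turns 128 into a negative int32, which is consistent with the "Negative offsets in binary array" validation error in the report.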

> Data read through Flight is having endianness issue on s390x
> 
>
> Key: ARROW-15645
> URL: https://issues.apache.org/jira/browse/ARROW-15645
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC, Python
>Affects Versions: 5.0.0
> Environment: Linux s390x (big endian)
>Reporter: Ravi Gummadi
>Priority: Major
>
> I am facing an endianness issue on s390x (big endian) when converting data 
> read through Flight to a pandas DataFrame.
> (1) table.validate() fails with this error:
> {code}
> Traceback (most recent call last):
>   File "/tmp/2.py", line 51, in <module>
>     table.validate()
>   File "pyarrow/table.pxi", line 1232, in pyarrow.lib.Table.validate
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: Negative offsets in binary array
> {code}
> (2) table.to_pandas() causes a segmentation fault.
> 
> Here is the sample code that I am using:
> {code:python}
> from pyarrow import flight
> import os
> import json
>
> flight_endpoint = os.environ.get("flight_server_url", "grpc+tls://...local:443")
> print(flight_endpoint)
>
> class TokenClientAuthHandler(flight.ClientAuthHandler):
>     """An example implementation of authentication via handshake.
>
>     With the default constructor, the user token is read from the
>     environment: TokenClientAuthHandler().
>     You can also pass a user token as parameter to the constructor:
>     TokenClientAuthHandler(yourtoken).
>     """
>     def __init__(self, token: str = None):
>         super().__init__()
>         if token is not None:
>             strToken = 'Bearer {}'.format(token)
>         else:
>             strToken = 'Bearer {}'.format(os.environ.get("some_auth_token"))
>         self.token = strToken.encode('utf-8')
>         #print(self.token)
>
>     def authenticate(self, outgoing, incoming):
>         outgoing.write(self.token)
>         self.token = incoming.read()
>
>     def get_token(self):
>         return self.token
>
> readClient = flight.FlightClient(flight_endpoint)
> readClient.authenticate(TokenClientAuthHandler())
> cmd = json.dumps({...})
> descriptor = flight.FlightDescriptor.for_command(cmd)
> flightInfo = readClient.get_flight_info(descriptor)
> reader = readClient.do_get(flightInfo.endpoints[0].ticket)
> table = reader.read_all()
> print(table)
> print(table.num_columns)
> print(table.num_rows)
> table.validate()
> table.to_pandas()
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15645) Data read through Flight is having endianness issue on s390x

2022-02-24 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497439#comment-17497439
 ] 

Antoine Pitrou commented on ARROW-15645:


Is the client or the server on s390x?



[jira] [Commented] (ARROW-15645) Data read through Flight is having endianness issue on s390x

2022-02-24 Thread Ravi Gummadi (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497337#comment-17497337
 ] 

Ravi Gummadi commented on ARROW-15645:

The Flight server is using the Java-based Arrow 6.0.1.
On the client side, pyarrow 5.0.0, 6.0.0 and 7.0.0 all show the issue reported 
above.



[jira] [Commented] (ARROW-15645) Data read through Flight is having endianness issue on s390x

2022-02-14 Thread Ravi Gummadi (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492382#comment-17492382
 ] 

Ravi Gummadi commented on ARROW-15645:

[~kiszk],

I tried pyarrow 6.0 on the client side and the issue is still seen.
To summarize, with Arrow 6.0.x on the Flight server side:
(1) the issue is NOT seen with pyarrow 3.0.0 on the client side
(2) the issue is seen with pyarrow 5.0.0 on the client side
(3) the issue is seen with pyarrow 6.0.0 on the client side



[jira] [Commented] (ARROW-15645) Data read through Flight is having endianness issue on s390x

2022-02-10 Thread Ravi Gummadi (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490682#comment-17490682
 ] 

Ravi Gummadi commented on ARROW-15645:

The Arrow version on the Flight server side is 6.x.

Any clues as to why only pyarrow 5.0.0 has the issue and pyarrow 3.0.0 does 
not? Where in the Arrow source code would the fix need to go? 
Thanks



[jira] [Commented] (ARROW-15645) Data read through Flight is having endianness issue on s390x

2022-02-10 Thread Ravi Gummadi (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490156#comment-17490156
 ] 

Ravi Gummadi commented on ARROW-15645:

The issue is seen only with pyarrow 5.0.0 and is not seen with pyarrow 3.0.0.

Some investigation details from my side while debugging validate():

With pyarrow 5.0.0, the offsets have the opposite byte order (seen from validate.cc):

(gdb) p data.buffers[1]->data()
$8 = (const uint8_t *) 0x3fff9680040 ""
(gdb) p data.buffers[1]->data()[4]
$9 = 26 '\032'

The same Flight server is used for reading the data. When I run the above sample 
code with pyarrow 3.0.0, I see the following correct offset values with the 
right byte order.
From GetValues() in data.h, called from validate.cc:

(gdb) p buffers[1]->data()
$11 = (const uint8_t *) 0x2aa00a544b2 ""
(gdb) p buffers[1]->data()[4]
$12 = 0 '\000'
(gdb) p buffers[1]->data()[5]
$13 = 0 '\000'
(gdb) p buffers[1]->data()[6]
$14 = 0 '\000'
(gdb) p buffers[1]->data()[7]
$15 = 26 '\032'
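
As a rough cross-check of the two gdb sessions, the bytes at positions [4..7] can be decoded as one int32 offset in plain Python. This is only a sketch: for pyarrow 5.0.0 the session above printed byte [4] alone, so bytes [5..7] are assumed to be zero here, and the helper `as_int32` is a hypothetical name.

```python
# Bytes [4..7] of the offsets buffer, i.e. the second int32 offset.
bytes_pyarrow_3 = bytes([0, 0, 0, 26])  # as printed by gdb with pyarrow 3.0.0
bytes_pyarrow_5 = bytes([26, 0, 0, 0])  # byte [4] from gdb; bytes [5..7] are ASSUMED zero

def as_int32(raw, byteorder):
    """Decode 4 raw bytes as a signed 32-bit integer in the given byte order."""
    return int.from_bytes(raw, byteorder=byteorder, signed=True)

# s390x is big endian, so native reads use "big".
native_3 = as_int32(bytes_pyarrow_3, "big")
native_5 = as_int32(bytes_pyarrow_5, "big")
# Reading the pyarrow 5.0.0 bytes as little endian recovers the expected value.
swapped_5 = as_int32(bytes_pyarrow_5, "little")

print(native_3, native_5, swapped_5)
```

Under that assumption, the pyarrow 5.0.0 buffer holds the same offset as the pyarrow 3.0.0 buffer, just stored in little-endian order on a big-endian machine.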
