[jira] [Created] (ARROW-6395) [pyarrow] Bug when using bool arrays with stride greater than 1
Philip Felton created ARROW-6395:

Summary: [pyarrow] Bug when using bool arrays with stride greater than 1
Key: ARROW-6395
URL: https://issues.apache.org/jira/browse/ARROW-6395
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 0.14.0
Reporter: Philip Felton

Here's code to reproduce it:

{code:python}
>>> import numpy as np
>>> import pyarrow as pa
>>> pa.__version__
'0.14.0'
>>> xs = np.array([True, False, False, True, True, False, True, True, True,
...                False, False, False, False, False, True, False, True, True,
...                True, True, True])
>>> xs_sliced = xs[0::2]
>>> xs_sliced
array([ True, False,  True,  True,  True, False, False,  True,  True,
        True,  True])
>>> pa_xs = pa.array(xs_sliced, pa.bool_())
>>> pa_xs
[
  true,
  false,
  false,
  false,
  false,
  false,
  false,
  false,
  false,
  false,
  false
]
{code}

-- This message was sent by Atlassian Jira (v8.3.2#803003)
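The sliced array above is a non-contiguous view sharing the original buffer, so a conversion that assumes stride 1 reads the wrong elements. A pure-Python sketch of that failure mode (hypothetical illustration only, not Arrow's actual C++ conversion code):

```python
def read_bools_stride_unaware(buf, length):
    # Assumes the values are contiguous: reads the first `length` bytes.
    return [bool(buf[i]) for i in range(length)]

def read_bools_stride_aware(buf, length, stride):
    # Honours the element stride of the sliced view.
    return [bool(buf[i * stride]) for i in range(length)]

# The raw buffer behind xs in the report (NumPy stores one byte per bool).
data = bytes([1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1])

# xs[0::2] is a view over this buffer with element stride 2 and length 11.
print(read_bools_stride_aware(data, 11, 2))
# [True, False, True, True, True, False, False, True, True, True, True]
# (matches xs_sliced above)
print(read_bools_stride_unaware(data, 11))
# reads a contiguous prefix of the buffer instead: wrong values
```

A workaround until this is fixed would be to pass a contiguous copy, e.g. `pa.array(np.ascontiguousarray(xs_sliced), pa.bool_())`.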
[jira] [Created] (ARROW-6149) [Parquet] Decimal comparisons used for min/max statistics are not correct
Philip Felton created ARROW-6149:

Summary: [Parquet] Decimal comparisons used for min/max statistics are not correct
Key: ARROW-6149
URL: https://issues.apache.org/jira/browse/ARROW-6149
Project: Apache Arrow
Issue Type: Bug
Reporter: Philip Felton

The [Parquet format specification|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] says:

bq. If the column uses int32 or int64 physical types, then signed comparison of the integer values produces the correct ordering. If the physical type is fixed, then the correct ordering can be produced by flipping the most-significant bit in the first byte and then using unsigned byte-wise comparison.

However, this isn't followed in the C++ Parquet code: 16-byte decimal comparison is implemented as a lexicographical comparison of signed chars. This appears to be because the function at [https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L183] dispatches only on the sort_order (signed) and the physical_type (FIXED_LENGTH_BYTE_ARRAY); there is no override for decimal.
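To see why the two orderings differ, here is a sketch of both comparisons in Python (illustrative only, not Arrow's C++ code), using 2-byte two's-complement values for brevity; the report concerns 16-byte decimals, but the failure mode is the same:

```python
def signed_lex_cmp(a: bytes, b: bytes) -> int:
    # What the report says parquet-cpp effectively does: lexicographical
    # comparison of *signed* chars.
    for x, y in zip(a, b):
        sx = x - 256 if x >= 128 else x
        sy = y - 256 if y >= 128 else y
        if sx != sy:
            return -1 if sx < sy else 1
    return 0

def spec_cmp(a: bytes, b: bytes) -> int:
    # What LogicalTypes.md specifies: flip the most-significant bit of the
    # first byte, then compare unsigned byte-wise.
    fa = bytes([a[0] ^ 0x80]) + a[1:]
    fb = bytes([b[0] ^ 0x80]) + b[1:]
    return (fa > fb) - (fa < fb)

a = bytes([0x00, 0x80])  # two's-complement encoding of 128
b = bytes([0x00, 0x7F])  # two's-complement encoding of 127
print(signed_lex_cmp(a, b))  # -1: claims 128 < 127, wrong
print(spec_cmp(a, b))        #  1: 128 > 127, correct
```

The signed comparison handles the sign byte correctly but misorders any later byte with its high bit set, which is exactly what corrupts min/max statistics for decimals.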
[jira] [Created] (ARROW-5630) [Python] Table of nested arrays doesn't round trip
Philip Felton created ARROW-5630:

Summary: [Python] Table of nested arrays doesn't round trip
Key: ARROW-5630
URL: https://issues.apache.org/jira/browse/ARROW-5630
Project: Apache Arrow
Issue Type: Bug
Environment: pyarrow 0.13, Windows 10
Reporter: Philip Felton

This is pyarrow 0.13 on Windows.

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def make_table(num_rows):
    typ = pa.list_(pa.field("item", pa.float32(), False))
    return pa.Table.from_arrays([
        pa.array([[0] * (i % 10) for i in range(0, num_rows)], type=typ),
        pa.array([[0] * ((i + 5) % 10) for i in range(0, num_rows)], type=typ)
    ], ['a', 'b'])

pq.write_table(make_table(100), 'test.parquet')
pq.read_table('test.parquet')
{code}

The last line throws the following exception:

{noformat}
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
in
----> 1 pq.read_table('full.parquet')

~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, filesystem)
   1150         return fs.read_parquet(path, columns=columns,
   1151                                use_threads=use_threads, metadata=metadata,
-> 1152                                use_pandas_metadata=use_pandas_metadata)
   1153
   1154     pf = ParquetFile(source, metadata=metadata)

~\Anaconda3\lib\site-packages\pyarrow\filesystem.py in read_parquet(self, path, columns, metadata, schema, use_threads, use_pandas_metadata)
    179                                  filesystem=self)
    180         return dataset.read(columns=columns, use_threads=use_threads,
--> 181                             use_pandas_metadata=use_pandas_metadata)
    182
    183     def open(self, path, mode='rb'):

~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, use_threads, use_pandas_metadata)
   1012             table = piece.read(columns=columns, use_threads=use_threads,
   1013                                partitions=self.partitions,
-> 1014                                use_pandas_metadata=use_pandas_metadata)
   1015             tables.append(table)
   1016

~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, use_threads, partitions, open_file_func, file, use_pandas_metadata)
    562             table = reader.read_row_group(self.row_group, **options)
    563         else:
--> 564             table = reader.read(**options)
    565
    566         if len(self.partition_keys) > 0:

~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, use_threads, use_pandas_metadata)
    212             columns, use_pandas_metadata=use_pandas_metadata)
    213         return self.reader.read_all(column_indices=column_indices,
--> 214                                     use_threads=use_threads)
    215
    216     def scan_contents(self, columns=None, batch_size=65536):

~\Anaconda3\lib\site-packages\pyarrow\_parquet.pyx in pyarrow._parquet.ParquetReader.read_all()

~\Anaconda3\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Column 1 named b expected length 932066 but got length 932063
{noformat}
[jira] [Created] (ARROW-3936) Add _O_NOINHERIT to the file open flags on Windows
Philip Felton created ARROW-3936:

Summary: Add _O_NOINHERIT to the file open flags on Windows
Key: ARROW-3936
URL: https://issues.apache.org/jira/browse/ARROW-3936
Project: Apache Arrow
Issue Type: Bug
Reporter: Philip Felton

Unlike Linux, Windows doesn't let you delete files that are currently open in another process. So if you create a child process while a Parquet file is open, with the current code the file handle is inherited by the child process, and after closing the file the parent process cannot delete it until the child process terminates.

By default, Win32 file handles are not inheritable (likely because of the aforementioned problem). The exception is _wsopen_s, which tries to maintain POSIX compatibility.

This is a serious problem for us. We would argue that specifying _O_NOINHERIT by default in the _MSC_VER path is a sensible approach, and is likely the correct behaviour as it matches the main Win32 API. However, some developers may rely on the current inheritable behaviour; in that case the Arrow public API should take a boolean argument controlling whether the created file descriptor is inheritable. But this would break API backward compatibility (unless a new overloaded method is introduced).

Is forking and inheriting Arrow's internal file descriptors something that Arrow actually means to support? What do we think of the proposed fix?
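For reference, the proposed default can be illustrated with standard Python rather than Arrow's API (the actual fix would live in Arrow's C++ _MSC_VER open path): since PEP 446, Python opens file descriptors non-inheritable by default, with an explicit opt-in, which is the same shape as the behaviour suggested above.

```python
import os
import tempfile

# Descriptors are non-inheritable by default (PEP 446), matching the
# default this report proposes for Arrow on Windows.
fd, path = tempfile.mkstemp()
print(os.get_inheritable(fd))  # False: a spawned child won't hold this open

# An explicit opt-in, as a new overload taking a boolean might allow:
os.set_inheritable(fd, True)
print(os.get_inheritable(fd))  # True

os.close(fd)
os.unlink(path)
```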