[jira] [Created] (ARROW-6395) [pyarrow] Bug when using bool arrays with stride greater than 1

2019-08-30 Thread Philip Felton (Jira)
Philip Felton created ARROW-6395:


 Summary: [pyarrow] Bug when using bool arrays with stride greater 
than 1
 Key: ARROW-6395
 URL: https://issues.apache.org/jira/browse/ARROW-6395
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.14.0
Reporter: Philip Felton


Here's code to reproduce it:

{code:python}
>>> import numpy as np
>>> import pyarrow as pa
>>> pa.__version__
'0.14.0'
>>> xs = np.array([True, False, False, True, True, False, True, True, True,
...                False, False, False, False, False, True, False, True, True,
...                True, True, True])
>>> xs_sliced = xs[0::2]
>>> xs_sliced
array([ True, False,  True,  True,  True, False, False,  True,  True,
        True,  True])
>>> pa_xs = pa.array(xs_sliced, pa.bool_())
>>> pa_xs
<pyarrow.lib.BooleanArray object at 0x...>
[
  true,
  false,
  false,
  false,
  false,
  false,
  false,
  false,
  false,
  false,
  false
]{code}
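
A possible workaround while this stands, assuming the bug is triggered by the
non-contiguous (strided) NumPy buffer of the slice, is to copy it into a
contiguous array before conversion (a sketch, not a confirmed fix):

{code:python}
import numpy as np
import pyarrow as pa

# Copy the strided slice into a contiguous buffer first; pa.array then
# sees a stride-1 array and converts the values correctly.
pa_xs_fixed = pa.array(np.ascontiguousarray(xs_sliced), pa.bool_())
{code}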






[jira] [Created] (ARROW-6149) [Parquet] Decimal comparisons used for min/max statistics are not correct

2019-08-06 Thread Philip Felton (JIRA)
Philip Felton created ARROW-6149:


 Summary: [Parquet] Decimal comparisons used for min/max statistics 
are not correct
 Key: ARROW-6149
 URL: https://issues.apache.org/jira/browse/ARROW-6149
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Philip Felton


The [Parquet Format specification|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md]
says:

bq. If the column uses int32 or int64 physical types, then signed comparison of 
the integer values produces the correct ordering. If the physical type is 
fixed, then the correct ordering can be produced by flipping the 
most-significant bit in the first byte and then using unsigned byte-wise 
comparison.

However, this isn't followed in the C++ Parquet code: 16-byte decimal comparison
is implemented as a lexicographical comparison of signed chars.

This appears to be because
[the comparator in statistics.cc|https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L183]
is selected purely from the sort_order (signed) and the physical_type
(FIXED_LENGTH_BYTE_ARRAY); there is no override for decimal.
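
For illustration, here is a minimal Python sketch of the rule the spec
describes for fixed-length big-endian two's-complement values: flip the
most-significant bit of the first byte, then compare byte-wise unsigned.
The function name is hypothetical and not part of the C++ code:

{code:python}
def compare_fixed_decimal(a: bytes, b: bytes) -> int:
    """Compare equal-length big-endian two's-complement byte strings.

    Flipping the sign bit maps negative values below positive ones,
    so unsigned byte-wise comparison then yields numeric order.
    """
    ka = bytes([a[0] ^ 0x80]) + a[1:]
    kb = bytes([b[0] ^ 0x80]) + b[1:]
    return (ka > kb) - (ka < kb)

neg1 = (-1).to_bytes(4, "big", signed=True)   # b'\xff\xff\xff\xff'
pos1 = (1).to_bytes(4, "big", signed=True)    # b'\x00\x00\x00\x01'
assert neg1 > pos1                            # naive byte-wise order: wrong
assert compare_fixed_decimal(neg1, pos1) < 0  # sign-bit flip: correct
{code}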





[jira] [Created] (ARROW-5630) [Python] Table of nested arrays doesn't round trip

2019-06-17 Thread Philip Felton (JIRA)
Philip Felton created ARROW-5630:


 Summary: [Python] Table of nested arrays doesn't round trip
 Key: ARROW-5630
 URL: https://issues.apache.org/jira/browse/ARROW-5630
 Project: Apache Arrow
  Issue Type: Bug
 Environment: pyarrow 0.13, Windows 10
Reporter: Philip Felton


This is pyarrow 0.13 on Windows.

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def make_table(num_rows):
    typ = pa.list_(pa.field("item", pa.float32(), False))
    return pa.Table.from_arrays([
        pa.array([[0] * (i % 10) for i in range(0, num_rows)], type=typ),
        pa.array([[0] * ((i + 5) % 10) for i in range(0, num_rows)], type=typ)
    ], ['a', 'b'])

pq.write_table(make_table(100), 'test.parquet')

pq.read_table('test.parquet')
{code}

The last line throws the following exception:


{noformat}
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 pq.read_table('full.parquet')

~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, filesystem)
   1150         return fs.read_parquet(path, columns=columns,
   1151                                use_threads=use_threads, metadata=metadata,
-> 1152                                use_pandas_metadata=use_pandas_metadata)
   1153
   1154     pf = ParquetFile(source, metadata=metadata)

~\Anaconda3\lib\site-packages\pyarrow\filesystem.py in read_parquet(self, path, columns, metadata, schema, use_threads, use_pandas_metadata)
    179                                  filesystem=self)
    180         return dataset.read(columns=columns, use_threads=use_threads,
--> 181                             use_pandas_metadata=use_pandas_metadata)
    182
    183     def open(self, path, mode='rb'):

~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, use_threads, use_pandas_metadata)
   1012             table = piece.read(columns=columns, use_threads=use_threads,
   1013                                partitions=self.partitions,
-> 1014                                use_pandas_metadata=use_pandas_metadata)
   1015             tables.append(table)
   1016

~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, use_threads, partitions, open_file_func, file, use_pandas_metadata)
    562             table = reader.read_row_group(self.row_group, **options)
    563         else:
--> 564             table = reader.read(**options)
    565
    566         if len(self.partition_keys) > 0:

~\Anaconda3\lib\site-packages\pyarrow\parquet.py in read(self, columns, use_threads, use_pandas_metadata)
    212             columns, use_pandas_metadata=use_pandas_metadata)
    213         return self.reader.read_all(column_indices=column_indices,
--> 214                                     use_threads=use_threads)
    215
    216     def scan_contents(self, columns=None, batch_size=65536):

~\Anaconda3\lib\site-packages\pyarrow\_parquet.pyx in pyarrow._parquet.ParquetReader.read_all()

~\Anaconda3\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Column 1 named b expected length 932066 but got length 932063
{noformat}
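
For completeness, the failure can also be expressed as a round-trip check
(a sketch reusing the make_table helper above):

{code:python}
# With the bug present, read_table either raises ArrowInvalid as above
# or yields a table whose column lengths differ from the input.
table = make_table(100)
pq.write_table(table, 'test.parquet')
assert pq.read_table('test.parquet').equals(table)
{code}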






[jira] [Created] (ARROW-3936) Add _O_NOINHERIT to the file open flags on Windows

2018-12-04 Thread Philip Felton (JIRA)
Philip Felton created ARROW-3936:


 Summary: Add _O_NOINHERIT to the file open flags on Windows
 Key: ARROW-3936
 URL: https://issues.apache.org/jira/browse/ARROW-3936
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Philip Felton


Unlike Linux, Windows doesn't let you delete files that are currently open in 
another process. So if a child process is spawned while a Parquet file is open, 
with the current code the file handle is inherited by the child process, and 
the parent process can't delete the file after closing it until the child 
process terminates.

By default, Win32 file handles are not inheritable (likely because of the 
problem above). The exception is _wsopen_s, which tries to maintain POSIX 
compatibility.

This is a serious problem for us.

We would argue that specifying _O_NOINHERIT by default in the _MSC_VER path is 
a sensible approach and likely the correct behaviour, as it matches the main 
Win32 API.
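
For illustration only, here is a sketch in Python, whose Windows builds expose
the same CRT flag as os.O_NOINHERIT (the file name is hypothetical):

{code:python}
import os

# Windows-only sketch: os.O_BINARY and os.O_NOINHERIT exist only on
# Windows builds of CPython and map to the CRT's _O_BINARY/_O_NOINHERIT.
# Opening with _O_NOINHERIT keeps the descriptor out of child processes.
fd = os.open('test.parquet', os.O_RDONLY | os.O_BINARY | os.O_NOINHERIT)
try:
    header = os.read(fd, 4)  # the b'PAR1' magic bytes of a Parquet file
finally:
    os.close(fd)  # the handle was never inheritable, so children can't block deletion
{code}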

However, some developers may rely on the current inheritable behaviour. In 
that case, the Arrow public API should take a boolean argument controlling 
whether the created file descriptor is inheritable, but that would break API 
backward compatibility (unless a new overloaded method is introduced).

Is forking and inheriting Arrow's internal file descriptors something that 
Arrow actually means to support?

What do we think of the proposed fix?


