[jira] [Created] (ARROW-9782) [C++][Dataset] Ability to write ".feather" files with IpcFileFormat

2020-08-18 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9782:


 Summary: [C++][Dataset] Ability to write ".feather" files with 
IpcFileFormat
 Key: ARROW-9782
 URL: https://issues.apache.org/jira/browse/ARROW-9782
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python, R
Reporter: Joris Van den Bossche


With the new dataset writing bindings, one can do {{ds.write_dataset(data, 
format="feather")}} (Python) or {{write_dataset(data, format = "feather")}} (R) 
to write a dataset to feather files. 

However, because "feather" is just an alias for the IpcFileFormat, it will 
currently write all files with the {{.ipc}} extension.   
I think this can be a bit confusing, since many people will be more familiar 
with "feather" and expect such an extension. 

(more generally, ".ipc" is maybe not the best default, since it's not a very 
descriptive extension. Something like ".arrow" might be better?)
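
As a possible workaround today, a minimal sketch (assuming a {{table}} to write, and relying on {{basename_template}} to control the file names) could force the desired extension explicitly:

{code:python}
import pyarrow.dataset as ds

# explicit basename_template so the written files get a ".feather" extension
# instead of the default ".ipc" one ("{i}" is replaced by an incrementing integer)
ds.write_dataset(table, "dataset_root", format="feather",
                 basename_template="part-{i}.feather")
{code}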

cc [~npr] [~bkietz]






[jira] [Created] (ARROW-9864) [Python] pathlib.Path not supported in write_to_dataset with partition columns

2020-08-26 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9864:


 Summary: [Python] pathlib.Path not supported in write_to_dataset 
with partition columns
 Key: ARROW-9864
 URL: https://issues.apache.org/jira/browse/ARROW-9864
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Copying over from https://github.com/pandas-dev/pandas/issues/35902


{code:python}
import pathlib

import pandas as pd

df = pd.DataFrame({'A':[1,2,3,4], 'B':'C'})

df.to_parquet('tmp_path1.parquet')  # OK
df.to_parquet(pathlib.Path('tmp_path2.parquet'))  # OK

df.to_parquet('tmp_path3.parquet', partition_cols=['B'])  # OK
df.to_parquet(pathlib.Path('tmp_path4.parquet'), partition_cols=['B'])  # 
TypeError
{code}

The {{to_parquet}} method raises a TypeError when passing a {{pathlib.Path()}} as 
path while the {{partition_cols}} argument is not None. If no partition columns 
are provided, the {{pathlib.Path()}} is accepted properly.

{code}
---
TypeError Traceback (most recent call last)
 in 
  3 
  4 df.to_parquet('tmp_path3.parquet', partition_cols=['B']) # OK
> 5 df.to_parquet(pathlib.Path('tmp_path4.parquet'), partition_cols=['B'])  
# TypeError
...

~/miniconda3/lib/python3.7/site-packages/pyarrow/parquet.py in 
write_to_dataset(table, root_path, partition_cols, partition_filename_cb, 
filesystem, **kwargs)
   1790 subtable = pa.Table.from_pandas(subgroup, schema=subschema,
   1791 safe=False)
-> 1792 _mkdir_if_not_exists(fs, '/'.join([root_path, subdir]))
   1793 if partition_filename_cb:
   1794 outfile = partition_filename_cb(keys)

TypeError: sequence item 0: expected str instance, PosixPath found
{code}





[jira] [Created] (ARROW-9875) [Python] Let FileSystem.get_file_info accept a single path

2020-08-27 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9875:


 Summary: [Python] Let FileSystem.get_file_info accept a single path
 Key: ARROW-9875
 URL: https://issues.apache.org/jira/browse/ARROW-9875
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Currently you need to do {{fs.get_file_info([path])[0]}} to get the info of a 
single path. We can make the function also accept a single path directly 
(instead of a list).
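
A minimal sketch of the current vs the proposed usage (the single-path form is the suggestion here, not the existing API; the path is just an example):

{code:python}
from pyarrow import fs

local = fs.LocalFileSystem()

# current: a list is required, even for a single path
info = local.get_file_info(["/tmp/some_file"])[0]

# proposed: accept a single path directly and return a single FileInfo
info = local.get_file_info("/tmp/some_file")
{code}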





[jira] [Created] (ARROW-9893) [Python] Bindings for writing datasets to Parquet

2020-09-01 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9893:


 Summary: [Python] Bindings for writing datasets to Parquet
 Key: ARROW-9893
 URL: https://issues.apache.org/jira/browse/ARROW-9893
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Added to C++ in ARROW-9646; follow-up on the Python bindings of ARROW-9658.





[jira] [Created] (ARROW-9906) [Python] Crash in test_parquet.py::test_parquet_writer_filesystem_s3_uri (closing NativeFile from S3FileSystem)

2020-09-03 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9906:


 Summary: [Python] Crash in 
test_parquet.py::test_parquet_writer_filesystem_s3_uri (closing NativeFile from 
S3FileSystem)
 Key: ARROW-9906
 URL: https://issues.apache.org/jira/browse/ARROW-9906
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Joris Van den Bossche
 Fix For: 2.0.0








[jira] [Created] (ARROW-9920) [Python] pyarrow.concat_arrays segfaults when passing it a chunked array

2020-09-05 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9920:


 Summary: [Python] pyarrow.concat_arrays segfaults when passing it 
a chunked array
 Key: ARROW-9920
 URL: https://issues.apache.org/jira/browse/ARROW-9920
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


One can concatenate the chunks of a ChunkedArray with {{concat_arrays}} by 
passing it the list of chunks:

{code}
In [1]: arr = pa.chunked_array([[0, 1], [3, 4]])

In [2]: pa.concat_arrays(arr.chunks)
Out[2]: 

[
  0,
  1,
  3,
  4
]
{code}

but if passing the chunked array itself, you get a segfault:

{code}
In [4]: pa.concat_arrays(arr)
Segmentation fault (core dumped)
{code}





[jira] [Created] (ARROW-9936) [Python] Fix / test relative file paths in pyarrow.parquet

2020-09-08 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9936:


 Summary: [Python] Fix / test relative file paths in pyarrow.parquet
 Key: ARROW-9936
 URL: https://issues.apache.org/jira/browse/ARROW-9936
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 2.0.0


It seems that I broke writing parquet to relative file paths in ARROW-9718 
(again, something similar happened in the pyarrow.dataset reading), so we should 
fix that and properly test this.

{code}
In [3]: pq.write_table(table, "test_relative.parquet")
...
~/scipy/repos/arrow/python/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.from_uri()

ArrowInvalid: URI has empty scheme: 'test_relative.parquet'
{code}






[jira] [Created] (ARROW-9938) [Python] Add filesystem capabilities to other IO formats (feather, csv, json, ..)?

2020-09-08 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9938:


 Summary: [Python] Add filesystem capabilities to other IO formats 
(feather, csv, json, ..)?
 Key: ARROW-9938
 URL: https://issues.apache.org/jira/browse/ARROW-9938
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


In the parquet IO functions, we support reading/writing files from non-local 
filesystems directly (in addition to passing a buffer) by:

- passing a URI (eg {{pq.read_table("s3://bucket/data.parquet")}})
- specifying the filesystem keyword (eg 
{{pq.read_table("bucket/data.parquet", filesystem=S3FileSystem(...))}})

On the other hand, for other file formats such as feather, we only support 
local files. So for those, you need to do something more manual (I _suppose_ 
this works?):

{code:python}
from pyarrow import fs, feather

s3 = fs.S3FileSystem()

with s3.open_input_file("bucket/data.arrow") as file:
  table = feather.read_table(file)
{code}

So I think the question comes up: do we want to extend this filesystem support 
to other file formats (feather, csv, json) and make this more uniform across 
pyarrow, or do we prefer to keep the plain readers more low-level (and people 
can use the datasets API for more convenience)?

cc [~apitrou] [~kszucs]






[jira] [Created] (ARROW-9952) [Python] Use pyarrow.dataset writing for pq.write_to_dataset

2020-09-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9952:


 Summary: [Python] Use pyarrow.dataset writing for 
pq.write_to_dataset
 Key: ARROW-9952
 URL: https://issues.apache.org/jira/browse/ARROW-9952
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 2.0.0


Now that ARROW-9658 and ARROW-9893 are in, we can explore using the 
{{pyarrow.dataset}} writing capabilities in {{parquet.write_to_dataset}}.

Similarly to what was done in {{pq.read_table}}, we could initially have a 
keyword to switch between both implementations, eventually defaulting to the 
new datasets one and deprecating the old (inefficient) Python implementation.
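
As an illustration of such a switch (the keyword name below is hypothetical and only mirrors the {{pq.read_table}} precedent; it assumes a {{table}} with a "part" column):

{code:python}
import pyarrow.parquet as pq

# hypothetical keyword to toggle between the legacy Python implementation and
# the new pyarrow.dataset-based one; the actual name/default is to be decided
pq.write_to_dataset(table, "dataset_root",
                    partition_cols=["part"],
                    use_legacy_dataset=False)
{code}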





[jira] [Created] (ARROW-9962) [Python] Conversion to pandas with index column using fixed timezone fails

2020-09-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9962:


 Summary: [Python] Conversion to pandas with index column using 
fixed timezone fails
 Key: ARROW-9962
 URL: https://issues.apache.org/jira/browse/ARROW-9962
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


From https://github.com/pandas-dev/pandas/issues/35997: it seems we are 
handling a normal column and index column differently in the conversion to 
pandas.

{code}
In [5]: import pandas as pd
   ...: from datetime import datetime, timezone
   ...: 
   ...: df = pd.DataFrame([[datetime.now(timezone.utc), 
datetime.now(timezone.utc)]], columns=['date_index', 'date_column'])
   ...: table = pa.Table.from_pandas(df.set_index('date_index'))
   ...: 

In [6]: table
Out[6]: 
pyarrow.Table
date_column: timestamp[ns, tz=+00:00]
date_index: timestamp[ns, tz=+00:00]

In [7]: table.to_pandas()
...
UnknownTimeZoneError: '+00:00'
{code}

So this happens specifically for "fixed offset" timezones, and only for index 
columns (eg {{table.select(["date_column"]).to_pandas()}} works fine).

It seems this is because for columns we use our helper {{make_tz_aware}} to 
convert the string "+01:00" to a python timezone, which is then understood by 
pandas (the string is not handled by pandas). But for the index column we fail 
to do this.
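
A minimal sketch of the kind of conversion such a helper needs to do for a fixed-offset string (a standalone illustration, not the actual pyarrow internals):

{code:python}
from datetime import timedelta, timezone

def fixed_offset_to_tzinfo(offset):
    # convert a "+HH:MM" / "-HH:MM" string to a stdlib timezone object,
    # which pandas does understand
    sign = 1 if offset.startswith("+") else -1
    hours, minutes = map(int, offset[1:].split(":"))
    return timezone(sign * timedelta(hours=hours, minutes=minutes))

fixed_offset_to_tzinfo("+01:00")  # timezone(timedelta(seconds=3600))
{code}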





[jira] [Created] (ARROW-9963) [Python] Recognize datetime.timezone.utc as UTC on conversion python->pyarrow

2020-09-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9963:


 Summary: [Python] Recognize datetime.timezone.utc as UTC on 
conversion python->pyarrow
 Key: ARROW-9963
 URL: https://issues.apache.org/jira/browse/ARROW-9963
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Related to ARROW-5248, but specifically for the stdlib 
{{datetime.timezone.utc}}, I think it would be nice to "recognize" this as UTC. 
Currently it is converted to "+00:00", while for pytz this is not the case:

{code}
import pyarrow as pa
from datetime import datetime, timezone
import pytz

print(pa.array([datetime.now(timezone.utc)]).type)
print(pa.array([datetime.now(pytz.utc)]).type)
{code}

gives

{code}
timestamp[us, tz=+00:00]
timestamp[us, tz=UTC]
{code}





[jira] [Created] (ARROW-10091) [C++][Dataset] Support isin filter for row group (statistics-based) filtering

2020-09-25 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10091:
-

 Summary: [C++][Dataset] Support isin filter for row group 
(statistics-based) filtering
 Key: ARROW-10091
 URL: https://issues.apache.org/jira/browse/ARROW-10091
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


Currently the {{isin}} filter works for partition-based filtering, but not for 
row group (statistics)-based filtering. 
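
For reference, a minimal sketch of the kind of filtered read this is about (the dataset path is hypothetical):

{code:python}
import pyarrow.dataset as ds

dataset = ds.dataset("path/to/dataset", format="parquet", partitioning="hive")

# "isin" on a partition field already prunes files; the request is to also use
# row group statistics for pruning when filtering on a regular column
table = dataset.to_table(filter=ds.field("col").isin([1, 2, 3]))
{code}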





[jira] [Created] (ARROW-10099) [C++][Dataset] Also allow integer partition fields to be dictionary encoded

2020-09-25 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10099:
-

 Summary: [C++][Dataset] Also allow integer partition fields to be 
dictionary encoded
 Key: ARROW-10099
 URL: https://issues.apache.org/jira/browse/ARROW-10099
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 2.0.0


In ARROW-8647, we added the option to indicate that your partition field columns 
should be dictionary encoded, but it currently only does this for the string 
type, and not for integer types (with the reasoning that for integers, 
dictionary encoding does not give any memory efficiency gains).

In dask, they have been using categorical dtypes for _all_ partition fields, 
also if they are integers. They would like to keep doing this (apart from 
memory efficiency, using a categorical/dictionary type also gives information 
about all unique values of the column, without having to calculate them), so 
it would be nice to enable this use case.

So I think we could either simply always dictionary encode integers as well when 
{{max_partition_dictionary_size}} indicates that partition fields should be 
dictionary encoded, or have an additional option to indicate that integer 
partition fields should also be encoded (when the other option indicates 
dictionary encoding should be used).

cc [~rjzamora] [~bkietz]





[jira] [Created] (ARROW-10100) [C++]

2020-09-25 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10100:
-

 Summary: [C++]
 Key: ARROW-10100
 URL: https://issues.apache.org/jira/browse/ARROW-10100
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Joris Van den Bossche








[jira] [Created] (ARROW-10130) [C++][Dataset] ParquetFileFragment::SplitByRowGroup does not preserve "complete_metadata" status

2020-09-29 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10130:
-

 Summary: [C++][Dataset] ParquetFileFragment::SplitByRowGroup does 
not preserve "complete_metadata" status
 Key: ARROW-10130
 URL: https://issues.apache.org/jira/browse/ARROW-10130
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 2.0.0


Splitting a ParquetFileFragment into multiple fragments per row group 
({{SplitByRowGroup}}) calls {{EnsureCompleteMetadata}} initially, but the 
fragments created afterwards don't have the {{has_complete_metadata_}} property 
set. This means that when calling {{EnsureCompleteMetadata}} on the split 
fragments, it will read/parse the metadata again, instead of using the cached 
metadata (which is already present).

Small example to illustrate:

{code:python}
In [1]: import pyarrow.dataset as ds

In [2]: dataset = 
ds.parquet_dataset("nyc-taxi-data/dask-partitioned/_metadata", 
partitioning="hive")

In [3]: rg_fragments = [rg for frag in dataset.get_fragments() for rg in 
frag.split_by_row_group()]

In [4]: len(rg_fragments)
Out[4]: 4520

# row group fragments actually have statistics
In [7]: rg_fragments[0].row_groups[0].statistics
Out[7]: 
{'vendor_id': {'min': '1', 'max': '4'},
 'pickup_at': {'min': datetime.datetime(2009, 1, 1, 0, 5, 51),
  'max': datetime.datetime(2018, 12, 26, 14, 48, 54)},
...

# but calling ensure_complete_metadata still takes a lot of time the first call
In [8]: %time _ = [fr.ensure_complete_metadata() for fr in rg_fragments]
CPU times: user 1.72 s, sys: 203 ms, total: 1.92 s
Wall time: 1.9 s

In [9]: %time _ = [fr.ensure_complete_metadata() for fr in rg_fragments]
CPU times: user 1.34 ms, sys: 0 ns, total: 1.34 ms
Wall time: 1.35 ms
{code}





[jira] [Created] (ARROW-10131) [C++][Dataset] Lazily parse parquet metadata / statistics in ParquetDatasetFactory and ParquetFileFragment

2020-09-29 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10131:
-

 Summary: [C++][Dataset] Lazily parse parquet metadata / statistics 
in ParquetDatasetFactory and ParquetFileFragment
 Key: ARROW-10131
 URL: https://issues.apache.org/jira/browse/ARROW-10131
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


Related to ARROW-9730, parsing of the statistics in parquet metadata is 
expensive, and therefore should be avoided when possible.

For example, the {{ParquetDatasetFactory}} ({{ds.parquet_dataset()}} in python) 
parses all statistics of all files and all columns. While when doing a filtered 
read, you might only need the statistics of certain files (eg if a filter on a 
partition field already excluded many files) and certain columns (eg only the 
columns on which you are actually filtering).

The current API is a bit all-or-nothing (both ParquetDatasetFactory and a later 
EnsureCompleteMetadata parse all statistics, and don't allow parsing only a 
subset, or parsing only the other (non-statistics) metadata, ...), so I think we 
should try to come up with better abstractions.

cc [~rjzamora] [~bkietz]





[jira] [Created] (ARROW-10134) [C++][Dataset] Add ParquetFileFragment::num_row_groups property

2020-09-29 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10134:
-

 Summary: [C++][Dataset] Add ParquetFileFragment::num_row_groups 
property
 Key: ARROW-10134
 URL: https://issues.apache.org/jira/browse/ARROW-10134
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 2.0.0


From https://github.com/dask/dask/pull/6534#issuecomment-699512602, comment by 
[~rjzamora]:

bq. it would be great to have access to the total row-group count for the 
fragment from a {{num_row_groups}} attribute (which pyarrow should be able to 
get without parsing all row-group metadata/statistics - I think?).

One question is: does this attribute correspond to the row groups in the 
parquet file, or to the (subset of) row groups represented by the fragment? 
I expect the second (so if you do SplitByRowGroup, you would get a fragment 
with num_row_groups==1), but this might be a potentially confusing aspect of 
the attribute.





[jira] [Created] (ARROW-10145) [C++][Dataset] Integer-like partition field values outside int32 range error on reading

2020-09-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10145:
-

 Summary: [C++][Dataset] Integer-like partition field values 
outside int32 range error on reading
 Key: ARROW-10145
 URL: https://issues.apache.org/jira/browse/ARROW-10145
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


From https://stackoverflow.com/questions/64137664/how-to-override-type-inference-for-partition-columns-in-hive-partitioned-dataset

Small reproducer:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'part': [3760212050]*10, 'col': range(10)})
pq.write_to_dataset(table, "test_int64_partition", partition_cols=['part'])

In [35]: pq.read_table("test_int64_partition/")
...
ArrowInvalid: error parsing '3760212050' as scalar of type int32
In ../src/arrow/scalar.cc, line 333, code: VisitTypeInline(*type_, this)
In ../src/arrow/dataset/partition.cc, line 218, code: 
(_error_or_value26).status()
In ../src/arrow/dataset/partition.cc, line 229, code: 
(_error_or_value27).status()
In ../src/arrow/dataset/discovery.cc, line 256, code: 
(_error_or_value17).status()

In [36]: pq.read_table("test_int64_partition/", use_legacy_dataset=True)
Out[36]: 
pyarrow.Table
col: int64
part: dictionary
{code}





[jira] [Created] (ARROW-10244) [Python][Docs] Add docs on using pyarrow.dataset.parquet_dataset

2020-10-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10244:
-

 Summary: [Python][Docs] Add docs on using 
pyarrow.dataset.parquet_dataset
 Key: ARROW-10244
 URL: https://issues.apache.org/jira/browse/ARROW-10244
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 2.0.0








[jira] [Created] (ARROW-10247) [C++][Dataset] Cannot write dataset with dictionary column as partition field

2020-10-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10247:
-

 Summary: [C++][Dataset] Cannot write dataset with dictionary 
column as partition field
 Key: ARROW-10247
 URL: https://issues.apache.org/jira/browse/ARROW-10247
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 2.0.0


When the column to use for partitioning is dictionary encoded, we get this 
error:

{code}
In [9]: import pyarrow.dataset as ds

In [10]: part = ["xxx"] * 3 + ["yyy"] * 3
...: table = pa.table([
...: pa.array(range(len(part))),
...: pa.array(part).dictionary_encode(),
...: ], names=['col', 'part'])

In [11]: part = ds.partitioning(table.select(["part"]).schema)

In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet", 
partitioning=part)
---
ArrowTypeErrorTraceback (most recent call last)
 in 
> 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet", 
partitioning=part)

~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data, base_dir, 
basename_template, format, partitioning, schema, filesystem, file_options, 
use_threads)
773 _filesystemdataset_write(
774 data, base_dir, basename_template, schema,
--> 775 filesystem, partitioning, file_options, use_threads,
776 )

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in 
pyarrow._dataset._filesystemdataset_write()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowTypeError: scalar xxx (of type string) is invalid for part: 
dictionary
In ../src/arrow/dataset/filter.cc, line 1082, code: 
VisitConjunctionMembers(*and_.left_operand(), visitor)
In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr, [&](const 
std::string& name, const std::shared_ptr& value) { auto&& 
_error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do { 
::arrow::Status __s = 
::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if 
((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s); 
_st.AddContextLine("../src/arrow/dataset/partition.cc", 257, 
"(_error_or_value28).status()"); return _st; } } while (0); } while (false); 
auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const 
auto& field = schema_->field(match[0]); if 
(!value->type->Equals(field->type())) { return Status::TypeError("scalar ", 
value->ToString(), " (of type ", *value->type, ") is invalid for ", 
field->ToString()); } values[match[0]] = value.get(); } return Status::OK(); })
In ../src/arrow/dataset/file_base.cc, line 321, code: 
(_error_or_value24).status()
In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish()
{code}

This seems a quite normal use case, as such a column will typically have many 
repeated values (and we also support reading it back as dictionary type, so a 
roundtrip is currently not possible in that case).

I tagged it for 2.0.0 for the moment in case a fix is possible soon, but I 
didn't yet look into how easy it would be to fix.





[jira] [Created] (ARROW-10248) [C++][Dataset] Dataset writing does not write schema metadata

2020-10-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10248:
-

 Summary: [C++][Dataset] Dataset writing does not write schema 
metadata
 Key: ARROW-10248
 URL: https://issues.apache.org/jira/browse/ARROW-10248
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 2.0.0


Not sure if this is related to the writing refactor that landed yesterday, but 
{{write_dataset}} does not preserve the schema metadata (eg used for the pandas 
metadata):

{code}
In [20]: df = pd.DataFrame({'a': [1, 2, 3]})

In [21]: table = pa.Table.from_pandas(df)

In [22]: table.schema
Out[22]: 
a: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 396

In [23]: ds.write_dataset(table, "test_write_dataset_pandas", format="parquet")

In [24]: pq.read_table("test_write_dataset_pandas/part-0.parquet").schema
Out[24]: 
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
{code}

I tagged it for 2.0.0 for the moment in case a fix is possible soon, but I 
didn't yet look into how easy it would be to fix.

cc [~bkietz]





[jira] [Created] (ARROW-10264) [C++][Python] Parquet test failing with HadoopFileSystem URI

2020-10-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10264:
-

 Summary: [C++][Python] Parquet test failing with HadoopFileSystem 
URI
 Key: ARROW-10264
 URL: https://issues.apache.org/jira/browse/ARROW-10264
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Joris Van den Bossche
 Fix For: 3.0.0


Follow-up on ARROW-10175. In the HDFS integration tests, there is a test using 
a URI failing if we use the new filesystem / dataset implementation:

{code}
FAILED 
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_hdfs.py::TestLibHdfs::test_read_multiple_parquet_files_with_uri
{code}

fails with

{code}
pyarrow.lib.ArrowInvalid: Path 
'/tmp/pyarrow-test-838/multi-parquet-uri-48569714efc74397816722c9c6723191/0.parquet'
 is not relative to '/user/root'
{code}

This happens even though the test is passing a URI (and not a filesystem object) 
to {{parquet.read_table}}, and the new filesystem/dataset implementation should 
be able to handle URIs.

cc [~apitrou]





[jira] [Created] (ARROW-10281) [Python] Fix warning when running tests

2020-10-12 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10281:
-

 Summary: [Python] Fix warning when running tests
 Key: ARROW-10281
 URL: https://issues.apache.org/jira/browse/ARROW-10281
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche


We have accumulated quite a few warnings.





[jira] [Created] (ARROW-10282) [Python] Conversion from custom types (eg decimal) to int dtype raises warning

2020-10-12 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10282:
-

 Summary: [Python] Conversion from custom types (eg decimal) to int 
dtype raises warning
 Key: ARROW-10282
 URL: https://issues.apache.org/jira/browse/ARROW-10282
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


{code:python}
In [2]: import decimal

In [3]: pa.array([decimal.Decimal("123456")], pa.int32())
DeprecationWarning: an integer is required (got type decimal.Decimal).  
Implicit conversion to integers using __int__ is deprecated, and may be removed 
in a future version of Python.

Out[3]: 

[
  123456,
]
{code}

cc [~kszucs]





[jira] [Created] (ARROW-10283) [Python] Python deprecation warning for "PY_SSIZE_T_CLEAN will be required for '#' formats"

2020-10-12 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10283:
-

 Summary: [Python] Python deprecation warning for "PY_SSIZE_T_CLEAN 
will be required for '#' formats"
 Key: ARROW-10283
 URL: https://issues.apache.org/jira/browse/ARROW-10283
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 3.0.0


We have a few cases that run into this python deprecation warning:

{code}
pyarrow/tests/test_pandas.py: 9 warnings
pyarrow/tests/test_parquet.py: 7790 warnings
  sys:1: DeprecationWarning: PY_SSIZE_T_CLEAN will be required for '#' formats
pyarrow/tests/test_pandas.py::TestConvertDecimalTypes::test_decimal_with_None_explicit_type
pyarrow/tests/test_pandas.py::TestConvertDecimalTypes::test_decimal_with_None_infer_type
  /buildbot/AMD64_Conda_Python_3_8/python/pyarrow/tests/test_pandas.py:114: 
DeprecationWarning: PY_SSIZE_T_CLEAN will be required for '#' formats
result = pd.Series(arr.to_pandas(), name=s.name)
pyarrow/tests/test_pandas.py::TestConvertDecimalTypes::test_strided_objects
  /buildbot/AMD64_Conda_Python_3_8/python/pyarrow/pandas_compat.py:1127: 
DeprecationWarning: PY_SSIZE_T_CLEAN will be required for '#' formats
result = pa.lib.table_to_blocks(options, block_table, categories,
{code}

Related to https://bugs.python.org/issue36381

I think one such usage example is at 
https://github.com/apache/arrow/blob/0b481523b7502a984788d93b822a335894ffe648/cpp/src/arrow/python/decimal.cc#L106





[jira] [Created] (ARROW-10284) [Python] Pyarrow is raising deprecation warning about filesystems on import

2020-10-12 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10284:
-

 Summary: [Python] Pyarrow is raising deprecation warning about 
filesystems on import
 Key: ARROW-10284
 URL: https://issues.apache.org/jira/browse/ARROW-10284
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche


This happens on import (when setting the warning to be visible), so even when 
the user doesn't use the deprecated filesystems:

{code}
In [1]: import warnings

In [2]: warnings.simplefilter("always")

In [3]: import pyarrow
/home/joris/scipy/repos/arrow/python/pyarrow/filesystem.py:255: 
DeprecationWarning: pyarrow.filesystem.LocalFileSystem is deprecated as of 
2.0.0, please use pyarrow.fs.LocalFileSystem instead.
  cls._instance = LocalFileSystem()
{code}






[jira] [Created] (ARROW-10285) [Python] pyarrow.orc submodule is using deprecated functionality

2020-10-12 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10285:
-

 Summary: [Python] pyarrow.orc submodule is using deprecated 
functionality
 Key: ARROW-10285
 URL: https://issues.apache.org/jira/browse/ARROW-10285
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche








[jira] [Created] (ARROW-10347) [Python][Dataset] Test behaviour in case of duplicate partition field / data column

2020-10-19 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10347:
-

 Summary: [Python][Dataset] Test behaviour in case of duplicate 
partition field / data column
 Key: ARROW-10347
 URL: https://issues.apache.org/jira/browse/ARROW-10347
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche








[jira] [Created] (ARROW-10423) [C++] Filter compute function seems slow compared to numpy nonzero + take

2020-10-29 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10423:
-

 Summary: [C++] Filter compute function seems slow compared to 
numpy nonzero + take
 Key: ARROW-10423
 URL: https://issues.apache.org/jira/browse/ARROW-10423
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


From https://stackoverflow.com/questions/64581590/is-there-a-more-efficient-way-to-select-rows-from-a-pyarrow-table-based-on-conte

I made a smaller, simplified example:

{code:python}
import numpy as np
import pyarrow as pa

arr = pa.array(np.random.randn(1_000_000))

# mask with only few True values
mask1 = np.zeros(len(arr), dtype=bool)
mask1[np.random.randint(len(arr), size=100)] = True
mask1_pa = pa.array(mask1)

# mask with larger proportion of True values
mask2 = np.zeros(len(arr), dtype=bool)
mask2[np.random.randint(len(arr), size=10_000)] = True
mask2_pa = pa.array(mask2)
{code}

Timings of the Arrow {{Filter}} kernel vs using numpy to convert the mask into 
indices and then using a {{Take}} kernel:

{code}
# mask 1
In [3]: %timeit arr.filter(mask1_pa)
132 µs ± 4.44 µs per loop (mean ± std. dev. of 7 runs, 1 loops each)

In [4]: %%timeit
   ...: indices = np.nonzero(mask1)[0]
   ...: arr.take(indices)
114 µs ± 2.62 µs per loop (mean ± std. dev. of 7 runs, 1 loops each)

# mask 2
In [8]: %timeit arr.filter(mask2_pa)
711 µs ± 63.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [9]: %%timeit
   ...: indices = np.nonzero(mask2)[0]
   ...: arr.take(indices)
333 µs ± 6.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
{code}

So in the first case, both are quite similar in timing. But in the second case, 
the numpy+take version is faster. 

I know this might depend a lot on the actual proportion of True values and how 
they are positioned in the array (random vs concentrated), etc., so there is 
probably no general rule for which should be faster. 
But it still seems a potential indication that things can be optimized in the 
Filter kernel.





[jira] [Created] (ARROW-10425) [Python] Support reading (compressed) CSV file from remote file / binary blob

2020-10-29 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10425:
-

 Summary: [Python] Support reading (compressed) CSV file from 
remote file / binary blob
 Key: ARROW-10425
 URL: https://issues.apache.org/jira/browse/ARROW-10425
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


From https://stackoverflow.com/questions/64588076/how-can-i-read-a-csv-gz-file-with-pyarrow-from-a-file-object

Currently {{pyarrow.csv.read_csv}} happily takes a path to a compressed file and 
automatically decompresses it, but AFAIK this only works for local paths. 

It would be nice to in general support reading CSV from remote files (with URI 
/ specifying a filesystem), and in that case also support compression. 

In addition, we could also read a compressed file from a BytesIO / file-like 
object, but I am not sure we want that (as it would require a keyword to 
indicate the compression used).
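
For reference, a rough sketch of the manual workaround one can do today with the existing filesystem and decompression building blocks (bucket/path are hypothetical):

{code:python}
import pyarrow as pa
from pyarrow import csv, fs

s3 = fs.S3FileSystem()

with s3.open_input_stream("bucket/data.csv.gz") as raw:
    with pa.CompressedInputStream(raw, "gzip") as decompressed:
        table = csv.read_csv(decompressed)
{code}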





[jira] [Created] (ARROW-10432) [C++] CSV reader: support for multi-character / whitespace delimiter?

2020-10-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10432:
-

 Summary: [C++] CSV reader: support for multi-character / 
whitespace delimiter?
 Key: ARROW-10432
 URL: https://issues.apache.org/jira/browse/ARROW-10432
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


I don't know how useful general "multi-character" delimiter support is, but one 
specific type of it that seems useful is "whitespace delimited", meaning any 
whitespace (possibly multiple / different whitespace characters). 

In pandas you can achieve this either by passing {{delimiter="\s+"}} or by 
specifying {{delim_whitespace=True}} (both are equivalent; pandas special-cases 
{{delimiter="\s+"}}, as any other multi-character delimiter is interpreted as an 
actual regex and triggers the slower Python engine instead of the default C 
engine).
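
A small pandas sketch of the behaviour referred to above (made-up inline data):

{code:python}
import io
import pandas as pd

data = "a  b\t c\n1 2  3\n"

# both forms are equivalent; pandas special-cases "\s+" so it stays on the C engine
pd.read_csv(io.StringIO(data), delim_whitespace=True)
pd.read_csv(io.StringIO(data), sep=r"\s+")
{code}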

cc [~apitrou] [~npr]





[jira] [Created] (ARROW-10462) [Python] ParquetDatasetPiece's path broken when using fsspec fs on Windows

2020-11-02 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10462:
-

 Summary: [Python] ParquetDatasetPiece's path broken when using 
fsspec fs on Windows
 Key: ARROW-10462
 URL: https://issues.apache.org/jira/browse/ARROW-10462
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 2.0.1


Dask reported some failures starting with the pyarrow 2.0 release, and 
specifically on Windows: https://github.com/dask/dask/issues/6754

After some investigation, it seems that this is due to the 
{{ParquetDatasetPiece}}'s {{path}} attribute now returning a path with a 
mixture of {{\\}} and {{/}} in it. 

It specifically happens when dask is passing a posix-style base path pointing 
to the dataset base directory (so using all {{/}}), and passing an fsspec-based 
(local) filesystem. 
From a debugging output during one of the dask tests:

{code}
(Pdb) dataset

(Pdb) dataset.paths
'C:/Users/joris/AppData/Local/Temp/pytest-of-joris/pytest-25/test_partition_on_pyarrow_0'
(Pdb) dataset.pieces[0].path
'C:/Users/joris/AppData/Local/Temp/pytest-of-joris/pytest-25/test_partition_on_pyarrow_0\\a1=A\\a2=X\\part.0.parquet'
{code}

So you can see that the result here has a mix of {{\\}} and {{/}}. Using 
pyarrow 1.0, this was consistently using {{/}}.

The reason for the change is that in pyarrow 2.0 we started to replace the 
fsspec LocalFileSystem with our own LocalFileSystem (assuming that for a local 
filesystem these should be equivalent). But it seems that our own 
LocalFileSystem has a {{pathsep}} property that equals {{os.path.sep}}, which is 
{{\\}} on Windows 
(https://github.com/apache/arrow/blob/9231976609d352b7050f5c706b86c15e8c604927/python/pyarrow/filesystem.py#L304-L306).
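
A small illustration of how such mixed separators can arise (paths below are made up; on Windows {{os.path.sep}} is a backslash):

{code:python}
import os

# posix-style base path, as passed by dask
base_path = "C:/Users/joris/tmp/test_partition_on_pyarrow_0"

# joining with a pathsep equal to os.path.sep gives, on Windows, e.g.
# 'C:/Users/joris/tmp/test_partition_on_pyarrow_0\\a1=A\\a2=X\\part.0.parquet'
piece_path = os.path.sep.join([base_path, "a1=A", "a2=X", "part.0.parquet"])
{code}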

So note that while this started being broken in pyarrow 2.0 when using an fsspec 
filesystem, this was already "broken" before when using our own local 
filesystem (or when not passing any filesystem). But, 1) dask always passes an 
fsspec filesystem, and 2) dask uses the piece's path as a dictionary key and is 
thus especially sensitive to the change (when using it as a file path to read 
something in, it will probably still work even with the mixture of path 
separators).





[jira] [Created] (ARROW-10469) [CI][Python] Run dask integration tests on Windows

2020-11-02 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10469:
-

 Summary: [CI][Python] Run dask integration tests on Windows
 Key: ARROW-10469
 URL: https://issues.apache.org/jira/browse/ARROW-10469
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


So we can catch bugs like ARROW-10462 in advance





[jira] [Created] (ARROW-10471) [CI][Python] Ensure we have a test build with s3fs

2020-11-02 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10471:
-

 Summary: [CI][Python] Ensure we have a test build with s3fs
 Key: ARROW-10471
 URL: https://issues.apache.org/jira/browse/ARROW-10471
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche








[jira] [Created] (ARROW-10473) [Python] FSSpecHandler get_file_info with recursive selector not working with s3fs

2020-11-02 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10473:
-

 Summary: [Python] FSSpecHandler get_file_info with recursive 
selector not working with s3fs
 Key: ARROW-10473
 URL: https://issues.apache.org/jira/browse/ARROW-10473
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 2.0.1


The partitioned ParquetDataset tests are failing when using an s3fs filesystem 
(I am adding tests in https://github.com/apache/arrow/pull/8573). 
I need to come up with a more minimal test isolating the 
{{FileSystem.get_file_info}} behaviour, but from debugging the parquet tests it 
seems that it only lists the first level (and not further nested 
directories/files).
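
A sketch of the behaviour that such a minimal test needs to isolate (the bucket path is hypothetical; this assumes the fsspec handler wrapping an s3fs filesystem):

{code:python}
import s3fs
from pyarrow.fs import FileSelector, FSSpecHandler, PyFileSystem

fs = PyFileSystem(FSSpecHandler(s3fs.S3FileSystem()))

selector = FileSelector("bucket/dataset", recursive=True)
# expected: info for all nested directories/files, not only the first level
infos = fs.get_file_info(selector)
{code}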





[jira] [Created] (ARROW-10482) [Python] Specifying compression type on a column basis when writing Parquet not working

2020-11-03 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10482:
-

 Summary: [Python] Specifying compression type on a column basis 
when writing Parquet not working
 Key: ARROW-10482
 URL: https://issues.apache.org/jira/browse/ARROW-10482
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


From https://stackoverflow.com/questions/64666270/using-per-column-compression-codec-in-parquet-write-table

According to the docs, you can specify the compression type on a 
column-by-column basis, but that doesn't seem to be working:

{code}
In [5]: table = pa.table([[1, 2], [3, 4], [5, 6]], names=["foo", "bar", "baz"])

In [6]: pq.write_table(table, 'test1.parquet', 
compression=dict(foo='zstd',bar='snappy',baz='brotli'))
...
~/scipy/repos/arrow/python/pyarrow/_parquet.cpython-37m-x86_64-linux-gnu.so in 
string.from_py.__pyx_convert_string_from_py_std__in_string()

TypeError: expected bytes, str found
{code}






[jira] [Created] (ARROW-10546) [Python] Deprecate the S3FSWrapper class

2020-11-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10546:
-

 Summary: [Python] Deprecate the S3FSWrapper class
 Key: ARROW-10546
 URL: https://issues.apache.org/jira/browse/ARROW-10546
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Follow-up on ARROW-10433 / discussion at 
https://github.com/apache/arrow/pull/8557#issuecomment-724225124

The {{S3FSWrapper}} class has been used in the past to wrap s3fs filesystems, 
before fsspec subclassed the {{pyarrow.filesystem}} filesystems. That is, 
however, already more than two years ago, and AFAIK nobody should still be using 
{{S3FSWrapper}}.





[jira] [Created] (ARROW-10558) [Python] Filesystem S3 tests not independent (native s3 influences s3fs)

2020-11-11 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10558:
-

 Summary: [Python] Filesystem S3 tests not independent (native s3 
influences s3fs)
 Key: ARROW-10558
 URL: https://issues.apache.org/jira/browse/ARROW-10558
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


The filesystem tests in {{test_fs.py}} that are parametrized with all the 
tested filesystems have some "state" shared between them, at least in the case 
of S3. 

When a test is first run with our own S3FileSystem, which eg creates a 
directory, this directory is still present when we test the s3fs-wrapped 
filesystem, which causes some tests to pass that would otherwise fail if run in 
isolation.





[jira] [Created] (ARROW-10578) [C++] Comparison kernels crashing for string array with null string scalar

2020-11-13 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10578:
-

 Summary: [C++] Comparison kernels crashing for string array with 
null string scalar
 Key: ARROW-10578
 URL: https://issues.apache.org/jira/browse/ARROW-10578
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


Comparing a string array with a string scalar works:

{code}
In [1]: import pyarrow.compute as pc

In [2]: pc.equal(pa.array(["a", None, "b"]), pa.scalar("a", type="string"))
Out[2]: 

[
  true,
  null,
  false
]
{code}

but if the scalar is a null (from the proper string type), it crashes:

{code}
In [4]: pc.equal(pa.array(["a", None, "b"]), pa.scalar(None, type="string"))
Segmentation fault (core dumped)
{code}

(and not even any debug messages ...)





[jira] [Created] (ARROW-10640) [C++] A "where" kernel to combine two arrays based on a mask

2020-11-18 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10640:
-

 Summary: [C++] A "where" kernel to combine two arrays based on a 
mask
 Key: ARROW-10640
 URL: https://issues.apache.org/jira/browse/ARROW-10640
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Joris Van den Bossche


(from discussion in ARROW-9489 with [~maartenbreddels])

A general "where" kernel like {{np.where}} 
(https://numpy.org/doc/stable/reference/generated/numpy.where.html) seems a 
generally useful kernel to have, and could also help mimic some other Python 
(setitem-like) operations. 

The concrete use case in ARROW-9489 is to basically do a 
{{fill_null(array[string], array[string])}} which could be expressed as 
{{where(is_null(arr), arr2, arr)}}. 
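
For reference, a tiny numpy illustration of the requested semantics (not pyarrow code):

{code:python}
import numpy as np

mask = np.array([True, False, True])
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

np.where(mask, a, b)  # -> array([ 1, 20,  3])
{code}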





[jira] [Created] (ARROW-10641) [C++] A "replace" or "map" kernel to replace values in array based on mapping

2020-11-18 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10641:
-

 Summary: [C++] A "replace" or "map" kernel to replace values in 
array based on mapping
 Key: ARROW-10641
 URL: https://issues.apache.org/jira/browse/ARROW-10641
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Joris Van den Bossche


A "replace" or "map" kernel to replace values in an array based on a mapping. 
This would be similar to the pandas {{Series.replace}} (or {{Series.map}}) 
method; as a small illustration of what is meant:

{code}
In [41]: s = pd.Series(["Yes", "Y", "No", "N"])

In [42]: s
Out[42]: 
0    Yes
1      Y
2     No
3      N
dtype: object

In [43]: s.replace({"Y": "Yes", "N": "No"})
Out[43]: 
0    Yes
1    Yes
2     No
3     No
dtype: object

{code}

Note: in pandas the difference between "replace" and "map" is that replace will 
only replace a value if it is present in the mapping, while map will replace 
every value in the input array with the corresponding value in the mapping and 
return null if not present in the mapping.

Note: this is different from ARROW-10306, which is about string replacement 
_within_ array elements (replacing a substring in each string element in the 
array), while here it is about replacing full elements of the array.

cc [~maartenbreddels]





[jira] [Created] (ARROW-10643) [Python] Pandas<->pyarrow roundtrip failing to recreate index for empty dataframe

2020-11-18 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10643:
-

 Summary: [Python] Pandas<->pyarrow roundtrip failing to recreate 
index for empty dataframe
 Key: ARROW-10643
 URL: https://issues.apache.org/jira/browse/ARROW-10643
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Joris Van den Bossche


From https://github.com/pandas-dev/pandas/issues/37897

The roundtrip of an empty pandas.DataFrame _with_ an index (so no columns, but 
a non-zero shape for the rows) isn't faithful:

{code}
In [33]: df = pd.DataFrame(index=pd.RangeIndex(0, 10, 1))

In [34]: df
Out[34]: 
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [35]: df.shape
Out[35]: (10, 0)

In [36]: table = pa.table(df)

In [37]: table.to_pandas()
Out[37]: 
Empty DataFrame
Columns: []
Index: []

In [38]: table.to_pandas().shape
Out[38]: (0, 0)
{code}

Since the pandas metadata in the Table actually has this RangeIndex 
information:

{code}
In [39]: table.schema.pandas_metadata
Out[39]: 
{'index_columns': [{'kind': 'range',
   'name': None,
   'start': 0,
   'stop': 10,
   'step': 1}],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'empty',
   'numpy_type': 'object',
   'metadata': None}],
 'columns': [],
 'creator': {'library': 'pyarrow', 'version': '3.0.0.dev162+g305160495'},
 'pandas_version': '1.2.0.dev0+1225.g91f5bfcdc4'}
{code}

we should in principle be able to correctly roundtrip this case.





[jira] [Created] (ARROW-10644) [Python] Consolidate path/filesystem handling in pyarrow.dataset and pyarrow.fs

2020-11-18 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10644:
-

 Summary: [Python] Consolidate path/filesystem handling in 
pyarrow.dataset and pyarrow.fs
 Key: ARROW-10644
 URL: https://issues.apache.org/jira/browse/ARROW-10644
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


The {{pyarrow.dataset}} module grew some custom code to deal with paths and 
filesystems, but the {{pyarrow.fs}} package also has some general utilities for 
this; we should consolidate that handling.





[jira] [Created] (ARROW-10663) [C++/Doc] The IsIn kernel ignores the skip_nulls option of SetLookupOptions

2020-11-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10663:
-

 Summary: [C++/Doc] The IsIn kernel ignores the skip_nulls option 
of SetLookupOptions
 Key: ARROW-10663
 URL: https://issues.apache.org/jira/browse/ARROW-10663
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 3.0.0


The C++ docs of {{SetLookupOptions}} have this explanation of the {{skip_nulls}} 
option:

{code}
  /// Whether nulls in `value_set` count for lookup.
  ///
  /// If true, any null in `value_set` is ignored and nulls in the input
  /// produce null (IndexIn) or false (IsIn) values in the output.
  /// If false, any null in `value_set` is successfully matched in
  /// the input.
  bool skip_nulls;
{code}

(from 
https://github.com/apache/arrow/blob/8b9f6b9d28b4524724e60fac589fb1a3552a32b4/cpp/src/arrow/compute/api_scalar.h#L78-L84)

However, for {{IsIn}} this explanation doesn't seem to hold in practice:

{code}
In [16]: arr = pa.array([1, 2, None])

In [17]: pc.is_in(arr, value_set=pa.array([1, None]), skip_null=True)
Out[17]: 

[
  true,
  false,
  true
]

In [18]: pc.is_in(arr, value_set=pa.array([1, None]), skip_null=False)
Out[18]: 

[
  true,
  false,
  true
]
{code}

This documentation was added in https://github.com/apache/arrow/pull/7695 
(ARROW-8989).

BTW, for "index_in", it works as documented:

{code}
In [19]: pc.index_in(arr, value_set=pa.array([1, None]), skip_null=True)
Out[19]: 

[
  0,
  null,
  null
]

In [20]: pc.index_in(arr, value_set=pa.array([1, None]), skip_null=False)
Out[20]: 

[
  0,
  null,
  1
]
{code}





[jira] [Created] (ARROW-10695) [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset

2020-11-23 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10695:
-

 Summary: [C++][Dataset] Allow to use a UUID in the 
basename_template when writing a dataset
 Key: ARROW-10695
 URL: https://issues.apache.org/jira/browse/ARROW-10695
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


Currently we allow the user to specify a {{basename_template}}, and this can 
include an {{"\{i\}"}} placeholder that is replaced with an automatically 
incremented integer (so each generated file written to a single partition is 
unique):

https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717

It _might_ be useful to also have the ability to use a UUID, to ensure the file 
is unique in general (not only for a single write) and to mimic the behaviour 
of the old {{write_to_dataset}} implementation.

For example, we could look for a {{"\{uuid\}"}} in the template string, and if 
present replace it for each file with a new UUID.
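
A sketch of the current behaviour and the proposed placeholder (assuming a {{table}} to write; the {{"\{uuid\}"}} form is the proposal, not existing behaviour):

{code:python}
import pyarrow.dataset as ds

# current: "{i}" is replaced by an automatically incremented integer
ds.write_dataset(table, "dataset_root", format="parquet",
                 basename_template="part-{i}.parquet")

# proposed: additionally recognize a "{uuid}" placeholder, replaced per file
ds.write_dataset(table, "dataset_root", format="parquet",
                 basename_template="part-{uuid}.parquet")
{code}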





[jira] [Created] (ARROW-10726) [Python] Reading multiple parquet files with different index column dtype (originating pandas) reads wrong data

2020-11-25 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10726:
-

 Summary: [Python] Reading multiple parquet files with different 
index column dtype (originating pandas) reads wrong data
 Key: ARROW-10726
 URL: https://issues.apache.org/jira/browse/ARROW-10726
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 3.0.0


See https://github.com/pandas-dev/pandas/issues/38058





[jira] [Created] (ARROW-10805) [C++] CSV reader: option to ignore trailing delimiters

2020-12-03 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10805:
-

 Summary: [C++] CSV reader: option to ignore trailing delimiters
 Key: ARROW-10805
 URL: https://issues.apache.org/jira/browse/ARROW-10805
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


It is not uncommon to have a CSV file that has "trailing" delimiters. 

For example, I ran into something like this:

{code}
1|2|3|
4|5|6|
{code}

where we currently detect 4 columns. If you want to properly read this in while 
passing the column names, you need to add a "dummy" column name for the 
non-existing last column (and specify the actual column names to 
{{include_columns}} to drop it again):

{code:python}
column_names = [...]

csv.read_csv(
    "path/to/file.csv",
    read_options=csv.ReadOptions(column_names=column_names + ["dummy"]),
    parse_options=csv.ParseOptions(delimiter="|"),
    convert_options=csv.ConvertOptions(include_columns=column_names),
)
{code}

Pandas has indirect support for it through the {{index_col=False}} option (see 
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#index-columns-and-trailing-delimiters):
 when the number of names is one shorter than the detected number of columns 
and this last column is all empty, it will be dropped.

Although the above provides a workaround, it might be nice to have 
out-of-the-box support for it. 
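
For comparison, a small sketch of the pandas behaviour mentioned above (made-up inline data):

{code:python}
import io
import pandas as pd

data = "a|b|c\n1|2|3|\n4|5|6|\n"

# with index_col=False, the empty trailing column is dropped instead of
# shifting the first column into the index
pd.read_csv(io.StringIO(data), sep="|", index_col=False)
{code}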





[jira] [Created] (ARROW-10845) [Python][CI] Add python CI build using numpy nightly

2020-12-08 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10845:
-

 Summary: [Python][CI] Add python CI build using numpy nightly
 Key: ARROW-10845
 URL: https://issues.apache.org/jira/browse/ARROW-10845
 Project: Apache Arrow
  Issue Type: Improvement
  Components: CI, Python
Reporter: Joris Van den Bossche
 Fix For: 3.0.0








[jira] [Created] (ARROW-10849) [Python] Handle numpy deprecation warnings for builtin type aliases

2020-12-08 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10849:
-

 Summary: [Python] Handle numpy deprecation warnings for builtin 
type aliases
 Key: ARROW-10849
 URL: https://issues.apache.org/jira/browse/ARROW-10849
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


See 
https://numpy.org/devdocs/release/1.20.0-notes.html#using-the-aliases-of-builtin-types-like-np-int-is-deprecated





[jira] [Commented] (ARROW-7063) [C++] Schema print method prints too much metadata

2019-11-05 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967438#comment-16967438
 ] 

Joris Van den Bossche commented on ARROW-7063:
--

I also ran into this recently when looking at the reports involving a huge 
number of columns (although that was in Python, and I see that we don't use the 
exact same code as the C++ pretty printer: 
https://github.com/apache/arrow/blob/e0cc9c43276840579a29332aca7348bbc415c051/python/pyarrow/types.pxi#L1245-L1264).
 

We should probably at least truncate the metadata. Personally I would prefer 
truncating them (so they don't get annoying) instead of not showing them at 
all, as IMO it is useful to see that the table has metadata.  
We could for example truncate each entry to a max of 50 characters (adding 
{{...}}) while still showing all entries (all keys).
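
A quick sketch of what such truncation could look like on the Python side (illustrative only, not the actual pretty-printer code):

{code:python}
def truncate_metadata(metadata, max_len=50):
    # keep all keys, but shorten each (bytes) value for display
    return {
        key: value if len(value) <= max_len else value[:max_len] + b"..."
        for key, value in metadata.items()
    }
{code}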

{quote}And IDK what to do with this {{ARROW:schema: }} business but it's 
clearly not readable as is.{quote}

It's the original Arrow schema in serialized format. An example with Python of 
how it is created when writing a Parquet file, and retrieved again:

{code}
In [33]: import pyarrow as pa

In [34]: table = pa.table(pd.DataFrame({'a': [1, 2, 3]}))

In [35]: table
Out[35]: 
pyarrow.Table
a: int64
metadata

{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_'
b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
b'ield_name": "a", "pandas_type": "int64", "numpy_type": "int64", '
b'"metadata": null}], "creator": {"library": "pyarrow", "version":'
b' "0.15.1.dev212+g4afe9f0ea"}, "pandas_version": "0.26.0.dev0+691'
b'.g157495696.dirty"}'}

In [36]: import pyarrow.parquet as pq

In [37]: pq.write_table(table, 'test.parquet')

In [39]: schema = pq.read_schema('test.parquet')

In [40]: schema
Out[40]: 
a: int64
metadata

{b'ARROW:schema': b'/4ACAAAQAAAKAA4ABgAFAAgACgABAwAQAAAKAAwA'
  b'AAAEAAgACggCAAAEAQwIAAwABAAIAAgI'
  b'EAYAAABwYW5kYXMAANMBAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJr'
  b'aW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAi'
  b'c3RvcCI6IDMsICJzdGVwIjogMX1dLCAiY29sdW1uX2luZGV4ZXMiOiBb'
  b'eyJuYW1lIjogbnVsbCwgImZpZWxkX25hbWUiOiBudWxsLCAicGFuZGFz'
  b'X3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIs'
  b'ICJtZXRhZGF0YSI6IHsiZW5jb2RpbmciOiAiVVRGLTgifX1dLCAiY29s'
  b'dW1ucyI6IFt7Im5hbWUiOiAiYSIsICJmaWVsZF9uYW1lIjogImEiLCAi'
  b'cGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2'
  b'NCIsICJtZXRhZGF0YSI6IG51bGx9XSwgImNyZWF0b3IiOiB7ImxpYnJh'
  b'cnkiOiAicHlhcnJvdyIsICJ2ZXJzaW9uIjogIjAuMTUuMS5kZXYyMTIr'
  b'ZzRhZmU5ZjBlYSJ9LCAicGFuZGFzX3ZlcnNpb24iOiAiMC4yNi4wLmRl'
  b'djArNjkxLmcxNTc0OTU2OTYuZGlydHkifQABFBAAFAAIAAYA'
  b'BwAMEAAQAAABAiQUBAAIAAwACAAHAAgA'
  b'AAABQAEAAABh',
 b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_'
b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
b'ield_name": "a", "pandas_type": "int64", "numpy_type":

[jira] [Created] (ARROW-7066) [Python] support returning ChunkedArray from __arrow_array__ ?

2019-11-05 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7066:


 Summary: [Python] support returning ChunkedArray from 
__arrow_array__ ?
 Key: ARROW-7066
 URL: https://issues.apache.org/jira/browse/ARROW-7066
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


The {{\_\_arrow_array\_\_}} protocol was added so that custom objects can 
define how they should be converted to a pyarrow Array (similar to numpy's 
{{\_\_array\_\_}}). This is then also used to support converting pandas 
DataFrames with columns using pandas' ExtensionArrays to a pyarrow Table (if 
the pandas ExtensionArray, such as nullable integer type, implements this 
{{\_\_arrow_array\_\_}} method).
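For context, a minimal sketch of the protocol (the class here is hypothetical, not 
fletcher's actual code):

{code:python}
import pyarrow as pa

class MyColumn:
    # toy container that knows how to convert itself to Arrow
    def __init__(self, values):
        self._values = values

    def __arrow_array__(self, type=None):
        # currently expected to return a pyarrow.Array
        return pa.array(self._values, type=type)

pa.array(MyColumn([1, 2, 3]))  # -> Int64Array of [1, 2, 3]
{code}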

This last use case could also be useful for fletcher 
(https://github.com/xhochy/fletcher/, a package that implements pandas 
ExtensionArrays that wrap pyarrow arrays, so they can be stored as is in a 
pandas DataFrame).  
However, fletcher stores ChunkedArrays in its ExtensionArray / the columns of a 
pandas DataFrame (to have a better mapping with a Table, where the columns also 
consist of chunked arrays), while we currently require that the return value of 
{{\_\_arrow_array\_\_}} is a pyarrow.Array.

So I was wondering: could we relax this constraint and also allow ChunkedArray 
as return value? 
However, this protocol is currently called in the {{pa.array(..)}} function, 
which probably should keep returning an Array (and not ChunkedArray in certain 
cases).

cc [~uwe]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7023) [Python] pa.array does not use "from_pandas" semantics for pd.Index

2019-11-05 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-7023.
--
Resolution: Fixed

Issue resolved by pull request 5753
 [https://github.com/apache/arrow/pull/5753]

> [Python] pa.array does not use "from_pandas" semantics for pd.Index
> ---
>
> Key: ARROW-7023
> URL: https://issues.apache.org/jira/browse/ARROW-7023
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> {code}
> In [15]: idx = pd.Index([1, 2, np.nan], dtype=object)
>
> In [16]: pa.array(idx)
> Out[16]: 
> 
> [
>   1,
>   2,
>   nan
> ]
>
> In [17]: pa.array(idx, from_pandas=True)
> Out[17]: 
> 
> [
>   1,
>   2,
>   null
> ]
>
> In [18]: pa.array(pd.Series(idx))
> Out[18]: 
> 
> [
>   1,
>   2,
>   null
> ]
> {code}
> We should probably handle Series and Index the same in this regard.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7068) [C++] Expose the offsets of a ListArray as a Int32Array

2019-11-05 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7068:


 Summary: [C++] Expose the offsets of a ListArray as a Int32Array
 Key: ARROW-7068
 URL: https://issues.apache.org/jira/browse/ARROW-7068
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


As follow-up on ARROW-7031 (https://github.com/apache/arrow/pull/5759), we can 
move this into C++ and use that implementation from Python.

 

Cf. [https://github.com/apache/arrow/pull/5759#discussion_r342244521], this 
could be a {{ListArray::value_offsets_array}}.
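As a rough illustration from the Python side (the accessor shown in the comment is 
hypothetical, mirroring the proposed C++ name):

{code:python}
import pyarrow as pa

offsets = pa.array([0, 2, 2, 5], type=pa.int32())
values = pa.array([1, 2, 3, 4, 5])
arr = pa.ListArray.from_arrays(offsets, values)

# proposed: expose the underlying offsets buffer as an Int32Array, e.g.
# arr.value_offsets_array  ->  [0, 2, 2, 5]
{code}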



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7071) [Python] Add Array convenience method to create "masked" view with different validity bitmap

2019-11-06 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968345#comment-16968345
 ] 

Joris Van den Bossche commented on ARROW-7071:
--

> NB: I'm not sure what kind of pitfalls there might be if replacing an 
> existing validity bitmap and exposing some previously-null values

I would say this is the responsibility of the user then? 
What could happen? Are there potentially cases where interpreting the memory of 
a previously-null value as a value leads to segfaults? Like if you would do:

{code}
In [62]: a = pa.array([1, None, 3])

In [63]: np.frombuffer(a.buffers()[1], dtype="int64")
Out[63]: array([1, 0, 3])
{code}

> [Python] Add Array convenience method to create "masked" view with different 
> validity bitmap
> 
>
> Key: ARROW-7071
> URL: https://issues.apache.org/jira/browse/ARROW-7071
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> NB: I'm not sure what kind of pitfalls there might be if replacing an 
> existing validity bitmap and exposing some previously-null values



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7071) [Python] Add Array convenience method to create "masked" view with different validity bitmap

2019-11-06 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968354#comment-16968354
 ] 

Joris Van den Bossche commented on ARROW-7071:
--

Now, I think the main question is: what API could we offer for this?

* A method on Array? Something like {{array.set_validity_bitmap(..)}} or 
{{array.set_null_bitmap(..)}} (but not sure if it needs to be that clearly 
exposed)
* A settable attribute like {{array.null_bitmap}}
* A function to create a new array from a given array + bitmap? This could be 
similar to {{Array.from_buffers}}, but then a bit more convenient to use (as 
currently you can already use that to achieve this purpose)
* Alternative could be to expand {{pa.array(values, mask=[..])}} to accept a 
pyarrow array as values, and then use the {{mask}} keyword to specify the nulls 
as a boolean mask (although the current behaviour here is to have the final 
bitmap be a combination of nulls in the values and the mask, so this is not a 
way to override the bitmap, but maybe that's actually good)

A way to avoid the issue of "previously-null values" could also be to only 
allow setting the bitmap if there was not yet one before.

That would be enough for my original use case for this, where I want to create 
a StructArray from two pyarrow arrays, but also give it a null bitmap:

{code}
pa.StructArray.from_arrays([pa.array([1, 2, 3]), pa.array([2, 3, 4])], 
names=['a', 'b'])
{code}

For this very specific case, an option could also be to be able to pass a 
bitmap or mask keyword to {{pa.StructArray.from_arrays}}, but that's of course 
not a general solution for other types.
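For reference, a rough sketch of the {{Array.from_buffers}} route mentioned in the 
third bullet above (assuming int64 data and a boolean mask where True marks valid 
slots):

{code:python}
import pyarrow as pa

values = pa.array([1, 2, 3], type=pa.int64())

# the value buffer of a boolean array is already a bit-packed bitmap (True = valid)
mask = pa.array([True, False, True])
validity = mask.buffers()[1]

masked = pa.Array.from_buffers(pa.int64(), 3, [validity, values.buffers()[1]])
# masked  ->  [1, null, 3]
{code}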

> [Python] Add Array convenience method to create "masked" view with different 
> validity bitmap
> 
>
> Key: ARROW-7071
> URL: https://issues.apache.org/jira/browse/ARROW-7071
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> NB: I'm not sure what kind of pitfalls there might be if replacing an 
> existing validity bitmap and exposing some previously-null values



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent

2019-11-06 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968370#comment-16968370
 ] 

Joris Van den Bossche commented on ARROW-6820:
--

To see the description in the (old) docs, this link can be used: 
https://github.com/apache/arrow/blob/apache-arrow-0.14.0/docs/source/format/Layout.rst#map-type
  

The link above to https://arrow.apache.org/docs/format/Layout.html#map-type no 
longer works, and a similar section is not available in 
https://arrow.apache.org/docs/format/Columnar.html. I suppose it was removed in 
the format docs refactor (ARROW-6820) because the map type is considered a logical 
type and not a physical type?

> [C++] [Doc] [Format] Map specification and implementation inconsistent
> --
>
> Key: ARROW-6820
> URL: https://issues.apache.org/jira/browse/ARROW-6820
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Format
>Reporter: Antoine Pitrou
>Priority: Blocker
> Fix For: 1.0.0
>
>
> In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is 
> specified as having a child field "pairs", itself with children "keys" and 
> "items".
> In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map 
> type is specified as having a child field "entry", itself with children "key" 
> and "value".
> In the C++ implementation, a map type has a child field "entries", itself 
> with children "key" and "value".
> In the Java implementation, a map vector also has a child field "entries", 
> itself with children "key" and "value" (by default).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent

2019-11-06 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968373#comment-16968373
 ] 

Joris Van den Bossche commented on ARROW-6820:
--

Another inconsistency is that Schema.fbs speaks about "entry", not "entries"

> [C++] [Doc] [Format] Map specification and implementation inconsistent
> --
>
> Key: ARROW-6820
> URL: https://issues.apache.org/jira/browse/ARROW-6820
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Format
>Reporter: Antoine Pitrou
>Priority: Blocker
> Fix For: 1.0.0
>
>
> In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is 
> specified as having a child field "pairs", itself with children "keys" and 
> "items".
> In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map 
> type is specified as having a child field "entry", itself with children "key" 
> and "value".
> In the C++ implementation, a map type has a child field "entries", itself 
> with children "key" and "value".
> In the Java implementation, a map vector also has a child field "entries", 
> itself with children "key" and "value" (by default).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7076) `pip install pyarrow` with python 3.8 fail with message : Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly

2019-11-06 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968398#comment-16968398
 ] 

Joris Van den Bossche commented on ARROW-7076:
--

There are not yet binary wheels available for Python 3.8, so pip is trying to 
build from source. It then appears that something goes wrong with 
installing/finding numpy, which seems similar to the error reported in 
ARROW-5210. As I mentioned there, this is an error in the pyproject.toml: we 
do not list numpy as a build dependency (pip creates a new isolated environment 
with all build dependencies, so installing numpy beforehand does not solve it).

Now, even if pyproject.toml listed this correctly, it is quite likely that 
installing from source with just {{pip install pyarrow}} is not going to work, 
as there are a lot of other (non-Python) dependencies that you would need to 
ensure are available. If you do want to install from source, see 
https://arrow.apache.org/docs/developers/python.html#python-development for 
detailed instructions; otherwise you will need to wait until there are wheels 
available or use Python 3.7 instead of 3.8.

> `pip install pyarrow` with python 3.8 fail with message : Could not build 
> wheels for pyarrow which use PEP 517 and cannot be installed directly
> ---
>
> Key: ARROW-7076
> URL: https://issues.apache.org/jira/browse/ARROW-7076
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
> Environment: Ubuntu 19.10 / Python 3.8.0
>Reporter: Fabien
>Priority: Minor
>
> When I install pyarrow in python 3.7.5 with `pip install pyarrow` it works.
> However with python 3.8.0 it fails with the following error :
> {noformat}
> 14:06 $ pip install pyarrow
> Collecting pyarrow
>  Using cached 
> https://files.pythonhosted.org/packages/e0/e6/d14b4a2b54ef065b1a2c576537abe805c1af0c94caef70d365e2d78fc528/pyarrow-0.15.1.tar.gz
>  Installing build dependencies ... done
>  Getting requirements to build wheel ... done
>  Preparing wheel metadata ... done
> Collecting numpy>=1.14
>  Using cached 
> https://files.pythonhosted.org/packages/3a/8f/f9ee25c0ae608f86180c26a1e35fe7ea9d71b473ea7f54db20759ba2745e/numpy-1.17.3-cp38-cp38-manylinux1_x86_64.whl
> Collecting six>=1.0.0
>  Using cached 
> https://files.pythonhosted.org/packages/65/26/32b8464df2a97e6dd1b656ed26b2c194606c16fe163c695a992b36c11cdf/six-1.13.0-py2.py3-none-any.whl
> Building wheels for collected packages: pyarrow
>  Building wheel for pyarrow (PEP 517) ... error
>  ERROR: Command errored out with exit status 1:
>  command: /home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/bin/python3.8 
> /home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py
>  build_wheel /tmp/tmp4gpyu82j
>  cwd: /tmp/pip-install-cj5ucedq/pyarrow
>  Complete output (490 lines):
>  running bdist_wheel
>  running build
>  running build_py
>  creating build
>  creating build/lib.linux-x86_64-3.8
>  creating build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/flight.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/orc.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/jvm.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/util.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/pandas_compat.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/cuda.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/filesystem.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/json.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/feather.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/serialization.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/ipc.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/parquet.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/_generated_version.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/benchmark.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/types.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/hdfs.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/fs.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/plasma.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/csv.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/compat.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/__init__.py -> build/lib.linux-x86_64-3.8/pyarrow
>  creating build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_strategies.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_array.py -> 
> build/lib.linux-x8

[jira] [Comment Edited] (ARROW-7076) `pip install pyarrow` with python 3.8 fail with message : Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly

2019-11-06 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968398#comment-16968398
 ] 

Joris Van den Bossche edited comment on ARROW-7076 at 11/6/19 2:31 PM:
---

There are not yet binary wheels available for Python 3.8, so pip is trying to 
build from source. It then appears that something goes wrong with 
installing/finding numpy, which seems similar to the error reported in 
ARROW-5210. As I mentioned there, this is an error in the pyproject.toml: we 
do not list numpy as a build dependency (pip creates a new isolated environment 
with all build dependencies, so installing numpy beforehand does not solve it).

Now, even if pyproject.toml listed this correctly, it is quite likely that 
installing from source with just {{pip install pyarrow}} is not going to work, 
as there are a lot of other (non-Python) dependencies that you would need to 
ensure are available. If you do want to install from source, see 
https://arrow.apache.org/docs/developers/python.html#python-development for 
detailed instructions; otherwise you will need to wait until there are wheels 
available, use Python 3.7 instead of 3.8, or use conda instead (conda-forge 
already has binary packages of pyarrow for Python 3.8).


was (Author: jorisvandenbossche):
There are not yet binary wheels available for Python 3.8, so therefore pip is 
trying to build from source.  And then it appears something goes wrong with 
installing/finding numpy, which seems similar to the error reported in 
ARROW-5210. As I mentioned there, this is an error in the pyproject.toml that 
we do not list numpy as a build dependency (pip will create a new environment 
with all build dependencies, therefore installing numpy before hand does not 
solve it).

Now, even if the pyproject.toml would correctly list this, it is quite likely 
that installing from source with just {{pip install pyarrow}} is not going to 
work, as there are a lot of other (non-python) dependencies that you would need 
to ensure are available. If you do want to install from source, see 
https://arrow.apache.org/docs/developers/python.html#python-development for 
detailed instructions), otherwise you will need to wait until there are wheels 
available or use Python 3.7 instead of 3.8.

> `pip install pyarrow` with python 3.8 fail with message : Could not build 
> wheels for pyarrow which use PEP 517 and cannot be installed directly
> ---
>
> Key: ARROW-7076
> URL: https://issues.apache.org/jira/browse/ARROW-7076
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
> Environment: Ubuntu 19.10 / Python 3.8.0
>Reporter: Fabien
>Priority: Minor
>
> When I install pyarrow in python 3.7.5 with `pip install pyarrow` it works.
> However with python 3.8.0 it fails with the following error :
> {noformat}
> 14:06 $ pip install pyarrow
> Collecting pyarrow
>  Using cached 
> https://files.pythonhosted.org/packages/e0/e6/d14b4a2b54ef065b1a2c576537abe805c1af0c94caef70d365e2d78fc528/pyarrow-0.15.1.tar.gz
>  Installing build dependencies ... done
>  Getting requirements to build wheel ... done
>  Preparing wheel metadata ... done
> Collecting numpy>=1.14
>  Using cached 
> https://files.pythonhosted.org/packages/3a/8f/f9ee25c0ae608f86180c26a1e35fe7ea9d71b473ea7f54db20759ba2745e/numpy-1.17.3-cp38-cp38-manylinux1_x86_64.whl
> Collecting six>=1.0.0
>  Using cached 
> https://files.pythonhosted.org/packages/65/26/32b8464df2a97e6dd1b656ed26b2c194606c16fe163c695a992b36c11cdf/six-1.13.0-py2.py3-none-any.whl
> Building wheels for collected packages: pyarrow
>  Building wheel for pyarrow (PEP 517) ... error
>  ERROR: Command errored out with exit status 1:
>  command: /home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/bin/python3.8 
> /home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py
>  build_wheel /tmp/tmp4gpyu82j
>  cwd: /tmp/pip-install-cj5ucedq/pyarrow
>  Complete output (490 lines):
>  running bdist_wheel
>  running build
>  running build_py
>  creating build
>  creating build/lib.linux-x86_64-3.8
>  creating build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/flight.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/orc.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/jvm.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/util.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/pandas_compat.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/cuda.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/filesystem.py -> build/lib.linux-x86_64-3.8/py

[jira] [Commented] (ARROW-7076) `pip install pyarrow` with python 3.8 fail with message : Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly

2019-11-06 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968417#comment-16968417
 ] 

Joris Van den Bossche commented on ARROW-7076:
--

See ARROW-6920 for wheels for Python 3.8 (I suppose they will only get added 
for the latest pyarrow release, 0.15.1)

> `pip install pyarrow` with python 3.8 fail with message : Could not build 
> wheels for pyarrow which use PEP 517 and cannot be installed directly
> ---
>
> Key: ARROW-7076
> URL: https://issues.apache.org/jira/browse/ARROW-7076
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
> Environment: Ubuntu 19.10 / Python 3.8.0
>Reporter: Fabien
>Priority: Minor
>
> When I install pyarrow in python 3.7.5 with `pip install pyarrow` it works.
> However with python 3.8.0 it fails with the following error :
> {noformat}
> 14:06 $ pip install pyarrow
> Collecting pyarrow
>  Using cached 
> https://files.pythonhosted.org/packages/e0/e6/d14b4a2b54ef065b1a2c576537abe805c1af0c94caef70d365e2d78fc528/pyarrow-0.15.1.tar.gz
>  Installing build dependencies ... done
>  Getting requirements to build wheel ... done
>  Preparing wheel metadata ... done
> Collecting numpy>=1.14
>  Using cached 
> https://files.pythonhosted.org/packages/3a/8f/f9ee25c0ae608f86180c26a1e35fe7ea9d71b473ea7f54db20759ba2745e/numpy-1.17.3-cp38-cp38-manylinux1_x86_64.whl
> Collecting six>=1.0.0
>  Using cached 
> https://files.pythonhosted.org/packages/65/26/32b8464df2a97e6dd1b656ed26b2c194606c16fe163c695a992b36c11cdf/six-1.13.0-py2.py3-none-any.whl
> Building wheels for collected packages: pyarrow
>  Building wheel for pyarrow (PEP 517) ... error
>  ERROR: Command errored out with exit status 1:
>  command: /home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/bin/python3.8 
> /home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py
>  build_wheel /tmp/tmp4gpyu82j
>  cwd: /tmp/pip-install-cj5ucedq/pyarrow
>  Complete output (490 lines):
>  running bdist_wheel
>  running build
>  running build_py
>  creating build
>  creating build/lib.linux-x86_64-3.8
>  creating build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/flight.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/orc.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/jvm.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/util.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/pandas_compat.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/cuda.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/filesystem.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/json.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/feather.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/serialization.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/ipc.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/parquet.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/_generated_version.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/benchmark.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/types.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/hdfs.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/fs.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/plasma.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/csv.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/compat.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/__init__.py -> build/lib.linux-x86_64-3.8/pyarrow
>  creating build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_strategies.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_array.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_tensor.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_json.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_cython.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_deprecations.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/conftest.py -> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_memory.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_io.py -> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/pandas_examples.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_compute.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/util.py -> build/lib.linux-x86_64-3.8/py

[jira] [Assigned] (ARROW-3444) [Python] Table.nbytes attribute

2019-11-08 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-3444:


Assignee: Joris Van den Bossche

> [Python] Table.nbytes attribute
> ---
>
> Key: ARROW-3444
> URL: https://issues.apache.org/jira/browse/ARROW-3444
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Dave Hirschfeld
>Assignee: Joris Van den Bossche
>Priority: Minor
> Fix For: 1.0.0
>
>
> As it says in the title, I think this would be a very handy attribute to have 
> available in Python. You can get it by converting to pandas and using 
> `DataFrame.nbytes` but this is wasteful of both time and memory so it would 
> be good to have this information on the `pyarrow.Table` object itself.
> This could be implemented using the 
> [__sizeof__|https://docs.python.org/3/library/sys.html#sys.getsizeof] protocol



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7071) [Python] Add Array convenience method to create "masked" view with different validity bitmap

2019-11-11 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972151#comment-16972151
 ] 

Joris Van den Bossche commented on ARROW-7071:
--

Would it then be OK to say that "it is the responsibility of the user to not 
expose undefined values" ? (so that you are only adding nulls) Or do we need to 
guard for this?



> [Python] Add Array convenience method to create "masked" view with different 
> validity bitmap
> 
>
> Key: ARROW-7071
> URL: https://issues.apache.org/jira/browse/ARROW-7071
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> NB: I'm not sure what kind of pitfalls there might be if replacing an 
> existing validity bitmap and exposing some previously-null values



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7066) [Python] support returning ChunkedArray from __arrow_array__ ?

2019-11-11 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972155#comment-16972155
 ] 

Joris Van den Bossche commented on ARROW-7066:
--

I still don't fully like returning a chunked array from {{pa.array}}, but I also 
don't see an easy alternative to otherwise get the roundtrip working for e.g. 
fletcher, which uses chunked arrays (an alternative would be to have an "internal" 
version of {{pa.array(..)}} that allows this and keep the public one strict, 
but that is also rather ugly).

I will add a documentation update to the current open PR.

> [Python] support returning ChunkedArray from __arrow_array__ ?
> --
>
> Key: ARROW-7066
> URL: https://issues.apache.org/jira/browse/ARROW-7066
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The {{\_\_arrow_array\_\_}} protocol was added so that custom objects can 
> define how they should be converted to a pyarrow Array (similar to numpy's 
> {{\_\_array\_\_}}). This is then also used to support converting pandas 
> DataFrames with columns using pandas' ExtensionArrays to a pyarrow Table (if 
> the pandas ExtensionArray, such as nullable integer type, implements this 
> {{\_\_arrow_array\_\_}} method).
> This last use case could also be useful for fletcher 
> (https://github.com/xhochy/fletcher/, a package that implements pandas 
> ExtensionArrays that wrap pyarrow arrays, so they can be stored as is in a 
> pandas DataFrame).  
> However, fletcher stores ChunkedArrays in its ExtensionArray / the columns of a 
> pandas DataFrame (to have a better mapping with a Table, where the columns 
> also consist of chunked arrays), while we currently require that the return 
> value of {{\_\_arrow_array\_\_}} is a pyarrow.Array.
> So I was wondering: could we relax this constraint and also allow 
> ChunkedArray as return value? 
> However, this protocol is currently called in the {{pa.array(..)}} function, 
> which probably should keep returning an Array (and not ChunkedArray in 
> certain cases).
> cc [~uwe]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent

2019-11-13 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973192#comment-16973192
 ] 

Joris Van den Bossche commented on ARROW-6820:
--

If both C++ and Java use "entries", we can also update the format spec? (since 
it is not a required name and only a recommendation, I would think it is not 
really a "format change" to update that description?)

> [C++] [Doc] [Format] Map specification and implementation inconsistent
> --
>
> Key: ARROW-6820
> URL: https://issues.apache.org/jira/browse/ARROW-6820
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Format
>Reporter: Antoine Pitrou
>Priority: Blocker
> Fix For: 1.0.0
>
>
> In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is 
> specified as having a child field "pairs", itself with children "keys" and 
> "items".
> In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map 
> type is specified as having a child field "entry", itself with children "key" 
> and "value".
> In the C++ implementation, a map type has a child field "entries", itself 
> with children "key" and "value".
> In the Java implementation, a map vector also has a child field "entries", 
> itself with children "key" and "value" (by default).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7154) [C++] Build error when building tests but not with snappy

2019-11-13 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7154:


 Summary: [C++] Build error when building tests but not with snappy
 Key: ARROW-7154
 URL: https://issues.apache.org/jira/browse/ARROW-7154
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


Since the docker-compose PR landed, I am having build errors like:
{code:java}
[361/376] Linking CXX executable debug/arrow-python-test
FAILED: debug/arrow-python-test
: && /home/joris/miniconda3/envs/arrow-dev/bin/ccache 
/home/joris/miniconda3/envs/arrow-dev/bin/x86_64-conda_cos6-linux-gnu-c++  
-Wno-noexcept-type -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 
-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong 
-fno-plt -O2 -ffunction-sections -pipe -fdiagnostics-color=always -ggdb -O0  
-Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror 
-msse4.2  -g  -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now 
-Wl,--disable-new-dtags -Wl,--gc-sections   -rdynamic 
src/arrow/python/CMakeFiles/arrow-python-test.dir/python_test.cc.o  -o 
debug/arrow-python-test  
-Wl,-rpath,/home/joris/scipy/repos/arrow/cpp/build/debug:/home/joris/miniconda3/envs/arrow-dev/lib
 debug/libarrow_python_test_main.a debug/libarrow_python.so.100.0.0 
debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 
/home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so -lpthread -lpthread 
-ldl  -lutil -lrt -ldl 
/home/joris/miniconda3/envs/arrow-dev/lib/libdouble-conversion.a 
/home/joris/miniconda3/envs/arrow-dev/lib/libglog.so 
jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -lrt 
/home/joris/miniconda3/envs/arrow-dev/lib/libgtest.so -pthread && :
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
 warning: libboost_filesystem.so.1.68.0, needed by debug/libarrow.so.100.0.0, 
not found (try using -rpath or -rpath-link)
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
 warning: libboost_system.so.1.68.0, needed by debug/libarrow.so.100.0.0, not 
found (try using -rpath or -rpath-link)
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
 debug/libarrow.so.100.0.0: undefined reference to 
`boost::system::detail::generic_category_ncx()'
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
 debug/libarrow.so.100.0.0: undefined reference to 
`boost::filesystem::path::operator/=(boost::filesystem::path const&)'
collect2: error: ld returned 1 exit status
{code}
which contains warnings like "warning: libboost_filesystem.so.1.68.0, needed by 
debug/libarrow.so.100.0.0, not found" (although that library is certainly present).

The error is triggered by having {{-DARROW_BUILD_TESTS=ON}}. If that is set to 
OFF, it works fine.

It also seems to be related to this specific change in the docker compose PR:
{code:java}
diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
index c80ac3310..3b3c9eb8f 100644
--- a/cpp/CMakeLists.txt
+++ b/cpp/CMakeLists.txt
@@ -266,6 +266,15 @@ endif(UNIX)
 # Set up various options
 #

-if(ARROW_BUILD_TESTS OR ARROW_BUILD_BENCHMARKS)
-  # Currently the compression tests require at least these libraries; bz2 and
-  # zstd are optional. See ARROW-3984
-  set(ARROW_WITH_BROTLI ON)
-  set(ARROW_WITH_LZ4 ON)
-  set(ARROW_WITH_SNAPPY ON)
-  set(ARROW_WITH_ZLIB ON)
-endif()
-
 if(ARROW_BUILD_TESTS OR ARROW_BUILD_INTEGRATION)
   set(ARROW_JSON ON)
 endif()
{code}

If I add that back, the build works.

With only `set(ARROW_WITH_BROTLI ON)`, it still fails
 With only `set(ARROW_WITH_LZ4 ON)`, it also fails but with an error about 
liblz4 instead of libboost (but also liblz4 is actually present)
 With only `set(ARROW_WITH_SNAPPY ON)`, it works
 With only `set(ARROW_WITH_ZLIB ON)`, it also fails but with an error about 
libz.so.1 not found

So it seems that the absence of snappy causes others to fail.

In the recommended build settings in the development docs 
([https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst#build-and-test]), 
the compression libraries are enabled. But I was still building without them 
(stemming from the time they were enabled by default). So I was using:

{code}
cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME -GNinja \
 -DCMAKE_INSTALL_LIBDIR=lib \
 -DARROW_PARQUET=ON \
 -DARROW_PYTHON=ON \
 -DARROW_BUILD_TESTS=ON \
 ..
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7154) [C++] Build error when building tests but not with snappy

2019-11-13 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7154:
-
Description: 
Since the docker-compose PR landed, I am having build errors like:
{code:java}
[361/376] Linking CXX executable debug/arrow-python-test
FAILED: debug/arrow-python-test
: && /home/joris/miniconda3/envs/arrow-dev/bin/ccache 
/home/joris/miniconda3/envs/arrow-dev/bin/x86_64-conda_cos6-linux-gnu-c++  
-Wno-noexcept-type -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 
-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong 
-fno-plt -O2 -ffunction-sections -pipe -fdiagnostics-color=always -ggdb -O0  
-Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror 
-msse4.2  -g  -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now 
-Wl,--disable-new-dtags -Wl,--gc-sections   -rdynamic 
src/arrow/python/CMakeFiles/arrow-python-test.dir/python_test.cc.o  -o 
debug/arrow-python-test  
-Wl,-rpath,/home/joris/scipy/repos/arrow/cpp/build/debug:/home/joris/miniconda3/envs/arrow-dev/lib
 debug/libarrow_python_test_main.a debug/libarrow_python.so.100.0.0 
debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 
/home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so -lpthread -lpthread 
-ldl  -lutil -lrt -ldl 
/home/joris/miniconda3/envs/arrow-dev/lib/libdouble-conversion.a 
/home/joris/miniconda3/envs/arrow-dev/lib/libglog.so 
jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -lrt 
/home/joris/miniconda3/envs/arrow-dev/lib/libgtest.so -pthread && :
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
 warning: libboost_filesystem.so.1.68.0, needed by debug/libarrow.so.100.0.0, 
not found (try using -rpath or -rpath-link)
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
 warning: libboost_system.so.1.68.0, needed by debug/libarrow.so.100.0.0, not 
found (try using -rpath or -rpath-link)
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
 debug/libarrow.so.100.0.0: undefined reference to 
`boost::system::detail::generic_category_ncx()'
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
 debug/libarrow.so.100.0.0: undefined reference to 
`boost::filesystem::path::operator/=(boost::filesystem::path const&)'
collect2: error: ld returned 1 exit status
{code}
which contains warnings like "warning: libboost_filesystem.so.1.68.0, needed by 
debug/libarrow.so.100.0.0, not found" (although this is certainly present).

The error is triggered by having {{-DARROW_BUILD_TESTS=ON}}. If that is set to 
OFF, it works fine.

It also seems to be related to this specific change in the docker compose PR:
{code:java}
diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
index c80ac3310..3b3c9eb8f 100644
--- a/cpp/CMakeLists.txt
+++ b/cpp/CMakeLists.txt
@@ -266,6 +266,15 @@ endif(UNIX)
 # Set up various options
 #

-if(ARROW_BUILD_TESTS OR ARROW_BUILD_BENCHMARKS)
-  # Currently the compression tests require at least these libraries; bz2 and
-  # zstd are optional. See ARROW-3984
-  set(ARROW_WITH_BROTLI ON)
-  set(ARROW_WITH_LZ4 ON)
-  set(ARROW_WITH_SNAPPY ON)
-  set(ARROW_WITH_ZLIB ON)
-endif()
-
 if(ARROW_BUILD_TESTS OR ARROW_BUILD_INTEGRATION)
   set(ARROW_JSON ON)
 endif()
{code}

If I add that back, the build works.

With only `set(ARROW_WITH_BROTLI ON)`, it still fails
 With only `set(ARROW_WITH_LZ4 ON)`, it also fails but with an error about 
liblz4 instead of libboost (but also liblz4 is actually present)
 With only `set(ARROW_WITH_SNAPPY ON)`, it works
 With only `set(ARROW_WITH_ZLIB ON)`, it also fails but with an error about 
libz.so.1 not found

With both `set(ARROW_WITH_SNAPPY ON)` and `set(ARROW_WITH_ZLIB ON)`, it also 
works.  So it seems that the absence of snappy causes others to fail.

In the recommended build settings in the development docs 
([https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst#build-and-test]), 
the compression libraries are enabled. But I was still building without them 
(stemming from the time they were enabled by default). So I was using:

{code}
cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME -GNinja \
 -DCMAKE_INSTALL_LIBDIR=lib \
 -DARROW_PARQUET=ON \
 -DARROW_PYTHON=ON \
 -DARROW_BUILD_TESTS=ON \
 ..
{code}

  was:
Since the docker-compose PR landed, I am having build errors like:
{code:java}
[361/376] Linking CXX executable debug/arrow-python-test
FAILED: debug/arrow-python-test
: && /home/joris/miniconda3/envs/arrow-dev/bin/ccache 
/home/joris/miniconda3/envs/arrow-dev/bin/x86_64-conda_cos6-linux-gnu-c++  
-Wno-noexcept-type -fvisibility-inlines-hidde

[jira] [Commented] (ARROW-7154) [C++] Build error when building tests but not with snappy

2019-11-13 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973461#comment-16973461
 ] 

Joris Van den Bossche commented on ARROW-7154:
--

After creating a new conda env from scratch (which now has boost 1.70 instead of 
the 1.68 in my old env, not sure if that is relevant), the problem also went 
away. So it might be OK to close this issue.

> [C++] Build error when building tests but not with snappy
> -
>
> Key: ARROW-7154
> URL: https://issues.apache.org/jira/browse/ARROW-7154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Since the docker-compose PR landed, I am having build errors like:
> {code:java}
> [361/376] Linking CXX executable debug/arrow-python-test
> FAILED: debug/arrow-python-test
> : && /home/joris/miniconda3/envs/arrow-dev/bin/ccache 
> /home/joris/miniconda3/envs/arrow-dev/bin/x86_64-conda_cos6-linux-gnu-c++  
> -Wno-noexcept-type -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 
> -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong 
> -fno-plt -O2 -ffunction-sections -pipe -fdiagnostics-color=always -ggdb -O0  
> -Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror 
> -msse4.2  -g  -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro 
> -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections   -rdynamic 
> src/arrow/python/CMakeFiles/arrow-python-test.dir/python_test.cc.o  -o 
> debug/arrow-python-test  
> -Wl,-rpath,/home/joris/scipy/repos/arrow/cpp/build/debug:/home/joris/miniconda3/envs/arrow-dev/lib
>  debug/libarrow_python_test_main.a debug/libarrow_python.so.100.0.0 
> debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 
> /home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so -lpthread 
> -lpthread -ldl  -lutil -lrt -ldl 
> /home/joris/miniconda3/envs/arrow-dev/lib/libdouble-conversion.a 
> /home/joris/miniconda3/envs/arrow-dev/lib/libglog.so 
> jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -lrt 
> /home/joris/miniconda3/envs/arrow-dev/lib/libgtest.so -pthread && :
> /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
>  warning: libboost_filesystem.so.1.68.0, needed by debug/libarrow.so.100.0.0, 
> not found (try using -rpath or -rpath-link)
> /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
>  warning: libboost_system.so.1.68.0, needed by debug/libarrow.so.100.0.0, not 
> found (try using -rpath or -rpath-link)
> /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
>  debug/libarrow.so.100.0.0: undefined reference to 
> `boost::system::detail::generic_category_ncx()'
> /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
>  debug/libarrow.so.100.0.0: undefined reference to 
> `boost::filesystem::path::operator/=(boost::filesystem::path const&)'
> collect2: error: ld returned 1 exit status
> {code}
> which contains warnings like "warning: libboost_filesystem.so.1.68.0, needed 
> by debug/libarrow.so.100.0.0, not found" (although this is certainly present).
> The error is triggered by having {{-DARROW_BUILD_TESTS=ON}}. If that is set 
> to OFF, it works fine.
> It also seems to be related to this specific change in the docker compose PR:
> {code:java}
> diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
> index c80ac3310..3b3c9eb8f 100644
> --- a/cpp/CMakeLists.txt
> +++ b/cpp/CMakeLists.txt
> @@ -266,6 +266,15 @@ endif(UNIX)
>  # Set up various options
>  #
> -if(ARROW_BUILD_TESTS OR ARROW_BUILD_BENCHMARKS)
> -  # Currently the compression tests require at least these libraries; bz2 and
> -  # zstd are optional. See ARROW-3984
> -  set(ARROW_WITH_BROTLI ON)
> -  set(ARROW_WITH_LZ4 ON)
> -  set(ARROW_WITH_SNAPPY ON)
> -  set(ARROW_WITH_ZLIB ON)
> -endif()
> -
>  if(ARROW_BUILD_TESTS OR ARROW_BUILD_INTEGRATION)
>set(ARROW_JSON ON)
>  endif()
> {code}
> If I add that back, the build works.
> With only `set(ARROW_WITH_BROTLI ON)`, it still fails
>  With only `set(ARROW_WITH_LZ4 ON)`, it also fails but with an error about 
> liblz4 instead of libboost (but also liblz4 is actually present)
>  With only `set(ARROW_WITH_SNAPPY ON)`, it works
>  With only `set(ARROW_WITH_ZLIB ON)`, it also fails but with an error about 
> libz.so.1 not found
> With both `set(ARROW_WITH_SNAPPY ON)` and `set(ARROW_WITH_ZLIB ON)`, it also 
> works.  So it seems that the absence of snappy causes others to fail.
> In the recommended build settings in the developme

[jira] [Created] (ARROW-7167) [CI][Python] Add nightly tests for older pandas versions to Github Actions

2019-11-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7167:


 Summary: [CI][Python] Add nightly tests for older pandas versions 
to Github Actions
 Key: ARROW-7167
 URL: https://issues.apache.org/jira/browse/ARROW-7167
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7167) [CI][Python] Add nightly tests for older pandas versions to Github Actions

2019-11-14 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-7167:


Assignee: Joris Van den Bossche

> [CI][Python] Add nightly tests for older pandas versions to Github Actions
> --
>
> Key: ARROW-7167
> URL: https://issues.apache.org/jira/browse/ARROW-7167
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7168) [Python] pa.array() doesn't respect provided dictionary type with all NaNs

2019-11-14 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7168:
-
Summary: [Python] pa.array() doesn't respect provided dictionary type with 
all NaNs  (was: pa.array() doesn't respect provided dictionary type with all 
NaNs)

> [Python] pa.array() doesn't respect provided dictionary type with all NaNs
> --
>
> Key: ARROW-7168
> URL: https://issues.apache.org/jira/browse/ARROW-7168
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.1
>Reporter: Thomas Buhrmann
>Priority: Major
>
> This might be related to ARROW-6548 and others dealing with all NaN columns. 
> When creating a dictionary array, even when fully specifying the desired 
> type, this type is not respected when the data contains only NaNs:
> {code:python}
> # This may look a little artificial but easily occurs when processing 
> categorical data in batches and a particular batch contains only NaNs
> ser = pd.Series([None, None]).astype('object').astype('category')
> typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), 
> ordered=False)
> pa.array(ser, type=typ).type
> {code}
> results in
> {noformat}
> >> DictionaryType(dictionary)
> {noformat}
> which means that one cannot e.g. serialize batches of categoricals if the 
> possibility of all-NaN batches exists, even when trying to enforce that each 
> batch has the same schema (because the schema is not respected).
> I understand that inferring the type in this case would be difficult, but I'd 
> imagine that a fully specified type should be respected in this case?
> In the meantime, is there a workaround to manually create a dictionary array 
> of the desired type containing only NaNs?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7168) [Python] pa.array() doesn't respect specified dictionary type

2019-11-14 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7168:
-
Summary: [Python] pa.array() doesn't respect specified dictionary type  
(was: [Python] pa.array() doesn't respect provided dictionary type with all 
NaNs)

> [Python] pa.array() doesn't respect specified dictionary type
> -
>
> Key: ARROW-7168
> URL: https://issues.apache.org/jira/browse/ARROW-7168
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.1
>Reporter: Thomas Buhrmann
>Priority: Major
>
> This might be related to ARROW-6548 and others dealing with all NaN columns. 
> When creating a dictionary array, even when fully specifying the desired 
> type, this type is not respected when the data contains only NaNs:
> {code:python}
> # This may look a little artificial but easily occurs when processing 
> categorical data in batches and a particular batch contains only NaNs
> ser = pd.Series([None, None]).astype('object').astype('category')
> typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), 
> ordered=False)
> pa.array(ser, type=typ).type
> {code}
> results in
> {noformat}
> >> DictionaryType(dictionary)
> {noformat}
> which means that one cannot e.g. serialize batches of categoricals if the 
> possibility of all-NaN batches exists, even when trying to enforce that each 
> batch has the same schema (because the schema is not respected).
> I understand that inferring the type in this case would be difficult, but I'd 
> imagine that a fully specified type should be respected in this case?
> In the meantime, is there a workaround to manually create a dictionary array 
> of the desired type containing only NaNs?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7168) [Python] pa.array() doesn't respect provided dictionary type with all NaNs

2019-11-14 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974511#comment-16974511
 ] 

Joris Van den Bossche commented on ARROW-7168:
--

[~buhrmann] thanks for the report. When passing a type like that, I agree it 
should be honoured.

Some other observations:

Also when it's not all-NaN, the specified type gets ignored:

{code}
In [19]: cat = pd.Categorical(['a', 'b']) 

In [20]: typ = pa.dictionary(index_type=pa.int8(), value_type=pa.int64(), 
ordered=False)  

In [21]: pa.array(cat, type=typ) 
Out[21]: 


-- dictionary:
  [
"a",
"b"
  ]
-- indices:
  [
0,
1
  ]

In [22]: pa.array(cat, type=typ).type  
Out[22]: DictionaryType(dictionary)
{code}

So I suppose it's a more general problem, not specifically related to this 
all-NaN case (it only appears for you in this case, as otherwise the specified 
type and the type from the data will probably match).

In the example I show above, we should probably raise an error if the 
specified type is not compatible (string vs int categories).
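As a side note on the workaround question in the report: a sketch that appears to 
construct such an all-null array with the desired type explicitly, assuming 
{{DictionaryArray.from_arrays}} accepts all-null indices, would be:

{code:python}
import pyarrow as pa

indices = pa.array([None, None], type=pa.int8())
dictionary = pa.array([], type=pa.string())

arr = pa.DictionaryArray.from_arrays(indices, dictionary)
# arr.type should now be the fully specified dictionary<int8, string> type
{code}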

> [Python] pa.array() doesn't respect provided dictionary type with all NaNs
> --
>
> Key: ARROW-7168
> URL: https://issues.apache.org/jira/browse/ARROW-7168
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.1
>Reporter: Thomas Buhrmann
>Priority: Major
>
> This might be related to ARROW-6548 and others dealing with all NaN columns. 
> When creating a dictionary array, even when fully specifying the desired 
> type, this type is not respected when the data contains only NaNs:
> {code:python}
> # This may look a little artificial but easily occurs when processing 
> categorical data in batches and a particular batch contains only NaNs
> ser = pd.Series([None, None]).astype('object').astype('category')
> typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), 
> ordered=False)
> pa.array(ser, type=typ).type
> {code}
> results in
> {noformat}
> >> DictionaryType(dictionary)
> {noformat}
> which means that one cannot e.g. serialize batches of categoricals if the 
> possibility of all-NaN batches exists, even when trying to enforce that each 
> batch has the same schema (because the schema is not respected).
> I understand that inferring the type in this case would be difficult, but I'd 
> imagine that a fully specified type should be respected in this case?
> In the meantime, is there a workaround to manually create a dictionary array 
> of the desired type containing only NaNs?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6926) [Python] Support __sizeof__ protocol for Python objects

2019-11-19 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977581#comment-16977581
 ] 

Joris Van den Bossche commented on ARROW-6926:
--

I started with implementing the {{nbytes}} attribute last week (ARROW-3444, 
which is merged now), with the idea of looking at {{sizeof}} afterwards.

The main question is whether we just want to return what {{nbytes}} does (the number of 
bytes in the buffers), which is what the dask approximation does, or whether we also 
want to include the size of the Cython + C++ object. 

{{sys.getsizeof}} works out of the box for the cython object (but it ignores 
the relevant buffers):

{code}
In [38]: a = pa.array([1, 2])

In [39]: import sys

In [40]: sys.getsizeof(a)
Out[40]: 96
{code}

but when overriding {{\_\_sizeof\_\_}} in Array, I am not sure how to get to 
this number so I can add the nbytes of the buffers to it.




> [Python] Support __sizeof__ protocol for Python objects
> ---
>
> Key: ARROW-6926
> URL: https://issues.apache.org/jira/browse/ARROW-6926
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Matthew Rocklin
>Priority: Minor
> Fix For: 1.0.0
>
>
> It would be helpful if PyArrow objects implemented the `__sizeof__` protocol 
> to give other libraries hints about how much data they have allocated.  This 
> helps systems like Dask, which have to make judgements about whether or not 
> something is cheap to move or taking up a large amount of space.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7209) [Python] tests with pandas master are failing now __from_arrow__ support landed in pandas

2019-11-19 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-7209:


Assignee: Joris Van den Bossche

> [Python] tests with pandas master are failing now __from_arrow__ support 
> landed in pandas
> -
>
> Key: ARROW-7209
> URL: https://issues.apache.org/jira/browse/ARROW-7209
> Project: Apache Arrow
>  Issue Type: Test
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>
> I implemented pandas <-> arrow roundtrip for pandas' integer+string dtype in 
> https://github.com/pandas-dev/pandas/pull/29483, which is now merged. But our 
> tests were assuming this did not yet work in pandas, and thus need to be 
> updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7209) [Python] tests with pandas master are failing now __from_arrow__ support landed in pandas

2019-11-19 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7209:


 Summary: [Python] tests with pandas master are failing now 
__from_arrow__ support landed in pandas
 Key: ARROW-7209
 URL: https://issues.apache.org/jira/browse/ARROW-7209
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


I implemented pandas <-> arrow roundtrip for pandas' integer+string dtype in 
https://github.com/pandas-dev/pandas/pull/29483, which is now merged. But our 
tests were assuming this did not yet work in pandas, and thus need to be 
updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7208) [Python] Passing directory to ParquetFile class gives confusing error message

2019-11-20 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7208:
-
Summary: [Python] Passing directory to ParquetFile class gives confusing 
error message  (was: Arrow using ParquetFile class)

> [Python] Passing directory to ParquetFile class gives confusing error message
> -
>
> Key: ARROW-7208
> URL: https://issues.apache.org/jira/browse/ARROW-7208
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
>Reporter: Roelant Stegmann
>Priority: Major
>
> Somehow have the same errors. We are working with pyarrow 0.15.1, trying to 
> access a folder of `parquet` files generated with Amazon Athena.
> ```python
> table2 = pq.read_table('C:/Data/test-parquet')
> ```
> works fine in contrast to
> ```python
> parquet_file = pq.ParquetFile('C:/Data/test-parquet')
> # parquet_file.read_row_group(0)
> ```
> which raises
> `ArrowIOError: Failed to open local file 'C:/Data/test-parquet', error: 
> Access is denied.`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7208) Arrow using ParquetFile class

2019-11-20 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978177#comment-16978177
 ] 

Joris Van den Bossche commented on ARROW-7208:
--

The {{ParquetFile}} object expects a single file, not a directory of files 
({{read_table}} can handle both). 
If you want to use the object interface for a directory of files, you need to 
use {{pq.ParquetDataset}}.

A better error message would be useful though.
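
To illustrate the distinction (sketch; the exact file name inside the directory 
is hypothetical):

{code:python}
import pyarrow.parquet as pq

# a directory of parquet files -> ParquetDataset (or pq.read_table)
dataset = pq.ParquetDataset('C:/Data/test-parquet')
table = dataset.read()

# a single file -> ParquetFile, e.g. to read one row group
parquet_file = pq.ParquetFile('C:/Data/test-parquet/part-00000.parquet')
first_row_group = parquet_file.read_row_group(0)
{code}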

> Arrow using ParquetFile class
> -
>
> Key: ARROW-7208
> URL: https://issues.apache.org/jira/browse/ARROW-7208
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
>Reporter: Roelant Stegmann
>Priority: Major
>
> Somehow have the same errors. We are working with pyarrow 0.15.1, trying to 
> access a folder of `parquet` files generated with Amazon Athena.
> ```python
> table2 = pq.read_table('C:/Data/test-parquet')
> ```
> works fine in contrast to
> ```python
> parquet_file = pq.ParquetFile('C:/Data/test-parquet')
> # parquet_file.read_row_group(0)
> ```
> which raises
> `ArrowIOError: Failed to open local file 'C:/Data/test-parquet', error: 
> Access is denied.`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7214) [Python] unpickling a pyarrow table with dictionary fields crashes

2019-11-20 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7214:
-
Fix Version/s: 1.0.0

> [Python] unpickling a pyarrow table with dictionary fields crashes
> --
>
> Key: ARROW-7214
> URL: https://issues.apache.org/jira/browse/ARROW-7214
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0, 0.14.1, 0.15.0, 0.15.1
>Reporter: Yevgeni Litvin
>Priority: Major
> Fix For: 1.0.0
>
>
> The following code crashes on this check:
> {code:java}
> F1120 07:51:37.523720 12432 array.cc:773]  Check failed: (data->dictionary) 
> != (nullptr) 
> {code}
>  
> {code:java}
> import cPickle as pickle
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame([{"cat": "a", "val":1},{"cat": "b", "val":2} ])
> df["cat"] = df["cat"].astype('category')index_table = 
> pa.Table.from_pandas(df, preserve_index=False)
> with open('/tmp/zz.pickle', 'wb') as f:
> pickle.dump(index_table, f, protocol=2)
> with open('/tmp/zz.pickle', 'rb') as f:
>index_table = pickle.load(f)
> {code}
>  
> Used Python2 with the following environment:
> {code:java}
> Package Version
> --- ---
> enum34  1.1.6  
> futures 3.3.0  
> numpy   1.16.5 
> pandas  0.24.2 
> pip 19.3.1 
> pyarrow 0.14.1 (0.14.0 and up suffer from this issue)
> python-dateutil 2.8.1  
> pytz2019.3 
> setuptools  41.6.0 
> six 1.13.0 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7214) [Python] unpickling a pyarrow table with dictionary fields crashes

2019-11-20 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978178#comment-16978178
 ] 

Joris Van den Bossche commented on ARROW-7214:
--

[~selitvin] Thanks for the report! I can confirm this crash with latest arrow.

> [Python] unpickling a pyarrow table with dictionary fields crashes
> --
>
> Key: ARROW-7214
> URL: https://issues.apache.org/jira/browse/ARROW-7214
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0, 0.14.1, 0.15.0, 0.15.1
>Reporter: Yevgeni Litvin
>Priority: Major
> Fix For: 1.0.0
>
>
> The following code crashes on this check:
> {code:java}
> F1120 07:51:37.523720 12432 array.cc:773]  Check failed: (data->dictionary) 
> != (nullptr) 
> {code}
>  
> {code:java}
> import cPickle as pickle
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame([{"cat": "a", "val":1},{"cat": "b", "val":2} ])
> df["cat"] = df["cat"].astype('category')index_table = 
> pa.Table.from_pandas(df, preserve_index=False)
> with open('/tmp/zz.pickle', 'wb') as f:
> pickle.dump(index_table, f, protocol=2)
> with open('/tmp/zz.pickle', 'rb') as f:
>index_table = pickle.load(f)
> {code}
>  
> Used Python2 with the following environment:
> {code:java}
> Package Version
> --- ---
> enum34  1.1.6  
> futures 3.3.0  
> numpy   1.16.5 
> pandas  0.24.2 
> pip 19.3.1 
> pyarrow 0.14.1 (0.14.0 and up suffer from this issue)
> python-dateutil 2.8.1  
> pytz2019.3 
> setuptools  41.6.0 
> six 1.13.0 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7214) [Python] unpickling a pyarrow table with dictionary fields crashes

2019-11-20 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-7214:


Assignee: Joris Van den Bossche

> [Python] unpickling a pyarrow table with dictionary fields crashes
> --
>
> Key: ARROW-7214
> URL: https://issues.apache.org/jira/browse/ARROW-7214
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0, 0.14.1, 0.15.0, 0.15.1
>Reporter: Yevgeni Litvin
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 1.0.0
>
>
> The following code crashes on this check:
> {code:java}
> F1120 07:51:37.523720 12432 array.cc:773]  Check failed: (data->dictionary) 
> != (nullptr) 
> {code}
>  
> {code:java}
> import cPickle as pickle
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame([{"cat": "a", "val":1},{"cat": "b", "val":2} ])
> df["cat"] = df["cat"].astype('category')index_table = 
> pa.Table.from_pandas(df, preserve_index=False)
> with open('/tmp/zz.pickle', 'wb') as f:
> pickle.dump(index_table, f, protocol=2)
> with open('/tmp/zz.pickle', 'rb') as f:
>index_table = pickle.load(f)
> {code}
>  
> Used Python2 with the following environment:
> {code:java}
> Package Version
> --- ---
> enum34  1.1.6  
> futures 3.3.0  
> numpy   1.16.5 
> pandas  0.24.2 
> pip 19.3.1 
> pyarrow 0.14.1 (0.14.0 and up suffer from this issue)
> python-dateutil 2.8.1  
> pytz2019.3 
> setuptools  41.6.0 
> six 1.13.0 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7208) [Python] Passing directory to ParquetFile class gives confusing error message

2019-11-20 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978256#comment-16978256
 ] 

Joris Van den Bossche commented on ARROW-7208:
--

Looking at the ParquetDataset docs 
(https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html),
 it's indeed not clear how to read a part of it.

A ParquetDataset contains several "ParquetDatasetPiece"s, accessible via the 
{{pieces}} attribute, and you can then read a single piece (see the sketch 
below). But this part of the API is not really documented. If you only want to 
read a single file from the full directory, you can also create a 
{{ParquetFile}}, specifying the full file path instead of only the directory.
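
A rough sketch with the current API (untested, using the path from the report):

{code:python}
import pyarrow.parquet as pq

dataset = pq.ParquetDataset('C:/Data/test-parquet')

# each piece corresponds to a single file in the directory
piece = dataset.pieces[0]
table = piece.read()
{code}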

> [Python] Passing directory to ParquetFile class gives confusing error message
> -
>
> Key: ARROW-7208
> URL: https://issues.apache.org/jira/browse/ARROW-7208
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
>Reporter: Roelant Stegmann
>Priority: Major
>
> Somehow have the same errors. We are working with pyarrow 0.15.1, trying to 
> access a folder of `parquet` files generated with Amazon Athena.
> ```python
> table2 = pq.read_table('C:/Data/test-parquet')
> ```
> works fine in contrast to
> ```python
> parquet_file = pq.ParquetFile('C:/Data/test-parquet')
> # parquet_file.read_row_group(0)
> ```
> which raises
> `ArrowIOError: Failed to open local file 'C:/Data/test-parquet', error: 
> Access is denied.`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7217) [CI] Docker compose / github actions ignores PYTHON env

2019-11-20 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7217:
-
Summary: [CI] Docker compose / github actions ignores PYTHON env  (was: 
Docker compose / github actions ignores PYTHON env)

> [CI] Docker compose / github actions ignores PYTHON env
> ---
>
> Key: ARROW-7217
> URL: https://issues.apache.org/jira/browse/ARROW-7217
> Project: Apache Arrow
>  Issue Type: Test
>  Components: CI
>Reporter: Joris Van den Bossche
>Priority: Major
>
> The "AMD64 Conda Python 2.7" build is actually using Python 3.6. 
> This python 3.6 version is written in the conda-python.dockerfile: 
> https://github.com/apache/arrow/blob/master/ci/docker/conda-python.dockerfile#L24
>  
> and I am not fully sure whether the ENV variable overrides that or not
> cc [~kszucs]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7217) Docker compose / github actions ignores PYTHON env

2019-11-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7217:


 Summary: Docker compose / github actions ignores PYTHON env
 Key: ARROW-7217
 URL: https://issues.apache.org/jira/browse/ARROW-7217
 Project: Apache Arrow
  Issue Type: Test
  Components: CI
Reporter: Joris Van den Bossche


The "AMD64 Conda Python 2.7" build is actually using Python 3.6. 

This python 3.6 version is written in the conda-python.dockerfile: 
https://github.com/apache/arrow/blob/master/ci/docker/conda-python.dockerfile#L24
 
and I am not fully sure whether the ENV variable overrides that or not

cc [~kszucs]




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7217) [CI] Docker compose / github actions ignores PYTHON env

2019-11-20 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978302#comment-16978302
 ] 

Joris Van den Bossche commented on ARROW-7217:
--

Ah, I see that there is a PYTHON_VERSION in the dockerfile, but the github 
action workflow uses PYTHON.

> [CI] Docker compose / github actions ignores PYTHON env
> ---
>
> Key: ARROW-7217
> URL: https://issues.apache.org/jira/browse/ARROW-7217
> Project: Apache Arrow
>  Issue Type: Test
>  Components: CI
>Reporter: Joris Van den Bossche
>Priority: Major
>
> The "AMD64 Conda Python 2.7" build is actually using Python 3.6. 
> This python 3.6 version is written in the conda-python.dockerfile: 
> https://github.com/apache/arrow/blob/master/ci/docker/conda-python.dockerfile#L24
>  
> and I am not fully sure whether the ENV variable overrides that or not
> cc [~kszucs]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7218) [Python] Conversion from boolean numpy scalars not working

2019-11-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7218:


 Summary: [Python] Conversion from boolean numpy scalars not working
 Key: ARROW-7218
 URL: https://issues.apache.org/jira/browse/ARROW-7218
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


In general, we are fine to accept a list of numpy scalars:

{code}
In [12]: type(list(np.array([1, 2]))[0])
Out[12]: numpy.int64

In [13]: pa.array(list(np.array([1, 2])))
Out[13]:
[
  1,
  2
]
{code}

But for booleans, this doesn't work:

{code}
In [14]: pa.array(list(np.array([True, False])))
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
 in 
----> 1 pa.array(list(np.array([True, False])))

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

ArrowInvalid: Could not convert True with type numpy.bool_: tried to convert to boolean
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7219) [CI][Python] Install pickle5 in the conda-python docker image for python version 3.6

2019-11-20 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978489#comment-16978489
 ] 

Joris Van den Bossche commented on ARROW-7219:
--

There are other optional dependencies for python that would be nice to include 
somewhere as well (s3fs, fastparquet): 
https://github.com/apache/arrow/pull/5562#issuecomment-553782658 

> [CI][Python] Install pickle5 in the conda-python docker image for python 
> version 3.6
> 
>
> Key: ARROW-7219
> URL: https://issues.apache.org/jira/browse/ARROW-7219
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Python
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 1.0.0
>
>
> See conversation 
> https://github.com/apache/arrow/pull/5873#discussion_r348510729



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7220) [CI] Docker compose (github actions) Mac Python 3 build is using Python 2

2019-11-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7220:


 Summary: [CI] Docker compose (github actions) Mac Python 3 build 
is using Python 2
 Key: ARROW-7220
 URL: https://issues.apache.org/jira/browse/ARROW-7220
 Project: Apache Arrow
  Issue Type: Test
Reporter: Joris Van den Bossche


The "AMD64 MacOS 10.15 Python 3" build is also running in python 2.

Possibly related to how brew is installing python 2 / 3, or because it is using 
the system python, ... (not familiar with mac)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7220) [CI] Docker compose (github actions) Mac Python 3 build is using Python 2

2019-11-20 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7220:
-
Component/s: CI

> [CI] Docker compose (github actions) Mac Python 3 build is using Python 2
> -
>
> Key: ARROW-7220
> URL: https://issues.apache.org/jira/browse/ARROW-7220
> Project: Apache Arrow
>  Issue Type: Test
>  Components: CI
>Reporter: Joris Van den Bossche
>Priority: Major
>
> The "AMD64 MacOS 10.15 Python 3" build is also running in python 2.
> Possibly related to how brew is installing python 2 / 3, or because it is 
> using the system python, ... (not familiar with mac)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6926) [Python] Support __sizeof__ protocol for Python objects

2019-11-20 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978546#comment-16978546
 ] 

Joris Van den Bossche commented on ARROW-6926:
--

Ah, thanks. But it seems cython is adding a bit more still:

{code}
In [21]: a = pa.array([1]*10)

In [22]: sys.getsizeof(a)
Out[22]: 96

In [23]: object.__sizeof__(a)
Out[23]: 72
{code}

(not sure how much we care about those small numbers; in reality users will 
mainly care about big arrays where the nbytes dominates the result)
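
For reference, a standalone sketch of the combination discussed here (this is 
an illustration, not the actual implementation):

{code:python}
import sys
import pyarrow as pa

def sizeof(arr):
    # base size of the Python/cython wrapper plus the Arrow buffer sizes
    return object.__sizeof__(arr) + arr.nbytes

a = pa.array([1] * 10)
print(sys.getsizeof(a), sizeof(a))
{code}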

> [Python] Support __sizeof__ protocol for Python objects
> ---
>
> Key: ARROW-6926
> URL: https://issues.apache.org/jira/browse/ARROW-6926
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Matthew Rocklin
>Priority: Minor
> Fix For: 1.0.0
>
>
> It would be helpful if PyArrow objects implemented the `__sizeof__` protocol 
> to give other libraries hints about how much data they have allocated.  This 
> helps systems like Dask, which have to make judgements about whether or not 
> something is cheap to move or taking up a large amount of space.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6926) [Python] Support __sizeof__ protocol for Python objects

2019-11-20 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978558#comment-16978558
 ] 

Joris Van den Bossche commented on ARROW-6926:
--

OK, thanks!

> [Python] Support __sizeof__ protocol for Python objects
> ---
>
> Key: ARROW-6926
> URL: https://issues.apache.org/jira/browse/ARROW-6926
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Matthew Rocklin
>Priority: Minor
> Fix For: 1.0.0
>
>
> It would be helpful if PyArrow objects implemented the `__sizeof__` protocol 
> to give other libraries hints about how much data they have allocated.  This 
> helps systems like Dask, which have to make judgements about whether or not 
> something is cheap to move or taking up a large amount of space.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7222) [Python] Wipe any existing generated Python API documentation when updating website

2019-11-21 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979100#comment-16979100
 ] 

Joris Van den Bossche commented on ARROW-7222:
--

It could also be an option to keep older versions in a /docs/version/xx/ 
directory? (although that's maybe a bit of unnecessary overhead for now)

> [Python] Wipe any existing generated Python API documentation when updating 
> website
> ---
>
> Key: ARROW-7222
> URL: https://issues.apache.org/jira/browse/ARROW-7222
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Removed APIs are persisting in Google searches, e.g.
> https://arrow.apache.org/docs/python/generated/pyarrow.Column.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7222) [Python] Wipe any existing generated Python API documentation when updating website

2019-11-21 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979135#comment-16979135
 ] 

Joris Van den Bossche commented on ARROW-7222:
--

It's indeed a different problem (and solving it now will require explicit 
action), but the solution to prevent it from happening again in the future 
might be related. 
Eg in pandas, we put the docs for each version in a /version/xx/ directory, and 
then /stable/ is a symlink to the latest version (which then needs to be 
updated when releasing). That way, you never overwrite the existing docs with a 
new set of files, potentially leaving older ones behind (now, ensuring the old 
ones are deleted when overwriting the docs should also not be hard, of course).

> [Python] Wipe any existing generated Python API documentation when updating 
> website
> ---
>
> Key: ARROW-7222
> URL: https://issues.apache.org/jira/browse/ARROW-7222
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Removed APIs are persisting in Google searches, e.g.
> https://arrow.apache.org/docs/python/generated/pyarrow.Column.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels

2019-11-21 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979326#comment-16979326
 ] 

Joris Van den Bossche commented on ARROW-1644:
--

[~RinkeHoekstra] that looks unrelated (the json reader is mostly independent 
of the parquet IO). Can you open a separate JIRA ticket?

> [C++][Parquet] Read and write nested Parquet data with a mix of struct and 
> list nesting levels
> --
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: DB Tsai
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 1.0.0
>
>
> We have many nested parquet files generated from Apache Spark for ranking 
> problems, and we would like to load them in python for other programs to 
> consume. 
> The schema looks like 
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- show_title_id: integer (nullable = true)
>  |||-- duration: double (nullable = true)
> {code}
> And when I tried to load it with nightly build pyarrow on Oct 4, 2017, I got 
> the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 823, in read_table
> use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 119, in read
> nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I somehow get the impression that after 
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be 
> able to load the nested parquet in pyarrow. 
> Any insight about this? 
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7226) [JSON][Python] Json loader fails on example in documentation.

2019-11-21 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979375#comment-16979375
 ] 

Joris Van den Bossche commented on ARROW-7226:
--

So this may not be adequately documented, but currently the json reader _only_ 
supports line-delimited json, which is why the documentation shows the example 
using that format.
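
For example, with a newline-delimited file (one JSON object per line, no 
enclosing list; the file name below is hypothetical):

{code:python}
from pyarrow import json

# my_data.jsonl contains:
#   {"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}
#   {"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}
table = json.read_json("my_data.jsonl")
{code}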

> [JSON][Python] Json loader fails on example in documentation.
> -
>
> Key: ARROW-7226
> URL: https://issues.apache.org/jira/browse/ARROW-7226
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Rinke Hoekstra
>Priority: Major
>
> I was just trying this with the example found in the pyarrow docs at 
> [http://arrow.apache.org/docs/python/json.html]
> The documented example does not work. Is this related to this issue, or is it 
> another matter?
> It says to load the following JSON file:
> {{{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"
>  {{{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"
> I fixed this to make it valid JSON (It is valid [JSON 
> Lines|[http://jsonlines.org/]], but that's another issue):
> {{[{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}},}}
>  {{{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}]}}
> Then reading the JSON from a file called `my_data.json`:
> {{from pyarrow import json}}
>  {{table = json.read_json("my_data.json")}}
> Gives the following error:
> {code:java}
> ---
>  ArrowInvalid Traceback (most recent call last)
>   in ()
>  1 from pyarrow import json
>  > 2 table = json.read_json('test.json')
> ~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/_json.pyx
>  in pyarrow._json.read_json()
> ~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()
> ArrowInvalid: JSON parse error: A column changed from object to array
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7261) [Python] Python support for fixed size list type

2019-11-26 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7261:


 Summary: [Python] Python support for fixed size list type
 Key: ARROW-7261
 URL: https://issues.apache.org/jira/browse/ARROW-7261
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


I didn't see any issue about this, but {{FixedSizeListArray}} (ARROW-1280) is 
not yet exposed in Python.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7268) [Rust] Propagate `custom_metadata` field from IPC message

2019-11-27 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7268:
-
Summary: [Rust] Propagate `custom_metadata` field from IPC message  (was: 
Propagate `custom_metadata` field from IPC message)

> [Rust] Propagate `custom_metadata` field from IPC message
> -
>
> Key: ARROW-7268
> URL: https://issues.apache.org/jira/browse/ARROW-7268
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Martin Grund
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Right now, the custom metadata field in the Schema IPC message is not 
> propagated from the IPC message to the internal data type. To be closer to 
> parity compared to the other implementations it would be good to add the 
> necessary logic to serialize and deserialize.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7266) [Python] dictionary_encode() of a slice gives wrong result

2019-11-27 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7266:
-
Summary: [Python] dictionary_encode() of a slice gives wrong result  (was: 
dictionary_encode() of a slice gives wrong result)

> [Python] dictionary_encode() of a slice gives wrong result
> --
>
> Key: ARROW-7266
> URL: https://issues.apache.org/jira/browse/ARROW-7266
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.1
> Environment: Docker on Linux 5.2.18-200.fc30.x86_64; Python 3.7.4
>Reporter: Adam Hooper
>Priority: Major
>
> Steps to reproduce:
> {code:python}
> import pyarrow as pa
> arr = pa.array(["a", "b", "b", "b"])[1:]
> arr.dictionary_encode()
> {code}
> Expected results:
> {code}
> -- dictionary:
>   [
> "b"
>   ]
> -- indices:
>   [
> 0,
> 0,
> 0
>   ]
> {code}
> Actual results:
> {code}
> -- dictionary:
>   [
> "b",
> ""
>   ]
> -- indices:
>   [
> 0,
> 0,
> 1
>   ]
> {code}
> I don't know a workaround. Converting to pylist and back is too slow. Is 
> there a way to copy the slice to a new offset-0 StringArray that I could then 
> dictionary-encode? Otherwise, I'm considering building buffers by hand



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7266) [Python] dictionary_encode() of a slice gives wrong result

2019-11-27 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983548#comment-16983548
 ] 

Joris Van den Bossche commented on ARROW-7266:
--

[~adamhooper] Thanks for the report!

This seems to be specific to the string type, as I don't see a similar bug for 
integer type:

{code}
In [7]: a = pa.array(['a', 'b', 'c', 'b'])

In [9]: a[1:].dictionary_encode()
Out[9]:
-- dictionary:
  [
    "c",
    "b",
    ""
  ]
-- indices:
  [
    0,
    1,
    2
  ]

In [10]: a = pa.array([1, 2, 3, 2])

In [12]: a[1:].dictionary_encode()
Out[12]:
-- dictionary:
  [
    2,
    3
  ]
-- indices:
  [
    0,
    1,
    0
  ]
{code}


>  Is there a way to copy the slice to a new offset-0 StringArray that I could 
> then dictionary-encode? 

At least in the current pyarrow API, I don't think such functionality is 
exposed (apart from getting the buffers, slicing/copying, and recreating an 
array).

> [Python] dictionary_encode() of a slice gives wrong result
> --
>
> Key: ARROW-7266
> URL: https://issues.apache.org/jira/browse/ARROW-7266
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.1
> Environment: Docker on Linux 5.2.18-200.fc30.x86_64; Python 3.7.4
>Reporter: Adam Hooper
>Priority: Major
>
> Steps to reproduce:
> {code:python}
> import pyarrow as pa
> arr = pa.array(["a", "b", "b", "b"])[1:]
> arr.dictionary_encode()
> {code}
> Expected results:
> {code}
> -- dictionary:
>   [
> "b"
>   ]
> -- indices:
>   [
> 0,
> 0,
> 0
>   ]
> {code}
> Actual results:
> {code}
> -- dictionary:
>   [
> "b",
> ""
>   ]
> -- indices:
>   [
> 0,
> 0,
> 1
>   ]
> {code}
> I don't know a workaround. Converting to pylist and back is too slow. Is 
> there a way to copy the slice to a new offset-0 StringArray that I could then 
> dictionary-encode? Otherwise, I'm considering building buffers by hand



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6876) [Python] Reading parquet file with many columns becomes slow for 0.15.0

2019-11-27 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983565#comment-16983565
 ] 

Joris Van den Bossche commented on ARROW-6876:
--

[~axelg] would you be able to share a reproducible example? (eg the data, or 
code that creates a dummy dataset with the same characteristics that shows the 
problem)

> [Python] Reading parquet file with many columns becomes slow for 0.15.0
> ---
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: python3.7
>Reporter: Bob
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
> Attachments: image-2019-10-14-18-10-42-850.png, 
> image-2019-10-14-18-12-07-652.png
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
>  In [4]: %timeit df = pd.read_parquet(path)
>  2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
>  In [5]: %timeit df = pd.read_parquet(path)
>  22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complain? What could be the issue here?
>  
> Thanks a lot.
>  
>  
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>  
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6876) [Python] Reading parquet file with many columns becomes slow for 0.15.0

2019-11-27 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983593#comment-16983593
 ] 

Joris Van den Bossche commented on ARROW-6876:
--

Ah, sorry, I missed the "With the reproducer above:" in your message. 

I see a similar difference locally; it's indeed not the speed-up that [~wesm] 
reported on the PR: 
https://github.com/apache/arrow/pull/5653#issuecomment-541901845 (this might 
depend on the machine / number of cores?)

> [Python] Reading parquet file with many columns becomes slow for 0.15.0
> ---
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: python3.7
>Reporter: Bob
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
> Attachments: image-2019-10-14-18-10-42-850.png, 
> image-2019-10-14-18-12-07-652.png
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
>  In [4]: %timeit df = pd.read_parquet(path)
>  2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
>  In [5]: %timeit df = pd.read_parquet(path)
>  22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complain? What could be the issue here?
>  
> Thanks a lot.
>  
>  
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>  
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

