Re: Help in reconciling how arrow helps with columnar processing?

2017-12-04 Thread Matan Safriel
Thanks Wes,

This really makes a lot of sense, and I'll keep the references for future use!

Matan

Sent from my iPad

> On 4 Dec 2017, at 17:52, Wes McKinney wrote:
> 
> hi Matan,
> 
> I recommend this presentation for a detailed discussion of these
> points: 
> https://www.slideshare.net/julienledem/the-columnar-roadmap-apache-parquet-and-apache-arrow
> 
> To your questions:
> 
> 1. Arrow's "fully shredded" columnar representation ensures a few things
> 
> * Reliable data locality for scan operations on all data types (for
> example, consecutive strings in Arrow are guaranteed to be next to
> each other in memory)
> * Contiguous memory plus buffer alignment / padding permits consistent
> use of SIMD, if available
> 
> 2. We have developed a zero-copy messaging / IPC framework that
> enables interacting with arbitrary-size Arrow memory in any virtual
> address space without copying or deserialization -- in brief, a
> dataset is accompanied by a metadata descriptor (serialized using the
> Google Flatbuffers library) that indicates the locations of each
> memory block constituting a particular column in a particular table.
> So we can locate the memory offset corresponding to a particular cell
> in a dataset in O(1) time, and scan data in shared memory / memory
> maps without having to materialize copies in RAM.
> 
> See http://arrow.apache.org/docs/ipc.html for a discussion of the
> messaging protocol. Earlier this year I wrote about how this enables
> very fast movement of streaming tabular data in
> http://wesmckinney.com/blog/arrow-streaming-columnar/
> 
> Thanks
> Wes
> 
>> On Sat, Dec 2, 2017 at 11:50 AM, Daniel Lemire wrote:
>> I don't know the answer per se but my understanding is that
>> Arrow enables computational kernels that can be highly optimized.
>> I plan to do some work in this direction myself.
>> 
>> - Daniel
>> 
>> 
>> Hi,
>>> 
>>> I wonder if anyone can comment on how Apache Arrow accomplishes, or helps
>>> accomplish, the following, taken from the Apache page:
>>> 
>>> Apache Arrow™ enables execution engines to take advantage of the latest
>>> SIMD (Single Instruction, Multiple Data) operations included in modern processors,
>>> for native vectorized optimization of analytical data processing. Columnar
>>> layout is optimized for data locality for better performance on modern
>>> hardware like CPUs and GPUs.
>>> 
>>> The Arrow memory format supports *zero-copy reads* for lightning-fast data
>>> access without serialization overhead.
>>> Can anyone provide information concerning how the standard specifically
>>> helps with those concerns, in particular the ones highlighted above?
>>> 
>>> Disclaimer: I've not read the source or the source of the related repos.
>>> 
>>> Many thanks!
>>> Matan
>>> 


[jira] [Created] (ARROW-1886) [Python] Add function to "explode" structs within tables

2017-12-04 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1886:
---

 Summary: [Python] Add function to "explode" structs within tables
 Key: ARROW-1886
 URL: https://issues.apache.org/jira/browse/ARROW-1886
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


See discussion in https://issues.apache.org/jira/browse/ARROW-1873

When a user has a struct column, it may be more efficient to flatten the struct 
into multiple columns of the form {{struct_name.field_name}}, one per field in 
the struct. Then when you call {{to_pandas}}, Python dictionaries do not have 
to be created, and the conversion will be much more efficient.
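
A rough sketch of the proposed behavior in plain Python (hypothetical data and 
names; not the eventual pyarrow API):

```
# "Explode" a struct column into one flat column per field, named
# struct_name.field_name, so to_pandas() never needs per-row dicts.
struct_name = "coords"
rows = [{"x": 1, "y": 2}, {"x": 3, "y": 4}]

flat = {
    "%s.%s" % (struct_name, field): [row[field] for row in rows]
    for field in rows[0]
}

assert flat == {"coords.x": [1, 3], "coords.y": [2, 4]}
```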





Coordinating Arrow 0.8.0 end-game

2017-12-04 Thread Wes McKinney
hi folks,

I wanted to start a coordination thread to stay on top of the
remaining items needed to reach a releasable state for 0.8.0, by the
end of this week or the beginning of next week with any luck. There
are a number of housekeeping items in progress on the C++/Python
side, but some of these are non-essential, so we'll try to get as
much done as we can.

Here are the major things that seem to be in flight:

Format changes
--

* ARROW-1785 Removing VectorLayout from metadata
https://github.com/apache/arrow/pull/1297

Other Java changes
--
* ARROW-1864 Upgrading Netty to 4.1.x https://github.com/apache/arrow/pull/1376

Other needed C++ changes
--
* ARROW-1882: Restoring DictionaryBuilder

Changes under discussion
--
* ARROW-1816 Possible change to Timestamp class structure
https://github.com/apache/arrow/pull/1330
* ARROW-1866 / ARROW-1815 Handling of NonNullableMapVector
https://github.com/apache/arrow/pull/1371

Other changes in TODO
--
* ARROW-1868: Change vector getMinorType to use MinorType instead of
Types.MinorType
* ARROW-1867: Add BitVector APIs from old vector class
* ARROW-1818: Examine Java Dependencies

Out of these, the ARROW-1785 VectorLayout removal and the Netty
upgrade are probably the most critical -- these should be ready to be
merged once they've been sufficiently reviewed. What else is essential
/ cannot fall through the cracks? If we are not able to arrive at
consensus on all other items, I would prefer not to delay the release
any further.

Thanks
Wes


Re: Help in reconciling how arrow helps with columnar processing?

2017-12-04 Thread Wes McKinney
hi Matan,

I recommend this presentation for a detailed discussion of these
points: 
https://www.slideshare.net/julienledem/the-columnar-roadmap-apache-parquet-and-apache-arrow

To your questions:

1. Arrow's "fully shredded" columnar representation ensures a few things

* Reliable data locality for scan operations on all data types (for
example, consecutive strings in Arrow are guaranteed to be next to
each other in memory; see the sketch below)
* Contiguous memory plus buffer alignment / padding permits consistent
use of SIMD, if available
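
To make the locality point concrete, here is a minimal sketch in plain
Python (illustrative names only, not a pyarrow API) of how Arrow lays out
a string column as one contiguous values buffer plus an offsets buffer:

```
# Minimal sketch of Arrow's variable-length string layout: all characters
# live in one contiguous values buffer; an offsets buffer marks boundaries.
strings = ["arrow", "columnar", "simd"]

values = "".join(strings).encode("utf-8")   # consecutive strings are adjacent
offsets = [0]
for s in strings:
    offsets.append(offsets[-1] + len(s.encode("utf-8")))

# The i-th string is values[offsets[i]:offsets[i+1]], so a scan reads one
# sequential buffer instead of chasing per-string pointers.
for i, s in enumerate(strings):
    assert values[offsets[i]:offsets[i + 1]].decode("utf-8") == s
```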

2. We have developed a zero-copy messaging / IPC framework that
enables interacting with arbitrary-size Arrow memory in any virtual
address space without copying or deserialization -- in brief, a
dataset is accompanied by a metadata descriptor (serialized using the
Google Flatbuffers library) that indicates the locations of each
memory block constituting a particular column in a particular table.
So we can locate the memory offset corresponding to a particular cell
in a dataset in O(1) time, and scan data in shared memory / memory
maps without having to materialize copies in RAM.
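
As a rough illustration of the O(1) lookup (a sketch under assumed names,
not the actual Flatbuffers metadata structures): for a fixed-width column,
the metadata gives a buffer's starting offset and the value width, so
locating a cell is pure arithmetic:

```
import struct

def cell_offset(buffer_start, value_width, row):
    # No scan or deserialization: arithmetic on metadata alone.
    return buffer_start + row * value_width

# Stand-in for an int64 buffer in shared memory / a memory map.
buf = struct.pack("<4q", 10, 20, 30, 40)
off = cell_offset(0, 8, 2)
(value,) = struct.unpack_from("<q", buf, off)
assert value == 30
```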

See http://arrow.apache.org/docs/ipc.html for a discussion of the
messaging protocol. Earlier this year I wrote about how this enables
very fast movement of streaming tabular data in
http://wesmckinney.com/blog/arrow-streaming-columnar/

Thanks
Wes

On Sat, Dec 2, 2017 at 11:50 AM, Daniel Lemire wrote:
> I don't know the answer per se but my understanding is that
> Arrow enables computational kernels that can be highly optimized.
> I plan to do some work in this direction myself.
>
> - Daniel
>
>
> Hi,
>>
>> I wonder if anyone can comment on how Apache Arrow accomplishes, or helps
>> accomplish, the following, taken from the Apache page:
>>
>> Apache Arrow™ enables execution engines to take advantage of the latest
>> SIMD (Single Instruction, Multiple Data) operations included in modern processors,
>> for native vectorized optimization of analytical data processing. Columnar
>> layout is optimized for data locality for better performance on modern
>> hardware like CPUs and GPUs.
>>
>> The Arrow memory format supports *zero-copy reads* for lightning-fast data
>> access without serialization overhead.
>> Can anyone provide information concerning how the standard specifically
>> helps with those concerns, in particular the ones highlighted above?
>>
>> Disclaimer: I've not read the source or the source of the related repos.
>>
>> Many thanks!
>> Matan
>>


[jira] [Created] (ARROW-1884) [C++] Make JsonReader/JsonWriter classes internal APIs

2017-12-04 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1884:
---

 Summary: [C++] Make JsonReader/JsonWriter classes internal APIs
 Key: ARROW-1884
 URL: https://issues.apache.org/jira/browse/ARROW-1884
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.8.0


These are exposed in the public API in {{arrow::ipc}}, and could possibly 
mislead users: http://arrow.apache.org/docs/cpp/namespacearrow_1_1ipc.html





[jira] [Created] (ARROW-1883) [Python] BUG: Table.to_pandas metadata checking fails if columns are not present

2017-12-04 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-1883:
---

 Summary: [Python] BUG: Table.to_pandas metadata checking fails if 
columns are not present
 Key: ARROW-1883
 URL: https://issues.apache.org/jira/browse/ARROW-1883
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.7.1
Reporter: Joris Van den Bossche


Found this bug in the example in the pandas documentation, which does:

```
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': list('abc'),
                   'b': list(range(1, 4)),
                   'c': np.arange(3, 6).astype('u1'),
                   'd': np.arange(4.0, 7.0, dtype='float64'),
                   'e': [True, False, True],
                   'f': pd.date_range('20130101', periods=3),
                   'g': pd.date_range('20130101', periods=3, tz='US/Eastern')})

df.to_parquet('example_pa.parquet', engine='pyarrow')

pd.read_parquet('example_pa.parquet', engine='pyarrow', columns=['a', 'b'])
```

and the last line, which reads a subset of the columns, raises:

```
...
/home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py
 in _add_any_metadata(table, pandas_metadata)
357 for i, col_meta in enumerate(pandas_metadata['columns']):
358 if col_meta['pandas_type'] == 'datetimetz':
--> 359 col = table[i]
360 converted = col.to_pandas()
361 tz = col_meta['metadata']['timezone']

table.pxi in pyarrow.lib.Table.__getitem__()

table.pxi in pyarrow.lib.Table.column()

IndexError: Table column index 6 is out of range
```

This is due to checking the `pandas_metadata` for all columns (here, trying to 
handle a datetime-tz column), while in practice not all of those columns are 
present in the table (a 'mismatch' between the pandas metadata and the actual 
schema).
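
A minimal sketch of a possible guard (an assumption about the fix, not the 
actual pyarrow patch): look columns up by name in the pandas metadata and skip 
entries absent from the table, instead of indexing into the table by position:

```
def tz_columns_present(table_column_names, pandas_metadata):
    # Collect (name, timezone) only for datetimetz columns that still exist.
    out = []
    for col_meta in pandas_metadata["columns"]:
        name = col_meta["name"]
        if name not in table_column_names:
            continue  # dropped column (e.g. columns=['a', 'b'] on read)
        if col_meta["pandas_type"] == "datetimetz":
            out.append((name, col_meta["metadata"]["timezone"]))
    return out

meta = {"columns": [
    {"name": "a", "pandas_type": "int64", "metadata": None},
    {"name": "b", "pandas_type": "datetimetz",
     "metadata": {"timezone": "Europe/Brussels"}},
]}
assert tz_columns_present(["a"], meta) == []
assert tz_columns_present(["a", "b"], meta) == [("b", "Europe/Brussels")]
```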

A smaller example without parquet:

```
In [38]: df = pd.DataFrame({'a': [1, 2, 3], 'b': pd.date_range("2017-01-01", 
periods=3, tz='Europe/Brussels')})

In [39]: table = pyarrow.Table.from_pandas(df)

In [40]: table
Out[40]: 
pyarrow.Table
a: int64
b: timestamp[ns, tz=Europe/Brussels]
__index_level_0__: int64
metadata

{b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
b': "0.22.0.dev0+277.gd61f411"}'}

In [41]: table.to_pandas()
Out[41]: 
   a                         b
0  1 2017-01-01 00:00:00+01:00
1  2 2017-01-02 00:00:00+01:00
2  3 2017-01-03 00:00:00+01:00

In [44]: table_without_tz = table.remove_column(1)

In [45]: table_without_tz
Out[45]: 
pyarrow.Table
a: int64
__index_level_0__: int64
metadata

{b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
b': "0.22.0.dev0+277.gd61f411"}'}

In [46]: table_without_tz.to_pandas()  # <-- wrong output !
Out[46]: 
                             a
1970-01-01 01:00:00+01:00    1
1970-01-01 01:00:00.1+01:00  2
1970-01-01 01:00:00.2+01:00  3

In [47]: table_without_tz2 = table_without_tz.remove_column(1)

In [48]: table_without_tz2
Out[48]: 
pyarrow.Table
a: int64
metadata

{b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
b': "0.22.0.dev0+277.gd61f411"}'}

In [49]: table_without_tz2.to_pandas() # <-- error !
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-49-...> in <module>()
----> 1 table_without_tz2.to_pandas()

table.pxi in pyarrow.lib.Table.to_pandas()

/home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py
 in table_to_blockmanager(options, table, memory_pool, nthreads)
289 

[jira] [Created] (ARROW-1882) [C++] Reintroduce DictionaryBuilder

2017-12-04 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-1882:
--

 Summary: [C++] Reintroduce DictionaryBuilder
 Key: ARROW-1882
 URL: https://issues.apache.org/jira/browse/ARROW-1882
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Uwe L. Korn
Priority: Critical
 Fix For: 0.8.0


We need the {{DictionaryBuilder}} to incrementally build Arrow Arrays of 
{{DictionaryType}}. The kernels only support en-bloc conversion of whole 
Arrays, which yields higher memory usage.
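
A minimal sketch of the incremental idea in plain Python (not the C++ 
{{DictionaryBuilder}} API): values are appended one at a time, so the 
un-encoded array never has to be materialized en bloc:

```
dictionary = {}   # value -> dictionary index (insertion-ordered, Python 3.6+)
indices = []      # the encoded column

for value in ["apple", "pear", "apple", "fig", "pear"]:
    idx = dictionary.setdefault(value, len(dictionary))
    indices.append(idx)

assert list(dictionary) == ["apple", "pear", "fig"]
assert indices == [0, 1, 0, 2, 1]
```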





[jira] [Created] (ARROW-1881) [Python] setuptools_scm picks up JS version tags

2017-12-04 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-1881:
--

 Summary: [Python] setuptools_scm picks up JS version tags
 Key: ARROW-1881
 URL: https://issues.apache.org/jira/browse/ARROW-1881
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Uwe L. Korn
 Fix For: 0.8.0


Building wheels from the current master produces 
{{pyarrow-0.2.1.dev15+g3b438bc-cp36-cp36m-manylinux1_x86_64.whl}}, i.e. 
setuptools_scm derives the version from a JS tag rather than an Arrow release 
tag.
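
One possible direction (a sketch assuming setuptools_scm's {{tag_regex}} 
option and release tags of the form {{apache-arrow-X.Y.Z}}; not necessarily 
the fix that will land) is to restrict which tags the version is derived from:

```
# setup.py -- sketch only; the tag pattern is an assumption
from setuptools import setup

setup(
    name="pyarrow",
    use_scm_version={
        # Match only Arrow release tags like "apache-arrow-0.8.0" and
        # ignore JS tags like "apache-arrow-js-0.2.1".
        "tag_regex": r"^apache-arrow-(?P<version>\d+\.\d+\.\d+)$",
    },
    setup_requires=["setuptools_scm"],
)
```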


