[jira] [Created] (ARROW-1897) Incorrect numpy_type for pandas metadata of Categoricals

2017-12-07 Thread Tom Augspurger (JIRA)
Tom Augspurger created ARROW-1897:
-

 Summary: Incorrect numpy_type for pandas metadata of Categoricals
 Key: ARROW-1897
 URL: https://issues.apache.org/jira/browse/ARROW-1897
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Tom Augspurger
 Fix For: 0.9.0


If I'm reading 
http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
 correctly, the "numpy_type" field of a `Categorical` should be the storage 
type used for the *codes*. It looks like pyarrow is just using 'object' always.

{code}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: import pyarrow.parquet as pq

In [4]: import io

In [5]: import json

In [6]: df = pd.DataFrame({"A": [1, 2]},
   ...:   index=pd.CategoricalIndex(['one', 'two'], name='idx'))
   ...:
In [8]: sink = io.BytesIO()
   ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
   ...: json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
   ...:
Out[8]:
{'field_name': '__index_level_0__',
 'metadata': {'num_categories': 2, 'ordered': False},
 'name': 'idx',
 'numpy_type': 'object',
 'pandas_type': 'categorical'}

{code}

From the spec:

> The numpy_type is the physical storage type of the column, which is the 
> result of str(dtype) for the underlying NumPy array that holds the data. So 
> for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> the supported integer categorical types.

So 'numpy_type' field should be something like `'int8'` instead of `'object'`
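
A minimal sketch of what the spec implies, for reference (standard pandas API;
pandas picks the smallest integer dtype that fits the codes):

{code}
import pandas as pd

cat = pd.Categorical(['one', 'two'])
# The categories themselves are object dtype, but the codes backing the
# column are int8 here, which is what 'numpy_type' should record.
print(cat.categories.dtype)  # object
print(cat.codes.dtype)       # int8
{code}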





[jira] [Updated] (ARROW-1897) Incorrect numpy_type for pandas metadata of Categoricals

2017-12-07 Thread Tom Augspurger (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Augspurger updated ARROW-1897:
--
Description: 
If I'm reading 
http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
 correctly, the "numpy_type" field of a `Categorical` should be the storage 
type used for the *codes*. It looks like pyarrow is just using 'object' always.

{code}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: import pyarrow.parquet as pq

In [4]: import io

In [5]: import json

In [6]: df = pd.DataFrame({"A": [1, 2]},
   ...:   index=pd.CategoricalIndex(['one', 'two'], name='idx'))
   ...:
In [8]: sink = io.BytesIO()
   ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
   ...: json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
   ...:
Out[8]:
{'field_name': '__index_level_0__',
 'metadata': {'num_categories': 2, 'ordered': False},
 'name': 'idx',
 'numpy_type': 'object',
 'pandas_type': 'categorical'}

{code}

From the spec:

> The numpy_type is the physical storage type of the column, which is the 
> result of str(dtype) for the underlying NumPy array that holds the data. So 
> for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> the supported integer categorical types.

So 'numpy_type' field should be something like `'int8'` instead of `'object'`

  was:
If I'm reading 
http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
 correctly, the "numpy_type" field of a `Categorical` should be the storage 
type used for the *codes*. It looks like pyarrow is just using 'object' always.

{code}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: import pyarrow.parquet as pq

In [4]: import io

In [5]: import json

In [6]: df = pd.DataFrame({"A": [1, 2]},
   ...:   index=pd.CategoricalIndex(['one', 'two'], name='idx'))
   ...:
In [8]: sink = io.BytesIO()
   ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
   ...: 
json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
   ...:
Out[8]:
{'field_name': '__index_level_0__',
 'metadata': {'num_categories': 2, 'ordered': False},
 'name': 'idx',
 'numpy_type': 'object',
 'pandas_type': 'categorical'}

{code}

From the spec:

> The numpy_type is the physical storage type of the column, which is the 
> result of str(dtype) for the underlying NumPy array that holds the data. So 
> for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> the supported integer categorical types.

So 'numpy_type' field should be something like `'int8'` instead of `'object'`


> Incorrect numpy_type for pandas metadata of Categoricals
> 
>
> Key: ARROW-1897
> URL: https://issues.apache.org/jira/browse/ARROW-1897
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Tom Augspurger
>  Labels: categorical, metadata, pandas, parquet, pyarrow
> Fix For: 0.9.0
>
>
> If I'm reading 
> http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
>  correctly, the "numpy_type" field of a `Categorical` should be the storage 
> type used for the *codes*. It looks like pyarrow is just using 'object' 
> always.
> {code}
> In [1]: import pandas as pd
> In [2]: import pyarrow as pa
> In [3]: import pyarrow.parquet as pq
> In [4]: import io
> In [5]: import json
> In [6]: df = pd.DataFrame({"A": [1, 2]},
>    ...:   index=pd.CategoricalIndex(['one', 'two'], name='idx'))
>    ...:
> In [8]: sink = io.BytesIO()
>    ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
>    ...: json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
>    ...:
> Out[8]:
> {'field_name': '__index_level_0__',
>  'metadata': {'num_categories': 2, 'ordered': False},
>  'name': 'idx',
>  'numpy_type': 'object',
>  'pandas_type': 'categorical'}
> {code}
> From the spec:
> > The numpy_type is the physical storage type of the column, which is the 
> > result of str(dtype) for the underlying NumPy array that holds the data. So 
> > for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> > the supported integer categorical types.
> So 'numpy_type' field should be something like `'int8'` instead of `'object'`





[jira] [Updated] (ARROW-1897) Incorrect numpy_type for pandas metadata of Categoricals

2017-12-07 Thread Tom Augspurger (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Augspurger updated ARROW-1897:
--
Description: 
If I'm reading 
http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
 correctly, the "numpy_type" field of a `Categorical` should be the storage 
type used for the *codes*. It looks like pyarrow is just using 'object' always.

{code}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: import pyarrow.parquet as pq

In [4]: import io

In [5]: import json

In [6]: df = pd.DataFrame({"A": [1, 2]},
   ...:   index=pd.CategoricalIndex(['one', 'two'], name='idx'))
   ...:
In [8]: sink = io.BytesIO()
   ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
   ...: json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
   ...:
Out[8]:
{'field_name': '__index_level_0__',
 'metadata': {'num_categories': 2, 'ordered': False},
 'name': 'idx',
 'numpy_type': 'object',
 'pandas_type': 'categorical'}
{code}

From the spec:

bq. The numpy_type is the physical storage type of the column, which is the 
result of str(dtype) for the underlying NumPy array that holds the data. So for 
datetimetz this is datetime64[ns] and for categorical, it may be any of the 
supported integer categorical types.

So 'numpy_type' field should be something like `'int8'` instead of `'object'`

  was:
If I'm reading 
http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
 correctly, the "numpy_type" field of a `Categorical` should be the storage 
type used for the *codes*. It looks like pyarrow is just using 'object' always.

{code}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: import pyarrow.parquet as pq

In [4]: import io

In [5]: import json

In [6]: df = pd.DataFrame({"A": [1, 2]},
   ...:   index=pd.CategoricalIndex(['one', 'two'], name='idx'))
   ...:
In [8]: sink = io.BytesIO()
   ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
   ...: 
json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
   ...:
Out[8]:
{'field_name': '__index_level_0__',
 'metadata': {'num_categories': 2, 'ordered': False},
 'name': 'idx',
 'numpy_type': 'object',
 'pandas_type': 'categorical'}

{code}

From the spec:

> The numpy_type is the physical storage type of the column, which is the 
> result of str(dtype) for the underlying NumPy array that holds the data. So 
> for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> the supported integer categorical types.

So 'numpy_type' field should be something like `'int8'` instead of `'object'`


> Incorrect numpy_type for pandas metadata of Categoricals
> 
>
> Key: ARROW-1897
> URL: https://issues.apache.org/jira/browse/ARROW-1897
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Tom Augspurger
>  Labels: categorical, metadata, pandas, parquet, pyarrow
> Fix For: 0.9.0
>
>
> If I'm reading 
> http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
>  correctly, the "numpy_type" field of a `Categorical` should be the storage 
> type used for the *codes*. It looks like pyarrow is just using 'object' 
> always.
> {code}
> In [1]: import pandas as pd
> In [2]: import pyarrow as pa
> In [3]: import pyarrow.parquet as pq
> In [4]: import io
> In [5]: import json
> In [6]: df = pd.DataFrame({"A": [1, 2]},
>    ...:   index=pd.CategoricalIndex(['one', 'two'], name='idx'))
>    ...:
> In [8]: sink = io.BytesIO()
>    ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
>    ...: json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
>    ...:
> Out[8]:
> {'field_name': '__index_level_0__',
>  'metadata': {'num_categories': 2, 'ordered': False},
>  'name': 'idx',
>  'numpy_type': 'object',
>  'pandas_type': 'categorical'}
> {code}
> From the spec:
> bq. The numpy_type is the physical storage type of the column, which is the 
> result of str(dtype) for the underlying NumPy array that holds the data. So 
> for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> the supported integer categorical types.
> So 'numpy_type' field should be something like `'int8'` instead of `'object'`





[jira] [Assigned] (ARROW-1897) Incorrect numpy_type for pandas metadata of Categoricals

2017-12-07 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-1897:


Assignee: Phillip Cloud

> Incorrect numpy_type for pandas metadata of Categoricals
> 
>
> Key: ARROW-1897
> URL: https://issues.apache.org/jira/browse/ARROW-1897
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Tom Augspurger
>Assignee: Phillip Cloud
>  Labels: categorical, metadata, pandas, parquet, pyarrow
> Fix For: 0.9.0
>
>
> If I'm reading 
> http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
>  correctly, the "numpy_type" field of a `Categorical` should be the storage 
> type used for the *codes*. It looks like pyarrow is just using 'object' 
> always.
> {code}
> In [1]: import pandas as pd
> In [2]: import pyarrow as pa
> In [3]: import pyarrow.parquet as pq
> In [4]: import io
> In [5]: import json
> In [6]: df = pd.DataFrame({"A": [1, 2]},
>    ...:   index=pd.CategoricalIndex(['one', 'two'], name='idx'))
>    ...:
> In [8]: sink = io.BytesIO()
>    ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
>    ...: json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
>    ...:
> Out[8]:
> {'field_name': '__index_level_0__',
>  'metadata': {'num_categories': 2, 'ordered': False},
>  'name': 'idx',
>  'numpy_type': 'object',
>  'pandas_type': 'categorical'}
> {code}
> From the spec:
> bq. The numpy_type is the physical storage type of the column, which is the 
> result of str(dtype) for the underlying NumPy array that holds the data. So 
> for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> the supported integer categorical types.
> So 'numpy_type' field should be something like `'int8'` instead of `'object'`





[jira] [Commented] (ARROW-1897) Incorrect numpy_type for pandas metadata of Categoricals

2017-12-07 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281861#comment-16281861
 ] 

Phillip Cloud commented on ARROW-1897:
--

I think we can get this in for 0.8.0. I want to avoid another backwards-compat 
issue, so it's best to take care of as many of these as we can.

> Incorrect numpy_type for pandas metadata of Categoricals
> 
>
> Key: ARROW-1897
> URL: https://issues.apache.org/jira/browse/ARROW-1897
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Tom Augspurger
>Assignee: Phillip Cloud
>  Labels: categorical, metadata, pandas, parquet, pyarrow
> Fix For: 0.9.0
>
>
> If I'm reading 
> http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
>  correctly, the "numpy_type" field of a `Categorical` should be the storage 
> type used for the *codes*. It looks like pyarrow is just using 'object' 
> always.
> {code}
> In [1]: import pandas as pd
> In [2]: import pyarrow as pa
> In [3]: import pyarrow.parquet as pq
> In [4]: import io
> In [5]: import json
> In [6]: df = pd.DataFrame({"A": [1, 2]},
>    ...:   index=pd.CategoricalIndex(['one', 'two'], name='idx'))
>    ...:
> In [8]: sink = io.BytesIO()
>    ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
>    ...: json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
>    ...:
> Out[8]:
> {'field_name': '__index_level_0__',
>  'metadata': {'num_categories': 2, 'ordered': False},
>  'name': 'idx',
>  'numpy_type': 'object',
>  'pandas_type': 'categorical'}
> {code}
> From the spec:
> bq. The numpy_type is the physical storage type of the column, which is the 
> result of str(dtype) for the underlying NumPy array that holds the data. So 
> for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> the supported integer categorical types.
> So 'numpy_type' field should be something like `'int8'` instead of `'object'`





[jira] [Commented] (ARROW-1895) [Python] Add field_name to pandas index metadata

2017-12-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281933#comment-16281933
 ] 

ASF GitHub Bot commented on ARROW-1895:
---

jorisvandenbossche commented on a change in pull request #1397: ARROW-1895: 
[Python] Add field_name to pandas index metadata
URL: https://github.com/apache/arrow/pull/1397#discussion_r155539165
 
 

 ##
 File path: python/pyarrow/tests/test_convert_pandas.py
 ##
 @@ -160,9 +160,35 @@ def test_integer_index_column(self):
         df = pd.DataFrame([(1, 'a'), (2, 'b'), (3, 'c')])
         _check_pandas_roundtrip(df, preserve_index=True)
 
+    def test_index_metadata_field_name(self):
+        df = pd.DataFrame(
+            [(1, 'a', 3.1), (2, 'b', 2.2), (3, 'c', 1.3)],
+            index=pd.MultiIndex.from_arrays(
+                [['c', 'b', 'a'], [3, 2, 1]],
+                names=[None, 'foo']
+            )
+        ).rename(columns=dict(zip(range(3), ['a', None, 'c'])))
+        t = pa.Table.from_pandas(df, preserve_index=True)
+        raw_metadata = t.schema.metadata
+
+        js = json.loads(raw_metadata[b'pandas'].decode('utf8'))
+
+        col1, col2, col3, idx0, foo = js['columns']
+
+        assert col1['name'] == col1['field_name']
+        assert col2['name'] is None
+        assert col2['field_name'] is None
 
 Review comment:
   Yes, the current code works without it (just as it worked before with a 
"name" of None), since that case is handled by `table_to_blockmanager`. 
   But that means you will always have to special-case this option; for me, 
the point of "field_name" is that ``schema.get_field_index(field_name)`` is 
guaranteed not to error (so you don't have to special-case None).
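
A short sketch of that guarantee (assuming a pyarrow build that already
includes this patch, so every column descriptor carries 'field_name'):

{code}
import json
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': [1, 2]})  # unnamed index, so its 'name' is None
table = pa.Table.from_pandas(df, preserve_index=True)
meta = json.loads(table.schema.metadata[b'pandas'].decode('utf8'))
for col in meta['columns']:
    # 'field_name' is always a usable lookup key, even when 'name' is None,
    # so no special-casing is needed.
    print(col['name'], table.schema.get_field_index(col['field_name']))
{code}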




> [Python] Add field_name to pandas index metadata
> 
>
> Key: ARROW-1895
> URL: https://issues.apache.org/jira/browse/ARROW-1895
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> See the discussion here for details:
> https://github.com/pandas-dev/pandas/pull/18201
> In short we need a way to map index column names to field names in an arrow 
> Table.
> Additionally, we're depending on the index columns being written at the end 
> of the table and fixing this would allow us to read metadata written by other 
> systems (e.g., fastparquet) that don't make this assumption.





[jira] [Commented] (ARROW-1895) [Python] Add field_name to pandas index metadata

2017-12-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281951#comment-16281951
 ] 

ASF GitHub Bot commented on ARROW-1895:
---

cpcloud commented on a change in pull request #1397: ARROW-1895: [Python] Add 
field_name to pandas index metadata
URL: https://github.com/apache/arrow/pull/1397#discussion_r155542129
 
 

 ##
 File path: python/pyarrow/tests/test_convert_pandas.py
 ##
 @@ -160,9 +160,35 @@ def test_integer_index_column(self):
         df = pd.DataFrame([(1, 'a'), (2, 'b'), (3, 'c')])
         _check_pandas_roundtrip(df, preserve_index=True)
 
+    def test_index_metadata_field_name(self):
+        df = pd.DataFrame(
+            [(1, 'a', 3.1), (2, 'b', 2.2), (3, 'c', 1.3)],
+            index=pd.MultiIndex.from_arrays(
+                [['c', 'b', 'a'], [3, 2, 1]],
+                names=[None, 'foo']
+            )
+        ).rename(columns=dict(zip(range(3), ['a', None, 'c'])))
+        t = pa.Table.from_pandas(df, preserve_index=True)
+        raw_metadata = t.schema.metadata
+
+        js = json.loads(raw_metadata[b'pandas'].decode('utf8'))
+
+        col1, col2, col3, idx0, foo = js['columns']
+
+        assert col1['name'] == col1['field_name']
+        assert col2['name'] is None
+        assert col2['field_name'] is None
 
 Review comment:
   Ok, I've implemented this. Pushing it up now.




> [Python] Add field_name to pandas index metadata
> 
>
> Key: ARROW-1895
> URL: https://issues.apache.org/jira/browse/ARROW-1895
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> See the discussion here for details:
> https://github.com/pandas-dev/pandas/pull/18201
> In short we need a way to map index column names to field names in an arrow 
> Table.
> Additionally, we're depending on the index columns being written at the end 
> of the table and fixing this would allow us to read metadata written by other 
> systems (e.g., fastparquet) that don't make this assumption.





[jira] [Commented] (ARROW-1891) [Python] NaT date32 values are only converted to nulls if from_pandas is used

2017-12-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281973#comment-16281973
 ] 

ASF GitHub Bot commented on ARROW-1891:
---

wesm closed pull request #1399: ARROW-1891: [Python] Always use NumPy NaT 
sentinels to mark nulls when converting to array
URL: https://github.com/apache/arrow/pull/1399
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/python/numpy_to_arrow.cc b/cpp/src/arrow/python/numpy_to_arrow.cc
index 798822c1b..f21b40ed3 100644
--- a/cpp/src/arrow/python/numpy_to_arrow.cc
+++ b/cpp/src/arrow/python/numpy_to_arrow.cc
@@ -421,7 +421,11 @@ class NumPyConverter {
   using traits = internal::arrow_traits<ArrowType::type_id>;
 
   const bool null_sentinels_possible =
-      (use_pandas_null_sentinels_ && traits::supports_nulls);
+      // NumPy has a NaT type
+      (ArrowType::type_id == Type::TIMESTAMP || ArrowType::type_id == Type::DATE32) ||
+
+      // Observing pandas's null sentinels
+      ((use_pandas_null_sentinels_ && traits::supports_nulls));
 
   if (mask_ != nullptr || null_sentinels_possible) {
     RETURN_NOT_OK(InitNullBitmap());
@@ -631,8 +635,6 @@ inline Status NumPyConverter::ConvertData(std::shared_ptr<Buffer>* d
 
   auto date_dtype = reinterpret_cast<PyArray_DatetimeDTypeMetaData*>(dtype_->c_metadata);
   if (dtype_->type_num == NPY_DATETIME) {
-    const int64_t null_count = ValuesToBitmap<NPY_DATETIME>(arr_, null_bitmap_data_);
-
     // If we have inbound datetime64[D] data, this needs to be downcasted
     // separately here from int64_t to int32_t, because this data is not
     // supported in compute::Cast
@@ -642,6 +644,9 @@ inline Status NumPyConverter::ConvertData(std::shared_ptr<Buffer>* d
       Status s = StaticCastBuffer<int64_t, int32_t>(**data, length_, pool_, data);
       RETURN_NOT_OK(s);
     } else {
+      // TODO(wesm): This is redundant, and recomputed in VisitNative()
+      const int64_t null_count = ValuesToBitmap<NPY_DATETIME>(arr_, null_bitmap_data_);
+
       RETURN_NOT_OK(NumPyDtypeToArrow(reinterpret_cast<PyObject*>(dtype_), &input_type));
       if (!input_type->Equals(*type_)) {
         RETURN_NOT_OK(CastBuffer(input_type, *data, length_, null_bitmap_, null_count,
diff --git a/python/pyarrow/tests/test_array.py b/python/pyarrow/tests/test_array.py
index a4d781a33..92562da14 100644
--- a/python/pyarrow/tests/test_array.py
+++ b/python/pyarrow/tests/test_array.py
@@ -495,6 +495,14 @@ def test_array_conversions_no_sentinel_values():
     assert arr3.null_count == 0
 
 
+def test_array_from_numpy_datetimeD():
+    arr = np.array([None, datetime.date(2017, 4, 4)], dtype='datetime64[D]')
+
+    result = pa.array(arr)
+    expected = pa.array([None, datetime.date(2017, 4, 4)], type=pa.date32())
+    assert result.equals(expected)
+
+
 def test_array_from_numpy_ascii():
     arr = np.array(['abcde', 'abc', ''], dtype='|S5')
 


 




> [Python] NaT date32 values are only converted to nulls if from_pandas is used
> -
>
> Key: ARROW-1891
> URL: https://issues.apache.org/jira/browse/ARROW-1891
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> {code}
> expected = np.array([None, date(2017, 4, 4)], dtype='datetime64[D]')
> pa.array(expected, from_pandas=True) -> [null, 2017-4-4]
> pa.array(expected) -> [1970-1-1, 2017-4-4]
> {code}
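
Spelled out as a runnable snippet (imports added; the second output line is the
pre-fix behavior described above):

{code}
import datetime
import numpy as np
import pyarrow as pa

arr = np.array([None, datetime.date(2017, 4, 4)], dtype='datetime64[D]')
# NumPy stores the None as NaT. Before this fix, only the from_pandas path
# treated NaT as null; the plain path surfaced it as the epoch date.
print(pa.array(arr, from_pandas=True))  # [null, 2017-04-04]
print(pa.array(arr))                    # pre-fix: [1970-01-01, 2017-04-04]
{code}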





[jira] [Commented] (ARROW-1891) [Python] NaT date32 values are only converted to nulls if from_pandas is used

2017-12-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281971#comment-16281971
 ] 

ASF GitHub Bot commented on ARROW-1891:
---

wesm commented on issue #1399: ARROW-1891: [Python] Always use NumPy NaT 
sentinels to mark nulls when converting to array
URL: https://github.com/apache/arrow/pull/1399#issuecomment-349993014
 
 
   +1, appveyor build: https://ci.appveyor.com/project/wesm/arrow/build/1.0.1561




> [Python] NaT date32 values are only converted to nulls if from_pandas is used
> -
>
> Key: ARROW-1891
> URL: https://issues.apache.org/jira/browse/ARROW-1891
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> {code}
> expected = np.array([None, date(2017, 4, 4)], dtype='datetime64[D]')
> pa.array(expected, from_pandas=True) -> [null, 2017-4-4]
> pa.array(expected) -> [1970-1-1, 2017-4-4]
> {code}





[jira] [Resolved] (ARROW-1891) [Python] NaT date32 values are only converted to nulls if from_pandas is used

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1891.
-
Resolution: Fixed

Issue resolved by pull request 1399
[https://github.com/apache/arrow/pull/1399]

> [Python] NaT date32 values are only converted to nulls if from_pandas is used
> -
>
> Key: ARROW-1891
> URL: https://issues.apache.org/jira/browse/ARROW-1891
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> {code}
> expected = np.array([None, date(2017, 4, 4)], dtype='datetime64[D]')
> pa.array(expected, from_pandas=True) -> [null, 2017-4-4]
> pa.array(expected) -> [1970-1-1, 2017-4-4]
> {code}





[jira] [Updated] (ARROW-1897) Incorrect numpy_type for pandas metadata of Categoricals

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1897:

Fix Version/s: (was: 0.9.0)
   0.8.0

> Incorrect numpy_type for pandas metadata of Categoricals
> 
>
> Key: ARROW-1897
> URL: https://issues.apache.org/jira/browse/ARROW-1897
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Tom Augspurger
>Assignee: Phillip Cloud
>  Labels: categorical, metadata, pandas, parquet, pyarrow
> Fix For: 0.8.0
>
>
> If I'm reading 
> http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
>  correctly, the "numpy_type" field of a `Categorical` should be the storage 
> type used for the *codes*. It looks like pyarrow is just using 'object' 
> always.
> {code}
> In [1]: import pandas as pd
> In [2]: import pyarrow as pa
> In [3]: import pyarrow.parquet as pq
> In [4]: import io
> In [5]: import json
> In [6]: df = pd.DataFrame({"A": [1, 2]},
>    ...:   index=pd.CategoricalIndex(['one', 'two'], name='idx'))
>    ...:
> In [8]: sink = io.BytesIO()
>    ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
>    ...: json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
>    ...:
> Out[8]:
> {'field_name': '__index_level_0__',
>  'metadata': {'num_categories': 2, 'ordered': False},
>  'name': 'idx',
>  'numpy_type': 'object',
>  'pandas_type': 'categorical'}
> {code}
> From the spec:
> bq. The numpy_type is the physical storage type of the column, which is the 
> result of str(dtype) for the underlying NumPy array that holds the data. So 
> for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> the supported integer categorical types.
> So 'numpy_type' field should be something like `'int8'` instead of `'object'`





[jira] [Commented] (ARROW-1895) [Python] Add field_name to pandas index metadata

2017-12-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282055#comment-16282055
 ] 

ASF GitHub Bot commented on ARROW-1895:
---

jorisvandenbossche commented on a change in pull request #1397: ARROW-1895: 
[Python] Add field_name to pandas index metadata
URL: https://github.com/apache/arrow/pull/1397#discussion_r155562651
 
 

 ##
 File path: python/pyarrow/tests/test_convert_pandas.py
 ##
 @@ -160,9 +160,40 @@ def test_integer_index_column(self):
         df = pd.DataFrame([(1, 'a'), (2, 'b'), (3, 'c')])
         _check_pandas_roundtrip(df, preserve_index=True)
 
+    def test_index_metadata_field_name(self):
+        # test None case, and strangely named non-index columns
+        df = pd.DataFrame(
+            [(1, 'a', 3.1), (2, 'b', 2.2), (3, 'c', 1.3)],
+            index=pd.MultiIndex.from_arrays(
+                [['c', 'b', 'a'], [3, 2, 1]],
+                names=[None, 'foo']
+            )
+        ).rename(columns=dict(zip(range(3), ['a', None, '__index_level_0__'])))
 
 Review comment:
   not that important, but doing `columns= ['a', None, '__index_level_0__']` 
inside the `DataFrame` call is a bit simpler




> [Python] Add field_name to pandas index metadata
> 
>
> Key: ARROW-1895
> URL: https://issues.apache.org/jira/browse/ARROW-1895
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> See the discussion here for details:
> https://github.com/pandas-dev/pandas/pull/18201
> In short we need a way to map index column names to field names in an arrow 
> Table.
> Additionally, we're depending on the index columns being written at the end 
> of the table and fixing this would allow us to read metadata written by other 
> systems (e.g., fastparquet) that don't make this assumption.





[jira] [Commented] (ARROW-1893) [Python] test_primitive_serialization fails on Python 2.7.3

2017-12-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282056#comment-16282056
 ] 

ASF GitHub Bot commented on ARROW-1893:
---

wesm commented on issue #1398: ARROW-1893: [Python] Convert memoryview to bytes 
when loading from pickle in Python 2.7
URL: https://github.com/apache/arrow/pull/1398#issuecomment-350011614
 
 
   +1




> [Python] test_primitive_serialization fails on Python 2.7.3
> ---
>
> Key: ARROW-1893
> URL: https://issues.apache.org/jira/browse/ARROW-1893
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Uwe L. Korn
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> {{test_primitive_serialization}} fails on Python 2.7.3 with the following 
> error:
> {code}
> str = <memory at 0x...>
> 
>     def loads(str):
> >       file = StringIO(str)
> E       TypeError: expected read buffer, memoryview found
> {code}
> More context:
> {code}
>     def test_primitive_serialization(large_memory_map):
>         with pa.memory_map(large_memory_map, mode="r+") as mmap:
>             for obj in PRIMITIVE_OBJECTS:
>                 serialization_roundtrip(obj, mmap)
> >               serialization_roundtrip(obj, mmap, pa.pandas_serialization_context)
> {code}
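
A minimal sketch of the failing call and the workaround the fix adopts
(copying the memoryview to bytes before unpickling, which works on both
Python 2 and 3):

{code}
import pickle

payload = pickle.dumps({'a': 1}, protocol=pickle.HIGHEST_PROTOCOL)
view = memoryview(payload)
# Python 2's pickle.loads() wants a read buffer (str/bytes); handing it a
# memoryview raises "TypeError: expected read buffer, memoryview found".
obj = pickle.loads(view.tobytes())
{code}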





[jira] [Resolved] (ARROW-1893) [Python] test_primitive_serialization fails on Python 2.7.3

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1893.
-
Resolution: Fixed

Issue resolved by pull request 1398
[https://github.com/apache/arrow/pull/1398]

> [Python] test_primitive_serialization fails on Python 2.7.3
> ---
>
> Key: ARROW-1893
> URL: https://issues.apache.org/jira/browse/ARROW-1893
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Uwe L. Korn
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> {{test_primitive_serialization}} fails on Python 2.7.3 with the following 
> error:
> {code}
> str = <memory at 0x...>
> 
>     def loads(str):
> >       file = StringIO(str)
> E       TypeError: expected read buffer, memoryview found
> {code}
> More context:
> {code}
>     def test_primitive_serialization(large_memory_map):
>         with pa.memory_map(large_memory_map, mode="r+") as mmap:
>             for obj in PRIMITIVE_OBJECTS:
>                 serialization_roundtrip(obj, mmap)
> >               serialization_roundtrip(obj, mmap, pa.pandas_serialization_context)
> {code}





[jira] [Commented] (ARROW-1893) [Python] test_primitive_serialization fails on Python 2.7.3

2017-12-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282058#comment-16282058
 ] 

ASF GitHub Bot commented on ARROW-1893:
---

wesm closed pull request #1398: ARROW-1893: [Python] Convert memoryview to 
bytes when loading from pickle in Python 2.7
URL: https://github.com/apache/arrow/pull/1398
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/python/pyarrow/compat.py b/python/pyarrow/compat.py
index 866cbdd96..1b19ca0e4 100644
--- a/python/pyarrow/compat.py
+++ b/python/pyarrow/compat.py
@@ -70,7 +70,7 @@ class Categorical(ClassPlaceholder):
 
 
 if PY2:
-    import cPickle
+    import cPickle as builtin_pickle
 
     try:
         from cdecimal import Decimal
@@ -107,6 +107,8 @@ def frombytes(o):
     def unichar(s):
         return unichr(s)
 else:
+    import pickle as builtin_pickle
+
     unicode_type = str
     def lzip(*x):
         return list(zip(*x))
diff --git a/python/pyarrow/serialization.py b/python/pyarrow/serialization.py
index b6d2b0258..3059dfc1b 100644
--- a/python/pyarrow/serialization.py
+++ b/python/pyarrow/serialization.py
@@ -16,18 +16,19 @@
 # under the License.
 
 from collections import OrderedDict, defaultdict
+import six
 import sys
-import pickle
 
 import numpy as np
 
 from pyarrow import serialize_pandas, deserialize_pandas
+from pyarrow.compat import builtin_pickle
 from pyarrow.lib import _default_serialization_context, frombuffer
 
 try:
     import cloudpickle
 except ImportError:
-    cloudpickle = pickle
+    cloudpickle = builtin_pickle
 
 
 # --
@@ -44,12 +45,16 @@ def _deserialize_numpy_array_list(data):
 
 
 def _pickle_to_buffer(x):
-    pickled = pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
+    pickled = builtin_pickle.dumps(x, protocol=builtin_pickle.HIGHEST_PROTOCOL)
     return frombuffer(pickled)
 
 
 def _load_pickle_from_buffer(data):
-    return pickle.loads(memoryview(data))
+    as_memoryview = memoryview(data)
+    if six.PY2:
+        return builtin_pickle.loads(as_memoryview.tobytes())
+    else:
+        return builtin_pickle.loads(as_memoryview)
 
 
 _serialize_numpy_array_pickle = _pickle_to_buffer
diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py
index d17d89e24..2543e7d17 100644
--- a/python/pyarrow/tests/test_parquet.py
+++ b/python/pyarrow/tests/test_parquet.py
@@ -1570,19 +1570,21 @@ def test_backwards_compatible_index_multi_level_some_named():
     tm.assert_frame_equal(result, expected)
 
 
-@pytest.mark.parametrize('precision', range(1, 39))
-def test_decimal_roundtrip(tmpdir, precision):
+def test_decimal_roundtrip(tmpdir):
     num_values = 10
 
     columns = {}
 
-    for scale in range(0, precision + 1):
-        with util.random_seed(0):
-            random_decimal_values = [
-                util.randdecimal(precision, scale) for _ in range(num_values)
-            ]
-        column_name = 'dec_precision_{:d}_scale_{:d}'.format(precision, scale)
-        columns[column_name] = random_decimal_values
+    for precision in range(1, 39):
+        for scale in range(0, precision + 1):
+            with util.random_seed(0):
+                random_decimal_values = [
+                    util.randdecimal(precision, scale)
+                    for _ in range(num_values)
+                ]
+            column_name = ('dec_precision_{:d}_scale_{:d}'
+                           .format(precision, scale))
+            columns[column_name] = random_decimal_values
 
     expected = pd.DataFrame(columns)
     filename = tmpdir.join('decimals.parquet')


 




> [Python] test_primitive_serialization fails on Python 2.7.3
> ---
>
> Key: ARROW-1893
> URL: https://issues.apache.org/jira/browse/ARROW-1893
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Uwe L. Korn
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> {{test_primitive_serialization}} fails on Python 2.7.3 with the following 
> error:
> {code}
> str = <memory at 0x...>
> 
>     def loads(str):
> >       file = StringIO(str)
> E       TypeError: expected read buffer, memoryview found
> {code}
> More context:
> {code}
>     def test_primitive_serialization(large_memory_map):
>         with pa.memory_map(large_memory_map, mode="r+") as mmap:
>             for obj in PRIMITIVE_OBJECTS:
>                 serialization_roundtrip(obj, mmap)
> >               serialization_roundtrip(obj, mmap, pa.pandas_serialization_context)
> {code}

[jira] [Commented] (ARROW-1895) [Python] Add field_name to pandas index metadata

2017-12-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282063#comment-16282063
 ] 

ASF GitHub Bot commented on ARROW-1895:
---

cpcloud commented on a change in pull request #1397: ARROW-1895: [Python] Add 
field_name to pandas index metadata
URL: https://github.com/apache/arrow/pull/1397#discussion_r155563908
 
 

 ##
 File path: python/pyarrow/tests/test_convert_pandas.py
 ##
 @@ -160,9 +160,40 @@ def test_integer_index_column(self):
         df = pd.DataFrame([(1, 'a'), (2, 'b'), (3, 'c')])
         _check_pandas_roundtrip(df, preserve_index=True)
 
+    def test_index_metadata_field_name(self):
+        # test None case, and strangely named non-index columns
+        df = pd.DataFrame(
+            [(1, 'a', 3.1), (2, 'b', 2.2), (3, 'c', 1.3)],
+            index=pd.MultiIndex.from_arrays(
+                [['c', 'b', 'a'], [3, 2, 1]],
+                names=[None, 'foo']
+            )
+        ).rename(columns=dict(zip(range(3), ['a', None, '__index_level_0__'])))
 
 Review comment:
   Yep, thank you. That is much better.




> [Python] Add field_name to pandas index metadata
> 
>
> Key: ARROW-1895
> URL: https://issues.apache.org/jira/browse/ARROW-1895
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> See the discussion here for details:
> https://github.com/pandas-dev/pandas/pull/18201
> In short we need a way to map index column names to field names in an arrow 
> Table.
> Additionally, we're depending on the index columns being written at the end 
> of the table and fixing this would allow us to read metadata written by other 
> systems (e.g., fastparquet) that don't make this assumption.





[jira] [Updated] (ARROW-1897) [Python] Incorrect numpy_type for pandas metadata of Categoricals

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1897:

Summary: [Python] Incorrect numpy_type for pandas metadata of Categoricals  
(was: Incorrect numpy_type for pandas metadata of Categoricals)

> [Python] Incorrect numpy_type for pandas metadata of Categoricals
> -
>
> Key: ARROW-1897
> URL: https://issues.apache.org/jira/browse/ARROW-1897
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Tom Augspurger
>Assignee: Phillip Cloud
>  Labels: categorical, metadata, pandas, parquet, pyarrow
> Fix For: 0.8.0
>
>
> If I'm reading 
> http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
>  correctly, the "numpy_type" field of a `Categorical` should be the storage 
> type used for the *codes*. It looks like pyarrow is just using 'object' 
> always.
> {code}
> In [1]: import pandas as pd
> In [2]: import pyarrow as pa
> In [3]: import pyarrow.parquet as pq
> In [4]: import io
> In [5]: import json
> In [6]: df = pd.DataFrame({"A": [1, 2]},
>    ...:   index=pd.CategoricalIndex(['one', 'two'], name='idx'))
>    ...:
> In [8]: sink = io.BytesIO()
>    ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
>    ...: json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
>    ...:
> Out[8]:
> {'field_name': '__index_level_0__',
>  'metadata': {'num_categories': 2, 'ordered': False},
>  'name': 'idx',
>  'numpy_type': 'object',
>  'pandas_type': 'categorical'}
> {code}
> From the spec:
> bq. The numpy_type is the physical storage type of the column, which is the 
> result of str(dtype) for the underlying NumPy array that holds the data. So 
> for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> the supported integer categorical types.
> So 'numpy_type' field should be something like `'int8'` instead of `'object'`





[jira] [Created] (ARROW-1898) [JS] Update Flatbuffers per metadata changes in ARROW-1785

2017-12-07 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1898:
---

 Summary: [JS] Update Flatbuffers per metadata changes in ARROW-1785
 Key: ARROW-1898
 URL: https://issues.apache.org/jira/browse/ARROW-1898
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Wes McKinney
 Fix For: 0.8.0


[~paul.e.taylor] or [~bhulette] can you take a look at this? We should also 
remove the V3 metadata backwards compatibility hack as part of this. We can 
release JS again after the main 0.8.0 release is out





[jira] [Created] (ARROW-1899) [Python] Refactor handling of null sentinels in python/numpy_to_arrow.cc

2017-12-07 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1899:
---

 Summary: [Python] Refactor handling of null sentinels in 
python/numpy_to_arrow.cc
 Key: ARROW-1899
 URL: https://issues.apache.org/jira/browse/ARROW-1899
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.9.0


See comments in 
https://github.com/apache/arrow/commit/ad30138a0ec9be3dfb179d1e9425a4502d556085 





[jira] [Updated] (ARROW-1894) [Python] Treat CPython memoryview or buffer objects equivalently to pyarrow.Buffer in pyarrow.serialize

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1894:

Fix Version/s: 0.9.0

> [Python] Treat CPython memoryview or buffer objects equivalently to 
> pyarrow.Buffer in pyarrow.serialize
> ---
>
> Key: ARROW-1894
> URL: https://issues.apache.org/jira/browse/ARROW-1894
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> These should be treated as Buffer-like on serialize. We should consider how 
> to "box" the buffers as the appropriate kind of object (Buffer, memoryview, 
> etc.) when being deserialized





[jira] [Updated] (ARROW-1875) Write 64-bit ints as strings in integration test JSON files

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1875:

Fix Version/s: 0.9.0

> Write 64-bit ints as strings in integration test JSON files
> ---
>
> Key: ARROW-1875
> URL: https://issues.apache.org/jira/browse/ARROW-1875
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Brian Hulette
>Priority: Minor
> Fix For: 0.9.0
>
>
> Javascript can't handle 64-bit integers natively, so writing them as strings 
> in the JSON would make implementing the integration tests a lot simpler.
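
The motivation in one line: JavaScript numbers are IEEE-754 doubles, and a
double cannot represent every 64-bit integer, while a string round-trips
losslessly (a quick check, here in Python):

{code}
big = 2**53 + 1  # the first integer a double cannot represent exactly
print(float(big) == float(big - 1))  # True: both collapse to 9007199254740992.0
print(str(big))                      # '9007199254740993' survives as a string
{code}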





[jira] [Updated] (ARROW-1870) [JS] Enable build scripts to work with NodeJS 6.10.2 LTS

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1870:

Fix Version/s: 0.9.0

> [JS] Enable build scripts to work with NodeJS 6.10.2 LTS
> 
>
> Key: ARROW-1870
> URL: https://issues.apache.org/jira/browse/ARROW-1870
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>






[jira] [Updated] (ARROW-1886) [Python] Add function to "flatten" structs within tables

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1886:

Fix Version/s: 0.9.0

> [Python] Add function to "flatten" structs within tables
> 
>
> Key: ARROW-1886
> URL: https://issues.apache.org/jira/browse/ARROW-1886
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> See discussion in https://issues.apache.org/jira/browse/ARROW-1873
> When a user has a struct column, it may be more efficient to flatten the 
> struct into multiple columns of the form {{struct_name.field_name}} for each 
> field in the struct. Then when you call {{to_pandas}}, Python dictionaries do 
> not have to be created, and the conversion will be much more efficient
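
A sketch of the proposed behavior, done by hand with today's pyarrow API
(newer than this thread; the helper itself is hypothetical, and the
"struct_name.field_name" naming is from the description):

{code}
import pyarrow as pa

def flatten_struct(arr, prefix):
    # One top-level column per struct field, named "struct_name.field_name".
    n = arr.type.num_fields
    names = ['{}.{}'.format(prefix, arr.type.field(i).name) for i in range(n)]
    return pa.table([arr.field(i) for i in range(n)], names=names)

struct = pa.array([{'x': 1, 'y': 'a'}, {'x': 2, 'y': 'b'}])
print(flatten_struct(struct, 's').column_names)  # ['s.x', 's.y']
{code}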





[jira] [Updated] (ARROW-1848) [Python] Add documentation examples for reading single Parquet files and datasets from HDFS

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1848:

Fix Version/s: 0.9.0

> [Python] Add documentation examples for reading single Parquet files and 
> datasets from HDFS
> ---
>
> Key: ARROW-1848
> URL: https://issues.apache.org/jira/browse/ARROW-1848
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> see 
> https://stackoverflow.com/questions/47443151/read-a-parquet-files-from-hdfs-using-pyarrow





[jira] [Updated] (ARROW-1861) [Python] Fix up ASV setup, add developer instructions for writing new benchmarks and running benchmark suite locally

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1861:

Fix Version/s: 0.9.0

> [Python] Fix up ASV setup, add developer instructions for writing new 
> benchmarks and running benchmark suite locally
> 
>
> Key: ARROW-1861
> URL: https://issues.apache.org/jira/browse/ARROW-1861
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> We need to start writing more microbenchmarks as we go to prevent 
> unintentional performance regressions (this has been a constant thorn in my 
> side for years: 
> http://wesmckinney.com/blog/introducing-vbench-new-code-performance-analysis-and-monitoring-tool/).
>  





[jira] [Updated] (ARROW-1858) [Python] Add documentation about parquet.write_to_dataset and related methods

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1858:

Fix Version/s: 0.9.0

> [Python] Add documentation about parquet.write_to_dataset and related methods
> -
>
> Key: ARROW-1858
> URL: https://issues.apache.org/jira/browse/ARROW-1858
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> See 
> https://stackoverflow.com/questions/47482434/can-pyarrow-write-multiple-parquet-files-to-a-folder-like-fastparquets-file-sch





[jira] [Updated] (ARROW-1860) [C++] Add data structure to "stage" a sequence of IPC messages from in-memory data

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1860:

Fix Version/s: 0.9.0

> [C++] Add data structure to "stage" a sequence of IPC messages from in-memory 
> data
> --
>
> Key: ARROW-1860
> URL: https://issues.apache.org/jira/browse/ARROW-1860
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> Currently, when you need to pre-allocate space for a record batch or a stream 
> (schema + dictionaries + record batches), you must make multiple passes over 
> the data structures of interest (and use e.g. {{MockOutputStream}} to compute 
> the size of the output buffer). It would be useful to make a single pass to 
> "prepare" the IPC payload for both sizing and writing to prevent having to 
> make multiple passes





[jira] [Updated] (ARROW-1744) [Plasma] Provide TensorFlow operator to read tensors from plasma

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1744:

Fix Version/s: 0.9.0

> [Plasma] Provide TensorFlow operator to read tensors from plasma
> 
>
> Key: ARROW-1744
> URL: https://issues.apache.org/jira/browse/ARROW-1744
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> see https://www.tensorflow.org/extend/adding_an_op





[jira] [Updated] (ARROW-1774) [C++] Add "view" function to create zero-copy views for compatible types, if supported

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1774:

Fix Version/s: 0.9.0

> [C++] Add "view" function to create zero-copy views for compatible types, if 
> supported
> --
>
> Key: ARROW-1774
> URL: https://issues.apache.org/jira/browse/ARROW-1774
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> Similar to NumPy's {{ndarray.view}}, but with the restriction that the input 
> and output types have the same physical Arrow memory layout. This might be as 
> simple as adding a "zero copy only" option to the existing {{Cast}} kernel
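
For reference, NumPy's {{ndarray.view}} cited above: the same buffer
reinterpreted under a layout-compatible type, with no copy:

{code}
import numpy as np

a = np.arange(4, dtype='int32')
b = a.view('uint32')  # same physical memory, reinterpreted as unsigned
assert b.base is a    # b shares a's buffer; nothing was copied
{code}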





[jira] [Updated] (ARROW-1731) [Python] Provide for selecting a subset of columns to convert in RecordBatch/Table.from_pandas

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1731:

Fix Version/s: 0.9.0

> [Python] Provide for selecting a subset of columns to convert in 
> RecordBatch/Table.from_pandas
> --
>
> Key: ARROW-1731
> URL: https://issues.apache.org/jira/browse/ARROW-1731
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> Currently it's all-or-nothing, and doing the subsetting in pandas incurs a 
> data copy. This would enable columns (by name or index) to be selected out 
> without additional data copying.
> cc [~cpcloud] [~jreback]
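
A sketch of the difference (the `columns` argument below is the hypothetical
proposed API, not something from_pandas accepts in this release):

{code}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})

# Today: subsetting first in pandas materializes an intermediate copy.
table = pa.Table.from_pandas(df[['a']])

# Proposed: select during conversion, with no intermediate copy, e.g.
# table = pa.Table.from_pandas(df, columns=['a'])
{code}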





[jira] [Updated] (ARROW-1722) [C++] Add linting script to look for C++/CLI issues

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1722:

Fix Version/s: 0.9.0

> [C++] Add linting script to look for C++/CLI issues
> ---
>
> Key: ARROW-1722
> URL: https://issues.apache.org/jira/browse/ARROW-1722
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> This includes:
> * Using {{nullptr}} in header files (we must instead use an appropriate macro 
> to use {{__nullptr}} when the host compiler is C++/CLI)
> * Including {{<mutex>}} in a public header (e.g. header files without "impl" 
> or "internal" in their name)





[jira] [Updated] (ARROW-1715) [Python] Implement pickling for Array, Column, ChunkedArray, RecordBatch, Table

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1715:

Fix Version/s: 0.9.0

> [Python] Implement pickling for Array, Column, ChunkedArray, RecordBatch, 
> Table
> ---
>
> Key: ARROW-1715
> URL: https://issues.apache.org/jira/browse/ARROW-1715
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>






[jira] [Updated] (ARROW-1712) [C++] Add method to BinaryBuilder to reserve space for value data

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1712:

Fix Version/s: 0.9.0

> [C++] Add method to BinaryBuilder to reserve space for value data
> -
>
> Key: ARROW-1712
> URL: https://issues.apache.org/jira/browse/ARROW-1712
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> The {{Resize}} and {{Reserve}} methods only reserve space for the value 
> offsets. When building binary/string arrays with a known size (or some 
> reasonable estimate), it would be more efficient to reserve once at the 
> beginning to prevent internal reallocations





[jira] [Updated] (ARROW-1706) [Python] StructArray.from_arrays should handle sequences that are coercible to arrays

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1706:

Fix Version/s: 0.9.0

> [Python] StructArray.from_arrays should handle sequences that are coercible 
> to arrays
> -
>
> Key: ARROW-1706
> URL: https://issues.apache.org/jira/browse/ARROW-1706
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> Currently the arrays passed must be `pyarrow.Array` objects already.
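For illustration, current vs. proposed usage (a sketch assuming the {{(arrays, names)}} argument order; the coercing call in the comment is the requested behavior, not the current API):

{code:python}
import pyarrow as pa

# Current: inputs must already be pyarrow.Array objects.
arr = pa.StructArray.from_arrays(
    [pa.array([1, 2]), pa.array(["x", "y"])], ["a", "b"])

# Proposed: plain sequences are coerced via pa.array() internally, e.g.
# pa.StructArray.from_arrays([[1, 2], ["x", "y"]], ["a", "b"])
{code}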



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1669) [C++] Consider adding Abseil (Google C++11 standard library extensions) to toolchain

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1669:

Fix Version/s: 0.9.0

> [C++] Consider adding Abseil (Google C++11 standard library extensions) to 
> toolchain
> 
>
> Key: ARROW-1669
> URL: https://issues.apache.org/jira/browse/ARROW-1669
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> Google has released a library of C++11-compliant extensions to the STL that 
> may help make a lot of Arrow code simpler:
> https://github.com/abseil/abseil-cpp/
> This code is not header-only and so would require some effort to add to the 
> toolchain at the moment since it only supports the Bazel build system



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1696) [C++] Add codec benchmarks

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1696:

Fix Version/s: 0.9.0

> [C++] Add codec benchmarks
> --
>
> Key: ARROW-1696
> URL: https://issues.apache.org/jira/browse/ARROW-1696
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> This will also help users validate in release builds that the compression 
> libraries have been built with the appropriate optimization levels, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1692) [Python, Java] UnionArray round trip not working

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1692:

Fix Version/s: 0.9.0

> [Python, Java] UnionArray round trip not working
> 
>
> Key: ARROW-1692
> URL: https://issues.apache.org/jira/browse/ARROW-1692
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Philipp Moritz
> Fix For: 0.9.0
>
> Attachments: union_array.arrow
>
>
> I'm currently working on making pyarrow.serialization data available from the 
> Java side. One problem I ran into is that the Java 
> implementation seemingly cannot read UnionArrays generated from C++. To make this 
> easily reproducible I created a clean Python implementation for creating 
> UnionArrays: https://github.com/apache/arrow/pull/1216
> The data is generated with the following script:
> {code}
> import pyarrow as pa
> binary = pa.array([b'a', b'b', b'c', b'd'], type='binary')
> int64 = pa.array([1, 2, 3], type='int64')
> types = pa.array([0, 1, 0, 0, 1, 1, 0], type='int8')
> value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')
> result = pa.UnionArray.from_arrays([binary, int64], types, value_offsets)
> batch = pa.RecordBatch.from_arrays([result], ["test"])
> sink = pa.BufferOutputStream()
> writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> writer.write_batch(batch)
> sink.close()
> b = sink.get_result()
> with open("union_array.arrow", "wb") as f:
> f.write(b)
> # Sanity check: Read the batch in again
> with open("union_array.arrow", "rb") as f:
> b = f.read()
> reader = pa.RecordBatchStreamReader(pa.BufferReader(b))
> batch = reader.read_next_batch()
> print("union array is", batch.column(0))
> {code}
> I attached the file generated by that script. Then when I run the following 
> code in Java:
> {code}
> RootAllocator allocator = new RootAllocator(10);
> ByteArrayInputStream in = new 
> ByteArrayInputStream(Files.readAllBytes(Paths.get("union_array.arrow")));
> ArrowStreamReader reader = new ArrowStreamReader(in, allocator);
> reader.loadNextBatch()
> {code}
> I get the following error:
> {code}
> |  java.lang.IllegalArgumentException thrown: Could not load buffers for 
> field test: Union(Sparse, [22, 5])<0: Binary, 1: Int(64, true)>. error 
> message: can not truncate buffer to a larger size 7: 0
> |at VectorLoader.loadBuffers (VectorLoader.java:83)
> |at VectorLoader.load (VectorLoader.java:62)
> |at ArrowReader$1.visit (ArrowReader.java:125)
> |at ArrowReader$1.visit (ArrowReader.java:111)
> |at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
> |at ArrowReader.loadNextBatch (ArrowReader.java:137)
> |at (#7:1)
> {code}
> It seems like Java is not picking up that the UnionArray is Dense instead of 
> Sparse. After changing the default in 
> java/vector/src/main/codegen/templates/UnionVector.java from Sparse to Dense, 
> I get this:
> {code}
> jshell> reader.getVectorSchemaRoot().getSchema()
> $9 ==> Schema [0])<: Int(64, true)>
> {code}
> but then reading doesn't work:
> {code}
> jshell> reader.loadNextBatch()
> |  java.lang.IllegalArgumentException thrown: Could not load buffers for 
> field list: Union(Dense, [1])<: Struct Int(64, true). error message: can not truncate buffer to a larger size 1: > 0
> |at VectorLoader.loadBuffers (VectorLoader.java:83)
> |at VectorLoader.load (VectorLoader.java:62)
> |at ArrowReader$1.visit (ArrowReader.java:125)
> |at ArrowReader$1.visit (ArrowReader.java:111)
> |at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
> |at ArrowReader.loadNextBatch (ArrowReader.java:137)
> |at (#8:1)
> {code}
> Any help with this is appreciated!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1705) [Python] Create StructArray (+ type inference) from sequence of dicts

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1705:

Fix Version/s: 0.9.0

> [Python] Create StructArray (+ type inference) from sequence of dicts
> -
>
> Key: ARROW-1705
> URL: https://issues.apache.org/jira/browse/ARROW-1705
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> See https://github.com/apache/arrow/issues/1217
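A sketch of the desired inference; the direct {{pa.array}} call on dicts in the comment is the feature being requested, not the current API:

{code:python}
import pyarrow as pa

records = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]

# Desired: infer struct<a: int64, b: string> and build a StructArray:
# arr = pa.array(records)
{code}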



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1645) Access HDFS with read_table() automatically

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1645:

Fix Version/s: 0.9.0

> Access HDFS with read_table() automatically
> ---
>
> Key: ARROW-1645
> URL: https://issues.apache.org/jira/browse/ARROW-1645
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Ehsan Totoni
> Fix For: 0.9.0
>
>
> It'd be great to support accessing HDFS automatically, like: 
> `pq.read_table('hdfs://example.parquet')`
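For context, a sketch of the current explicit workflow next to the proposed URI-based one (host, port, and paths here are placeholders):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Current workaround: connect explicitly and pass the filesystem through.
fs = pa.hdfs.connect('namenode-host', 8020)
table = pq.ParquetDataset('/data/example.parquet', filesystem=fs).read()

# Proposed: infer the filesystem from the URI scheme.
# table = pq.read_table('hdfs://namenode-host:8020/data/example.parquet')
{code}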



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1644) [Python] Read and write nested Parquet data with a mix of struct and list nesting levels

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1644:

Fix Version/s: 0.9.0

> [Python] Read and write nested Parquet data with a mix of struct and list 
> nesting levels
> 
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: DB Tsai
> Fix For: 0.9.0
>
>
> We have many nested parquet files generated from Apache Spark for ranking 
> problems, and we would like to load them in python for other programs to 
> consume. 
> The schema looks like 
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- show_title_id: integer (nullable = true)
>  |||-- duration: double (nullable = true)
> {code}
> And when I tried to load it with a nightly build of pyarrow on Oct 4, 2017, I got 
> the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 823, in read_table
> use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 119, in read
> nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I somehow get the impression that after 
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be 
> able to load the nested parquet in pyarrow. 
> Any insight about this? 
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1645) Access HDFS with read_table() automatically

2017-12-07 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282084#comment-16282084
 ] 

Wes McKinney commented on ARROW-1645:
-

Could you submit a patch for this?

> Access HDFS with read_table() automatically
> ---
>
> Key: ARROW-1645
> URL: https://issues.apache.org/jira/browse/ARROW-1645
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Ehsan Totoni
> Fix For: 0.9.0
>
>
> It'd be great to support accessing HDFS automatically, like: 
> `pq.read_table('hdfs://example.parquet')`



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1643) [Python] Accept hdfs:// prefixes in parquet.read_table and attempt to connect to HDFS

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1643:

Fix Version/s: 0.9.0

> [Python] Accept hdfs:// prefixes in parquet.read_table and attempt to connect 
> to HDFS
> -
>
> Key: ARROW-1643
> URL: https://issues.apache.org/jira/browse/ARROW-1643
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1623) [C++] Add convenience method to construct Buffer from a string that owns its memory

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1623:

Fix Version/s: 0.9.0

> [C++] Add convenience method to construct Buffer from a string that owns its 
> memory
> ---
>
> Key: ARROW-1623
> URL: https://issues.apache.org/jira/browse/ARROW-1623
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> The memory would need to be allocated from a memory pool / buffer allocator



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (ARROW-1645) Access HDFS with read_table() automatically

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-1645.
---
Resolution: Duplicate

Duplicate of ARROW-1643

> Access HDFS with read_table() automatically
> ---
>
> Key: ARROW-1645
> URL: https://issues.apache.org/jira/browse/ARROW-1645
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Ehsan Totoni
> Fix For: 0.9.0
>
>
> It'd be great to support accessing HDFS automatically, like: 
> `pq.read_table('hdfs://example.parquet')`



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1599) PyArrow unable to read Parquet files with vector as column

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1599:

Fix Version/s: 0.9.0

> PyArrow unable to read Parquet files with vector as column
> --
>
> Key: ARROW-1599
> URL: https://issues.apache.org/jira/browse/ARROW-1599
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Ubuntu
>Reporter: Jovann Kung
> Fix For: 0.9.0
>
>
> Is PyArrow currently unable to read in Parquet files with a vector as a 
> column? For example, the schema of such a file is below:
> {{
> mbc: FLOAT
> deltae: FLOAT
> labels: FLOAT
> features.type: INT32 INT_8
> features.size: INT32
> features.indices.list.element: INT32
> features.values.list.element: DOUBLE}}
> Using either pq.read_table() or pq.ParquetDataset('/path/to/parquet').read() 
> yields the following error: ArrowNotImplementedError: Currently only nesting 
> with Lists is supported.
> From the error I assume that this may be implemented in a future release?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1639) [Python] More efficient serialization for RangeIndex in serialize_pandas

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1639:

Fix Version/s: 0.9.0

> [Python] More efficient serialization for RangeIndex in serialize_pandas
> 
>
> Key: ARROW-1639
> URL: https://issues.apache.org/jira/browse/ARROW-1639
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
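The gist: a RangeIndex is fully described by three integers, so serializing the materialized values is wasteful. A pure-Python sketch of the metadata-only round trip (not the actual patch; the public {{start}}/{{stop}}/{{step}} accessors are assumed):

{code:python}
import pandas as pd

idx = pd.RangeIndex(start=0, stop=1_000_000, step=1)

# Ship three integers instead of a million values.
meta = {"start": idx.start, "stop": idx.stop, "step": idx.step}
restored = pd.RangeIndex(meta["start"], meta["stop"], meta["step"])
assert restored.equals(idx)
{code}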




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1579) Add dockerized test setup to validate Spark integration

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1579:

Fix Version/s: 0.9.0

> Add dockerized test setup to validate Spark integration
> ---
>
> Key: ARROW-1579
> URL: https://issues.apache.org/jira/browse/ARROW-1579
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Wes McKinney
>Assignee: Bryan Cutler
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> cc [~bryanc] -- the goal of this will be to validate master-to-master to 
> catch any regressions in the Spark integration



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1579) Add dockerized test setup to validate Spark integration

2017-12-07 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282092#comment-16282092
 ] 

Wes McKinney commented on ARROW-1579:
-

After 0.8.0 settles, it would be great to have this set up. Maybe we can figure 
out a place to run nightlies and send the results to an e-mail list where we can 
check them.

> Add dockerized test setup to validate Spark integration
> ---
>
> Key: ARROW-1579
> URL: https://issues.apache.org/jira/browse/ARROW-1579
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Wes McKinney
>Assignee: Bryan Cutler
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> cc [~bryanc] -- the goal of this will be to validate master-to-master to 
> catch any regressions in the Spark integration



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1572) [C++] Implement "value counts" kernels for tabulating value frequencies

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1572:

Fix Version/s: 0.9.0

> [C++] Implement "value counts" kernels for tabulating value frequencies
> ---
>
> Key: ARROW-1572
> URL: https://issues.apache.org/jira/browse/ARROW-1572
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>  Labels: Analytics
> Fix For: 0.9.0
>
>
> This is related to "match", "isin", and "unique" since hashing is generally 
> required
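The expected semantics, sketched in pure Python (the kernel itself would be a C++ implementation sharing the hashing machinery mentioned above):

{code:python}
from collections import Counter

values = ['a', 'b', 'a', 'a', None, 'b']

# A value-counts kernel would return two parallel arrays: the distinct
# values and their frequencies.
counts = Counter(v for v in values if v is not None)
assert counts == {'a': 3, 'b': 2}
{code}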



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-40) C++: Reinterpret Struct arrays as tables

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-40?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-40:
--
Fix Version/s: 0.9.0

> C++: Reinterpret Struct arrays as tables
> 
>
> Key: ARROW-40
> URL: https://issues.apache.org/jira/browse/ARROW-40
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> This is mostly a question of layering container types, but will be provided 
> as an API convenience. 
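A Python illustration of the layering in question (the {{StructArray.field}} accessor is an assumed convenience here; the requested API is C++):

{code:python}
import pyarrow as pa

struct = pa.StructArray.from_arrays(
    [pa.array([1, 2]), pa.array(["x", "y"])], ["a", "b"])

# Reinterpreting a struct array as a table is just re-labeling the
# child arrays as columns:
table = pa.Table.from_arrays(
    [struct.field("a"), struct.field("b")], ["a", "b"])
{code}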



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-41) C++: Convert table to std::vector of Struct arrays

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-41?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-41:
--
Fix Version/s: 0.9.0

> C++: Convert table to std::vector of Struct arrays
> --
>
> Key: ARROW-41
> URL: https://issues.apache.org/jira/browse/ARROW-41
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> This may require memory allocation depending on the chunking of the table 
> columns. 
> While tables and struct type columns are semantically equivalent (and tables 
> can be embedded in other tables using struct types), the memory layout of a 
> table may not be strictly contiguous. For the purposes of putting data on the 
> wire / in shared memory, it may be useful to offer a conversion function to 
> "structify" an in-memory logical Arrow table. See ARROW-24



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-640) [Python] Arrow types should have a sensible __hash__ and comparison

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-640:
---
Fix Version/s: 0.9.0

> [Python] Arrow types should have a sensible __hash__ and comparison
> ---
>
> Key: ARROW-640
> URL: https://issues.apache.org/jira/browse/ARROW-640
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Miki Tebeka
> Fix For: 0.9.0
>
>
> {noformat}
> In [86]: arr = pa.from_pylist([1, 1, 1, 2])
> In [87]: set(arr)
> Out[87]: {1, 2, 1, 1}
> In [88]: arr[0] == arr[1]
> Out[88]: False
> In [89]: arr
> Out[89]: 
> 
> [
>   1,
>   1,
>   1,
>   2
> ]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-640:
---
Summary: [Python] Arrow scalar values should have a sensible __hash__ and 
comparison  (was: [Python] Arrow types should have a sensible __hash__ and 
comparison)

> [Python] Arrow scalar values should have a sensible __hash__ and comparison
> ---
>
> Key: ARROW-640
> URL: https://issues.apache.org/jira/browse/ARROW-640
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Miki Tebeka
> Fix For: 0.9.0
>
>
> {noformat}
> In [86]: arr = pa.from_pylist([1, 1, 1, 2])
> In [87]: set(arr)
> Out[87]: {1, 2, 1, 1}
> In [88]: arr[0] == arr[1]
> Out[88]: False
> In [89]: arr
> Out[89]: 
> 
> [
>   1,
>   1,
>   1,
>   2
> ]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-554) [C++] Implement functions to conform unequal dictionaries amongst multiple Arrow arrays

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-554:
---
Fix Version/s: 0.9.0

> [C++] Implement functions to conform unequal dictionaries amongst multiple 
> Arrow arrays
> ---
>
> Key: ARROW-554
> URL: https://issues.apache.org/jira/browse/ARROW-554
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>  Labels: Analytics
> Fix For: 0.9.0
>
>
> We may wish to either
> * Conform the dictionary indices to reference a common dictionary
> * Concatenate indices into a new array with a common dictionary
> This is related to in-memory dictionary encoding, as you start with a 
> partially-built dictionary and then add entries as you observe new ones in 
> other dictionaries, all the while "rebasing" indices to consistently 
> reference the same dictionary at the end
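A pure-Python sketch of the "rebasing" step (illustrative only; the C++ function signatures are still undefined):

{code:python}
# Two dictionary-encoded arrays with unequal dictionaries.
dict_a, idx_a = ['x', 'y'], [0, 1, 0]
dict_b, idx_b = ['y', 'z'], [0, 0, 1]

# Build a common dictionary, then remap the second array's indices.
common = list(dict.fromkeys(dict_a + dict_b))          # ['x', 'y', 'z']
remap = {i: common.index(v) for i, v in enumerate(dict_b)}
idx_b_rebased = [remap[i] for i in idx_b]              # [1, 1, 2]
{code}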



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-973) [Website] Add FAQ page about project

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-973:
---
Fix Version/s: (was: 1.0.0)
   0.9.0

> [Website] Add FAQ page about project
> 
>
> Key: ARROW-973
> URL: https://issues.apache.org/jira/browse/ARROW-973
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Website
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> As some suggested initial topics for the FAQ:
> * How Apache Arrow is related to Apache Parquet (the difference between a 
> "storage format" and an "in-memory format" causes confusion)
> * How is Arrow similar to / different from Flatbuffers and Cap'n Proto
> * How Arrow uses Flatbuffers (I have had people incorrectly state to me 
> things like "Arrow is just Flatbuffers under the hood")
> Any other ideas?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-530) C++/Python: Provide subpools for better memory allocation tracking

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-530:
---
Fix Version/s: 0.9.0

> C++/Python: Provide subpools for better memory allocation tracking
> --
>
> Key: ARROW-530
> URL: https://issues.apache.org/jira/browse/ARROW-530
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Uwe L. Korn
>  Labels: beginner, newbie
> Fix For: 0.9.0
>
>
> Currently we can only track the number of bytes allocated by the main memory 
> pool or the alternative jemalloc implementation. To better understand certain 
> situations, we should provide a MemoryPool proxy implementation that tracks 
> only the amount of memory that was requested through its direct calls but 
> delegates the actual allocation to an underlying pool.
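A sketch of the proxy-delegation pattern in Python pseudocode (the real interface is the C++ {{MemoryPool}}; the class and method names here are illustrative):

{code:python}
class TrackingProxyPool:
    """Tracks bytes requested through it; delegates allocation to a parent pool."""

    def __init__(self, parent):
        self.parent = parent
        self.bytes_allocated = 0

    def allocate(self, size):
        buf = self.parent.allocate(size)  # actual allocation is delegated
        self.bytes_allocated += size      # tracking is local to this proxy
        return buf
{code}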



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1009) [C++] Create asynchronous version of StreamReader

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1009:

Fix Version/s: 0.9.0

> [C++] Create asynchronous version of StreamReader
> -
>
> Key: ARROW-1009
> URL: https://issues.apache.org/jira/browse/ARROW-1009
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> The {{AsyncStreamReader}} would buffer the next record batch in a background 
> thread, while emulating the current synchronous / blocking API.
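The prefetching idea, sketched with Python threads (illustrative only; the proposal is a C++ class, and the reader here is assumed to be an iterable of record batches):

{code:python}
import queue
import threading

def prefetched(reader, capacity=1):
    """Yield batches while a background thread reads ahead."""
    q = queue.Queue(maxsize=capacity)
    done = object()

    def worker():
        for batch in reader:
            q.put(batch)   # blocks once `capacity` batches are buffered
        q.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            break
        yield item
{code}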



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1012) [C++] Create implementation of StreamReader that reads from Apache Parquet files

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1012:

Fix Version/s: 0.9.0

> [C++] Create implementation of StreamReader that reads from Apache Parquet 
> files
> 
>
> Key: ARROW-1012
> URL: https://issues.apache.org/jira/browse/ARROW-1012
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> This will be enabled by ARROW-1008



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-501) [C++] Implement concurrent / buffering InputStream for streaming data use cases

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-501:
---
Fix Version/s: 0.9.0

> [C++] Implement concurrent / buffering InputStream for streaming data use 
> cases
> ---
>
> Key: ARROW-501
> URL: https://issues.apache.org/jira/browse/ARROW-501
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> Related to ARROW-500, when processing an input data stream, we may wish to 
> continue buffering input (up to a maximum buffer size) in between 
> synchronous Read calls



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1329) [C++] Define "virtual table" interface

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1329:

Fix Version/s: 1.0.0

> [C++] Define "virtual table" interface
> --
>
> Key: ARROW-1329
> URL: https://issues.apache.org/jira/browse/ARROW-1329
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 1.0.0
>
>
> The idea is that a virtual table may reference Arrow data that is not yet 
> available in memory. The implementation will define the semantics of how 
> columns are loaded into memory. 
> A virtual column interface will need to accompany this. For example:
> {code:language=c++}
> std::shared_ptr<VirtualTable> vtable = ...;
> std::shared_ptr<VirtualColumn> vcolumn = vtable->column(i);
> std::shared_ptr<Column> column = vcolumn->Materialize();
> std::shared_ptr<Table> table = vtable->Materialize();
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1382) [Python] Deduplicate non-scalar Python objects when using pyarrow.serialize

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1382:

Fix Version/s: 0.9.0

> [Python] Deduplicate non-scalar Python objects when using pyarrow.serialize
> ---
>
> Key: ARROW-1382
> URL: https://issues.apache.org/jira/browse/ARROW-1382
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
> Fix For: 0.9.0
>
>
> If a Python object appears multiple times within a list/tuple/dictionary, 
> then when pyarrow serializes the object, it will duplicate the object many 
> times. This leads to a potentially huge expansion in the size of the object 
> (e.g., the serialized version of {{100 * [np.zeros(10 ** 6)]}} will be 100 
> times bigger than it needs to be).
> {code}
> import pyarrow as pa
> l = [0]
> original_object = [l, l]
> # Serialize and deserialize the object.
> buf = pa.serialize(original_object).to_buffer()
> new_object = pa.deserialize(buf)
> # This works.
> assert original_object[0] is original_object[1]
> # This fails.
> assert new_object[0] is new_object[1]
> {code}
> One potential way to address this is to use the Arrow dictionary encoding.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-792) [Java] Allow loading/unloading vectors without using FieldNodes

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-792:
---
Summary: [Java] Allow loading/unloading vectors without using FieldNodes  
(was: Allow loading/unloading vectors without using FieldNodes)

> [Java] Allow loading/unloading vectors without using FieldNodes
> ---
>
> Key: ARROW-792
> URL: https://issues.apache.org/jira/browse/ARROW-792
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Steven Phillips
>Assignee: Steven Phillips
>
> The information stored in the FieldNode structure is not strictly necessary for 
> serializing/deserializing vectors. We should allow loading/unloading of 
> vectors without it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1393) [C++] Simplified CUDA IPC writer and reader for communicating a CPU + GPU payload to another process

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1393:

Fix Version/s: 0.9.0

> [C++] Simplified CUDA IPC writer and reader for communicating a CPU + GPU 
> payload to another process
> 
>
> Key: ARROW-1393
> URL: https://issues.apache.org/jira/browse/ARROW-1393
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> The purpose of this would be to simplify transmission of a mixed-device 
> payload from one process to another. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-522) [Java] VectorLoader throws exception data schema contains list of maps.

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-522:
---
Summary: [Java] VectorLoader throws exception data schema contains list of 
maps.  (was: VectorLoader throws exception data schema contains list of maps.)

> [Java] VectorLoader throws exception data schema contains list of maps.
> ---
>
> Key: ARROW-522
> URL: https://issues.apache.org/jira/browse/ARROW-522
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Affects Versions: 0.1.0
>Reporter: Rock Wang
>Priority: Critical
>
> I encountered this exception
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: should have as 
> many children as in the schema: found 0 expected 2
> at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
> at org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:91)
> at org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:95)
> at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:69)
> {code}
> The test code is
> {code:java}
> public class ArrowTest {
> public static class ByteArrayReadableSeekableByteChannel implements 
> SeekableByteChannel {
> private byte[] byteArray;
> private int position = 0;
> public ByteArrayReadableSeekableByteChannel(byte[] byteArray) {
> if (byteArray == null) {
> throw new NullPointerException();
> }
> this.byteArray = byteArray;
> }
> @Override
> public boolean isOpen() {
> return byteArray != null;
> }
> @Override
> public void close() throws IOException {
> byteArray = null;
> }
> @Override
> public int read(final ByteBuffer dst) throws IOException {
> int remainingInBuf = byteArray.length - this.position;
> int length = Math.min(dst.remaining(), remainingInBuf);
> dst.put(this.byteArray, this.position, length);
> this.position += length;
> return length;
> }
> @Override
> public long position() throws IOException {
> return this.position;
> }
> @Override
> public SeekableByteChannel position(final long newPosition) throws 
> IOException {
> this.position = (int) newPosition;
> return this;
> }
> @Override
> public long size() throws IOException {
> return this.byteArray.length;
> }
> @Override
> public int write(final ByteBuffer src) throws IOException {
> throw new UnsupportedOperationException("Read only");
> }
> @Override
> public SeekableByteChannel truncate(final long size) throws 
> IOException {
> throw new UnsupportedOperationException("Read only");
> }
> }
> public static void main(String[] argv) throws Exception {
> ByteArrayOutputStream byteArrayOutputStream = new 
> ByteArrayOutputStream();
> // write
> try (BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
> BufferAllocator originalVectorAllocator = allocator
> .newChildAllocator("child allocator", 1024, 
> Integer.MAX_VALUE);
> MapVector parent = new MapVector("parent", 
> originalVectorAllocator, null)
> ) {
> writeData(10, parent);
> write(parent.getChild("root"), 
> Channels.newChannel(byteArrayOutputStream));
> }
> byte[] data = byteArrayOutputStream.toByteArray();
> // read
> try (BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
> BufferAllocator readerAllocator = 
> allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE);
> ArrowReader arrowReader = new ArrowReader(new 
> ByteArrayReadableSeekableByteChannel(data),
> readerAllocator);
> BufferAllocator vectorAllocator = 
> allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE);
> MapVector parent = new MapVector("parent", vectorAllocator, 
> null)
> ) {
> ArrowFooter footer = arrowReader.readFooter();
> Schema schema = footer.getSchema();
> NullableMapVector root = parent.addOrGet("root", 
> Types.MinorType.MAP, NullableMapVector.class);
> VectorLoader vectorLoader = new VectorLoader(schema, root);
> List<ArrowBlock> recordBatches = footer.getRecordBatches();
> for (ArrowBlock rbBlock : recordBatches) {
> try (ArrowRecordBatch recordBatch = 
> arrowReader.readReco

[jira] [Updated] (ARROW-764) [C++] Improve performance of CopyBitmap, add benchmarks

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-764:
---
Fix Version/s: 0.9.0

> [C++] Improve performance of CopyBitmap, add benchmarks
> ---
>
> Key: ARROW-764
> URL: https://issues.apache.org/jira/browse/ARROW-764
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> This is follow up work after a discussion in the patch for ARROW-657



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-567) [C++] File and stream APIs for interacting with "large" schemas

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-567:
---
Fix Version/s: 0.9.0

> [C++] File and stream APIs for interacting with "large" schemas
> ---
>
> Key: ARROW-567
> URL: https://issues.apache.org/jira/browse/ARROW-567
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> For data where the metadata itself is large (> 1 fields), doing a full 
> in-memory reconstruction of a record batch may be impractical if the user's 
> goal is to do random access on a potentially small subset of a batch. 
> I propose adding an API that enables "cheap" inspection of the record batch 
> metadata and reconstruction of fields. 
> Because of the flattened buffer and field metadata, at the moment the 
> complexity of random field access will scale with the number of fields -- in 
> the future we may devise strategies to mitigate this (e.g. storing a 
> pre-computed buffer/field lookup table in the schema)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-976) [Python] Provide API for defining and reading Parquet datasets with more ad hoc partition schemes

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-976:
---
Fix Version/s: 0.9.0

> [Python] Provide API for defining and reading Parquet datasets with more ad 
> hoc partition schemes
> -
>
> Key: ARROW-976
> URL: https://issues.apache.org/jira/browse/ARROW-976
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-638) [Format] Add metadata for single and double precision complex numbers

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-638:
---
Fix Version/s: 1.0.0

> [Format] Add metadata for single and double precision complex numbers
> -
>
> Key: ARROW-638
> URL: https://issues.apache.org/jira/browse/ARROW-638
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
> Fix For: 1.0.0
>
>
> Numerical computing libraries like NumPy and TensorFlow feature complex64 and 
> complex128 numbers



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1424) [Python] Initial bindings for libarrow_gpu

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1424:

Fix Version/s: 0.9.0

> [Python] Initial bindings for libarrow_gpu
> --
>
> Key: ARROW-1424
> URL: https://issues.apache.org/jira/browse/ARROW-1424
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: GPU, Python
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1329) [C++] Define "virtual table" interface

2017-12-07 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282175#comment-16282175
 ] 

Wes McKinney commented on ARROW-1329:
-

There has been partial progress toward this. It may make sense to add a pure 
virtual method to {{arrow::Table}} which ensures that all data is loaded into 
memory. This will allow different Table implementations to define their own 
logic for materialization

> [C++] Define "virtual table" interface
> --
>
> Key: ARROW-1329
> URL: https://issues.apache.org/jira/browse/ARROW-1329
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 1.0.0
>
>
> The idea is that a virtual table may reference Arrow data that is not yet 
> available in memory. The implementation will define the semantics of how 
> columns are loaded into memory. 
> A virtual column interface will need to accompany this. For example:
> {code:language=c++}
> std::shared_ptr<VirtualTable> vtable = ...;
> std::shared_ptr<VirtualColumn> vcolumn = vtable->column(i);
> std::shared_ptr<Column> column = vcolumn->Materialize();
> std::shared_ptr<Table> table = vtable->Materialize();
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1501) [JS] JavaScript integration tests

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1501:

Fix Version/s: 0.9.0

> [JS] JavaScript integration tests
> -
>
> Key: ARROW-1501
> URL: https://issues.apache.org/jira/browse/ARROW-1501
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Wes McKinney
> Fix For: 0.9.0
>
>
> Tracking JIRA for integration test-related issues



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1570) [C++] Define API for creating a kernel instance from function of scalar input and output with a particular signature

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1570:

Fix Version/s: 0.9.0

> [C++] Define API for creating a kernel instance from function of scalar input 
> and output with a particular signature
> 
>
> Key: ARROW-1570
> URL: https://issues.apache.org/jira/browse/ARROW-1570
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>  Labels: Analytics
> Fix For: 0.9.0
>
>
> This could include an {{std::function}} instance (but these cannot be inlined 
> by the C++ compiler), but should also permit use with inline-able functions 
> or functors



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1580) [Python] Instructions for setting up nightly builds on Linux

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1580:

Fix Version/s: 0.9.0

> [Python] Instructions for setting up nightly builds on Linux
> 
>
> Key: ARROW-1580
> URL: https://issues.apache.org/jira/browse/ARROW-1580
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.7.1
>Reporter: Wes McKinney
>Assignee: Phillip Cloud
> Fix For: 0.9.0
>
>
> cc [~cpcloud]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1567) [C++] Implement "fill null" kernels that replace null values with some scalar replacement value

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1567:

Fix Version/s: 1.0.0

> [C++] Implement "fill null" kernels that replace null values with some scalar 
> replacement value
> ---
>
> Key: ARROW-1567
> URL: https://issues.apache.org/jira/browse/ARROW-1567
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>  Labels: Analytics
> Fix For: 1.0.0
>
>
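The intended semantics, sketched in pure Python (the kernels themselves would be C++ implementations):

{code:python}
values = [1, None, 3]
replacement = 0

# fill_null(values, replacement): nulls become the scalar replacement
# value; all other values pass through unchanged.
filled = [replacement if v is None else v for v in values]
assert filled == [1, 0, 3]
{code}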




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1900) [C++] Add utility functions for determining value range (maximum and minimum) of integer arrays

2017-12-07 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1900:
---

 Summary: [C++] Add utility functions for determining value range 
(maximum and minimum) of integer arrays
 Key: ARROW-1900
 URL: https://issues.apache.org/jira/browse/ARROW-1900
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


These functions don't need to be kernels right away; they are useful internally 
for determining when a "small range" alternative to a hash table can be used 
for integer arrays. The maximum and minimum are determined in a single scan.
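A pure-Python sketch of the single-scan range computation (the proposed utilities are C++ internals):

{code:python}
def value_range(values):
    """Return (minimum, maximum) of a sequence in one pass, skipping nulls."""
    lo = hi = None
    for v in values:
        if v is None:
            continue
        if lo is None or v < lo:
            lo = v
        if hi is None or v > hi:
            hi = v
    return lo, hi

assert value_range([3, None, 1, 7]) == (1, 7)
{code}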



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1569) [C++] Kernel functions for determining monotonicity (ascending or descending) for well-ordered types

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1569:

Fix Version/s: 1.0.0

> [C++] Kernel functions for determining monotonicity (ascending or descending) 
> for well-ordered types
> 
>
> Key: ARROW-1569
> URL: https://issues.apache.org/jira/browse/ARROW-1569
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>  Labels: Analytics
> Fix For: 1.0.0
>
>
> These kernels must offer some stateful variant so that monotonicity can be 
> determined across chunked arrays
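A sketch of a stateful check that carries the last value across chunk boundaries (pure Python; the kernels would be C++):

{code:python}
def is_nondecreasing(chunks):
    prev = None
    for chunk in chunks:          # state (prev) survives chunk boundaries
        for v in chunk:
            if prev is not None and v < prev:
                return False
            prev = v
    return True

assert is_nondecreasing([[1, 2], [2, 5]])
assert not is_nondecreasing([[1, 3], [2, 5]])
{code}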



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1560) [C++] Kernel implementations for "match" function

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1560:

Fix Version/s: 0.9.0

> [C++] Kernel implementations for "match" function
> -
>
> Key: ARROW-1560
> URL: https://issues.apache.org/jira/browse/ARROW-1560
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>  Labels: Analytics
> Fix For: 0.9.0
>
>
> Match computes a position index array from an array of values into a set of 
> categories.
> {code}
> match(['a', 'b', 'a', null, 'b', 'a', 'b'], ['b', 'a'])
> return [1, 0, 1, null, 0, 1, 0]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1561) [C++] Kernel implementations for "isin" (set containment)

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1561:

Fix Version/s: 0.9.0

> [C++] Kernel implementations for "isin" (set containment)
> -
>
> Key: ARROW-1561
> URL: https://issues.apache.org/jira/browse/ARROW-1561
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>  Labels: Analytics
> Fix For: 0.9.0
>
>
> isin determines whether each element in the left array is contained in the 
> values in the right array. This function must handle the case where the right 
> array has nulls (so that null in the left array will return true)
> {code}
> isin(['a', 'b', null], ['a', 'c'])
> returns [true, false, null]
> isin(['a', 'b', null], ['a', 'c', null])
> returns [true, false, true]
> {code}
> May need an option to return false for null instead of null



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1884) [C++] Make JsonReader/JsonWriter classes internal APIs

2017-12-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282298#comment-16282298
 ] 

ASF GitHub Bot commented on ARROW-1884:
---

wesm opened a new pull request #1400: ARROW-1884: [C++] Exclude integration 
test JSON reader/writer classes from public API
URL: https://github.com/apache/arrow/pull/1400
 
 
   These were showing up in our Doxygen docs and may mislead users reading the 
public API into thinking these classes do something that they do not (they 
don't read general JSON)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Make JsonReader/JsonWriter classes internal APIs
> --
>
> Key: ARROW-1884
> URL: https://issues.apache.org/jira/browse/ARROW-1884
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> These are exposed in the public API in {{arrow::ipc}}, and could possibly 
> mislead users: http://arrow.apache.org/docs/cpp/namespacearrow_1_1ipc.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1884) [C++] Make JsonReader/JsonWriter classes internal APIs

2017-12-07 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1884:
--
Labels: pull-request-available  (was: )

> [C++] Make JsonReader/JsonWriter classes internal APIs
> --
>
> Key: ARROW-1884
> URL: https://issues.apache.org/jira/browse/ARROW-1884
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> These are exposed in the public API in {{arrow::ipc}}, and could possibly 
> mislead users: http://arrow.apache.org/docs/cpp/namespacearrow_1_1ipc.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-1873) [Python] Segmentation fault when loading total 2GB of parquet files

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1873:
---

Assignee: Wes McKinney

> [Python] Segmentation fault when loading total 2GB of parquet files
> ---
>
> Key: ARROW-1873
> URL: https://issues.apache.org/jira/browse/ARROW-1873
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: DB Tsai
>Assignee: Wes McKinney
> Fix For: 0.8.0
>
>
> We are trying to load 100 parquet files, and each of them is around 20MB. 
> Before we port [ARROW-1830] into our pyarrow distribution, we use {{glob}} to 
> list all the files, and then load them as pandas dataframe through pyarrow. 
> The schema of the parquet files is like 
> {code:java}
> root
>  |-- dateint: integer (nullable = true)
>  |-- profileid: long (nullable = true)
>  |-- time: long (nullable = true)
>  |-- label: double (nullable = true)
>  |-- weight: double (nullable = true)
>  |-- features: array (nullable = true)
>  ||-- element: double (containsNull = true)
> {code}
> If we only load a couple of them, it works without any issue. However, when 
> loading 100 of them, we got segmentation fault as the following. FYI, if we 
> flatten {{features: array[double]}} into top level, the file sizes are around 
> the same, and work fine too. 
> Is there anything we can try to eliminate this issue? Thanks.
> {code}
> >>> import glob
> >>> files = glob.glob("/home/dbt/data/*")
> >>> data = pq.ParquetDataset(files).read().to_pandas()
> [New Thread 0x7fffe8f84700 (LWP 23769)]
> [New Thread 0x7fffe3b93700 (LWP 23770)]
> [New Thread 0x7fffe3392700 (LWP 23771)]
> [New Thread 0x7fffe2b91700 (LWP 23772)]
> [Thread 0x7fffe2b91700 (LWP 23772) exited]
> [Thread 0x7fffe3b93700 (LWP 23770) exited]
> Thread 4 "python" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffe3392700 (LWP 23771)]
> 0x7270fc94 in arrow::Status 
> arrow::VisitTypeInline(arrow::DataType const&, 
> arrow::py::ArrowDeserializer*) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> (gdb) backtrace
> #0  0x7270fc94 in arrow::Status 
> arrow::VisitTypeInline(arrow::DataType const&, 
> arrow::py::ArrowDeserializer*) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #1  0x72700b5a in 
> arrow::py::ConvertColumnToPandas(arrow::py::PandasOptions, 
> std::shared_ptr const&, _object*, _object**) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #2  0x72714985 in arrow::Status 
> arrow::py::ConvertListsLike(arrow::py::PandasOptions, 
> std::shared_ptr const&, _object**) () from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #3  0x72716b92 in 
> arrow::py::ObjectBlock::Write(std::shared_ptr const&, long, 
> long) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #4  0x7270a489 in 
> arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}::operator()(int)
>  const ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #5  0x7270a67c in std::thread::_Impl arrow::ParallelFor(int,
>  int, 
> arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}&)::{lambda()#1}
>  ()> >::_M_run() ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #6  0x71e30c5c in std::execute_native_thread_routine_compat 
> (__p=)
> at 
> /opt/conda/conda-bld/compilers_linux-64_1505664199673/work/.build/src/gcc-7.2.0/libstdc++-v3/src/c++11/thread.cc:110
> #7  0x77bc16ba in start_thread (arg=0x7fffe3392700) at 
> pthread_create.c:333
> #8  0x778f73dd in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1873) [Python] Segmentation fault when loading total 2GB of parquet files

2017-12-07 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282316#comment-16282316
 ] 

Wes McKinney commented on ARROW-1873:
-

I'm going to add a few missing null checks to help catch the OOM and will put up 
a patch soon. I'm not sure how else to help with this.

> [Python] Segmentation fault when loading total 2GB of parquet files
> ---
>
> Key: ARROW-1873
> URL: https://issues.apache.org/jira/browse/ARROW-1873
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: DB Tsai
>Assignee: Wes McKinney
> Fix For: 0.8.0
>
>
> We are trying to load 100 parquet files, and each of them is around 20MB. 
> Before we port [ARROW-1830] into our pyarrow distribution, we use {{glob}} to 
> list all the files, and then load them as pandas dataframe through pyarrow. 
> The schema of the parquet files is like 
> {code:java}
> root
>  |-- dateint: integer (nullable = true)
>  |-- profileid: long (nullable = true)
>  |-- time: long (nullable = true)
>  |-- label: double (nullable = true)
>  |-- weight: double (nullable = true)
>  |-- features: array (nullable = true)
>  ||-- element: double (containsNull = true)
> {code}
> If we only load a couple of them, it works without any issue. However, when 
> loading 100 of them, we got segmentation fault as the following. FYI, if we 
> flatten {{features: array[double]}} into top level, the file sizes are around 
> the same, and work fine too. 
> Is there anything we can try to eliminate this issue? Thanks.
> {code}
> >>> import glob
> >>> files = glob.glob("/home/dbt/data/*")
> >>> data = pq.ParquetDataset(files).read().to_pandas()
> [New Thread 0x7fffe8f84700 (LWP 23769)]
> [New Thread 0x7fffe3b93700 (LWP 23770)]
> [New Thread 0x7fffe3392700 (LWP 23771)]
> [New Thread 0x7fffe2b91700 (LWP 23772)]
> [Thread 0x7fffe2b91700 (LWP 23772) exited]
> [Thread 0x7fffe3b93700 (LWP 23770) exited]
> Thread 4 "python" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffe3392700 (LWP 23771)]
> 0x7270fc94 in arrow::Status 
> arrow::VisitTypeInline(arrow::DataType const&, 
> arrow::py::ArrowDeserializer*) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> (gdb) backtrace
> #0  0x7270fc94 in arrow::Status 
> arrow::VisitTypeInline(arrow::DataType const&, 
> arrow::py::ArrowDeserializer*) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #1  0x72700b5a in 
> arrow::py::ConvertColumnToPandas(arrow::py::PandasOptions, 
> std::shared_ptr const&, _object*, _object**) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #2  0x72714985 in arrow::Status 
> arrow::py::ConvertListsLike(arrow::py::PandasOptions, 
> std::shared_ptr const&, _object**) () from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #3  0x72716b92 in 
> arrow::py::ObjectBlock::Write(std::shared_ptr const&, long, 
> long) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #4  0x7270a489 in 
> arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}::operator()(int)
>  const ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #5  0x7270a67c in std::thread::_Impl arrow::ParallelFor(int,
>  int, 
> arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}&)::{lambda()#1}
>  ()> >::_M_run() ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #6  0x71e30c5c in std::execute_native_thread_routine_compat 
> (__p=)
> at 
> /opt/conda/conda-bld/compilers_linux-64_1505664199673/work/.build/src/gcc-7.2.0/libstdc++-v3/src/c++11/thread.cc:110
> #7  0x77bc16ba in start_thread (arg=0x7fffe3392700) at 
> pthread_create.c:333
> #8  0x778f73dd in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-1901) [Python] Support recursive mkdir for DaskFilesystem

2017-12-07 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-1901:
--

Assignee: Uwe L. Korn

> [Python] Support recursive mkdir for DaskFilesystem
> ---
>
> Key: ARROW-1901
> URL: https://issues.apache.org/jira/browse/ARROW-1901
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1901) [Python] Support recursive mkdir for DaskFilesystem

2017-12-07 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-1901:
--

 Summary: [Python] Support recursive mkdir for DaskFilesystem
 Key: ARROW-1901
 URL: https://issues.apache.org/jira/browse/ARROW-1901
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Uwe L. Korn
 Fix For: 0.8.0
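
For illustration, here is a minimal sketch of what recursive directory creation
could look like on a filesystem object that exposes only a non-recursive
{{mkdir}} plus an {{exists}} check. The method names are hypothetical, not the
actual DaskFilesystem API:

{code}
import posixpath

def mkdir_recursive(fs, path):
    """Create `path` and any missing parent directories, top-down."""
    parent = posixpath.dirname(path)
    if parent and parent != path and not fs.exists(parent):
        mkdir_recursive(fs, parent)  # ensure ancestors exist first
    if not fs.exists(path):
        fs.mkdir(path)               # non-recursive creation of the leaf
{code}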






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1901) [Python] Support recursive mkdir for DaskFilesystem

2017-12-07 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1901:
--
Labels: pull-request-available  (was: )

> [Python] Support recursive mkdir for DaskFilesystem
> ---
>
> Key: ARROW-1901
> URL: https://issues.apache.org/jira/browse/ARROW-1901
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1901) [Python] Support recursive mkdir for DaskFilesystem

2017-12-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282452#comment-16282452
 ] 

ASF GitHub Bot commented on ARROW-1901:
---

xhochy opened a new pull request #1401: ARROW-1901: [Python] Support recursive 
mkdir for DaskFilesystem
URL: https://github.com/apache/arrow/pull/1401
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Support recursive mkdir for DaskFilesystem
> ---
>
> Key: ARROW-1901
> URL: https://issues.apache.org/jira/browse/ARROW-1901
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1903) [JS] Fix typings consuming apache-arrow module when noImplicitAny is false

2017-12-07 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-1903:
--

 Summary: [JS] Fix typings consuming apache-arrow module when 
noImplicitAny is false
 Key: ARROW-1903
 URL: https://issues.apache.org/jira/browse/ARROW-1903
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: 0.8.0
Reporter: Paul Taylor
Assignee: Paul Taylor


The TypeScript compiler has a few bugs that raise compiler errors when valid 
strict-mode code is compiled with some of the strict-mode settings disabled. 
Since we ship the TS source code in the main `apache-arrow` npm module, 
consumers will encounter the following TypeScript compiler errors under these 
conditions:

{code}
# --strictNullChecks=true, --noImplicitAny=false
vector/numeric.ts(57,17): error TS2322: Type 'number' is not assignable to type 
'never'.
vector/numeric.ts(61,35): error TS2322: Type 'number' is not assignable to type 
'never'.
vector/numeric.ts(63,18): error TS2322: Type '0' is not assignable to type 
'never'.
vector/virtual.ts(98,38): error TS2345: Argument of type 'TypedArray' is not 
assignable to parameter of type 'never'.
{code}

The fixes are minor, and I'll add a step to the unit tests to validate that the 
build targets compile with compilation flags different from ours.

Related:
https://github.com/ReactiveX/IxJS/pull/167
https://github.com/Microsoft/TypeScript/issues/20299



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1902) [Python] Remove mkdir race condition from write_to_dataset

2017-12-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282464#comment-16282464
 ] 

ASF GitHub Bot commented on ARROW-1902:
---

xhochy opened a new pull request #1402: ARROW-1902: [Python] Remove mkdir race 
condition from write_to_dataset
URL: https://github.com/apache/arrow/pull/1402
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Remove mkdir race condition from write_to_dataset 
> ---
>
> Key: ARROW-1902
> URL: https://issues.apache.org/jira/browse/ARROW-1902
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> If two processes create the same directory tree, one of them might see that a 
> directory does not exist, but before its call to {{mkdir}} completes, the 
> second process may already have created the directory. In this case the former 
> process will raise an exception.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1902) [Python] Remove mkdir race condition from write_to_dataset

2017-12-07 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-1902:
--

 Summary: [Python] Remove mkdir race condition from 
write_to_dataset 
 Key: ARROW-1902
 URL: https://issues.apache.org/jira/browse/ARROW-1902
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: 0.8.0


If two processes create the same directory tree, one of them might see that a 
directory does not exist, but before its call to {{mkdir}} completes, the 
second process may already have created the directory. In this case the former 
process will raise an exception.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1902) [Python] Remove mkdir race condition from write_to_dataset

2017-12-07 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1902:
--
Labels: pull-request-available  (was: )

> [Python] Remove mkdir race condition from write_to_dataset 
> ---
>
> Key: ARROW-1902
> URL: https://issues.apache.org/jira/browse/ARROW-1902
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> If two processes create the same directory tree, one of them might see that a 
> directory does not exist, but before its call to {{mkdir}} completes, the 
> second process may already have created the directory. In this case the former 
> process will raise an exception.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1904) [C++] Using the raw_values() method on arrow::PrimitiveArray yields unreliable results on some compilers

2017-12-07 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1904:
---

 Summary: [C++] Using the raw_values() method on 
arrow::PrimitiveArray yields unreliable results on some compilers
 Key: ARROW-1904
 URL: https://issues.apache.org/jira/browse/ARROW-1904
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.8.0


I ran into an odd issue where, even though I was casting to 
{{arrow::PrimitiveArray}}, it picked up the {{raw_values}} method from a 
subclass of {{PrimitiveArray}} (whose version accounts for the slice offset).

{code}
(gdb) p reinterpret_cast(reinterpret_cast<const PrimitiveArray&>(arr).raw_values())[0]
$9 = 25
(gdb) p reinterpret_cast(reinterpret_cast<const PrimitiveArray&>(arr).raw_values_)[0]
$10 = 10
(gdb) p arr.offset()
$11 = 15
{code}

I think the {{raw_values}} method in PrimitiveArray should be deprecated and 
removed, since it is dangerous to use: it does not account for the slice 
offset, if any.
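
The same pitfall can be illustrated from Python: a sliced array shares its
parent's buffers, and reading the values buffer directly ignores the slice
offset unless you apply {{Array.offset}} yourself. A sketch, assuming a pyarrow
build that exposes {{Array.buffers()}} (newer than 0.8):

{code}
import numpy as np
import pyarrow as pa

arr = pa.array([10, 11, 12, 13, 14], type=pa.int64())
sliced = arr.slice(2)             # logical view starting at element 2

values_buf = sliced.buffers()[1]  # shared data buffer, offset not applied
raw = np.frombuffer(values_buf, dtype=np.int64)
print(raw[0])                     # 10 -- start of the parent buffer
print(raw[sliced.offset])         # 12 -- correct once the offset is applied
{code}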



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (ARROW-1884) [C++] Make JsonReader/JsonWriter classes internal APIs

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1884.
-
Resolution: Fixed

Issue resolved by pull request 1400
[https://github.com/apache/arrow/pull/1400]

> [C++] Make JsonReader/JsonWriter classes internal APIs
> --
>
> Key: ARROW-1884
> URL: https://issues.apache.org/jira/browse/ARROW-1884
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> These are exposed in the public API in {{arrow::ipc}}, and could possibly 
> mislead users: http://arrow.apache.org/docs/cpp/namespacearrow_1_1ipc.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1884) [C++] Make JsonReader/JsonWriter classes internal APIs

2017-12-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282514#comment-16282514
 ] 

ASF GitHub Bot commented on ARROW-1884:
---

wesm closed pull request #1400: ARROW-1884: [C++] Exclude integration test JSON 
reader/writer classes from public API
URL: https://github.com/apache/arrow/pull/1400
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/ipc/ipc-json-test.cc 
b/cpp/src/arrow/ipc/ipc-json-test.cc
index e496826f9..12fa4bf3e 100644
--- a/cpp/src/arrow/ipc/ipc-json-test.cc
+++ b/cpp/src/arrow/ipc/ipc-json-test.cc
@@ -39,6 +39,7 @@
 
 namespace arrow {
 namespace ipc {
+namespace internal {
 namespace json {
 
 void TestSchemaRoundTrip(const Schema& schema) {
@@ -46,7 +47,7 @@ void TestSchemaRoundTrip(const Schema& schema) {
   rj::Writer writer(sb);
 
   writer.StartObject();
-  ASSERT_OK(internal::WriteSchema(schema, &writer));
+  ASSERT_OK(WriteSchema(schema, &writer));
   writer.EndObject();
 
   std::string json_schema = sb.GetString();
@@ -55,7 +56,7 @@ void TestSchemaRoundTrip(const Schema& schema) {
   d.Parse(json_schema);
 
   std::shared_ptr out;
-  if (!internal::ReadSchema(d, default_memory_pool(), &out).ok()) {
+  if (!ReadSchema(d, default_memory_pool(), &out).ok()) {
 FAIL() << "Unable to read JSON schema: " << json_schema;
   }
 
@@ -70,7 +71,7 @@ void TestArrayRoundTrip(const Array& array) {
   rj::StringBuffer sb;
   rj::Writer writer(sb);
 
-  ASSERT_OK(internal::WriteArray(name, array, &writer));
+  ASSERT_OK(WriteArray(name, array, &writer));
 
   std::string array_as_json = sb.GetString();
 
@@ -82,7 +83,7 @@ void TestArrayRoundTrip(const Array& array) {
   }
 
   std::shared_ptr out;
-  ASSERT_OK(internal::ReadArray(default_memory_pool(), d, array.type(), &out));
+  ASSERT_OK(ReadArray(default_memory_pool(), d, array.type(), &out));
 
   // std::cout << array_as_json << std::endl;
   CompareArraysDetailed(0, *out, array);
@@ -415,5 +416,6 @@ TEST_P(TestJsonRoundTrip, RoundTrip) {
 INSTANTIATE_TEST_CASE_P(TestJsonRoundTrip, TestJsonRoundTrip, BATCH_CASES());
 
 }  // namespace json
+}  // namespace internal
 }  // namespace ipc
 }  // namespace arrow
diff --git a/cpp/src/arrow/ipc/json-integration-test.cc 
b/cpp/src/arrow/ipc/json-integration-test.cc
index f362d9701..37778fa25 100644
--- a/cpp/src/arrow/ipc/json-integration-test.cc
+++ b/cpp/src/arrow/ipc/json-integration-test.cc
@@ -50,8 +50,7 @@ DEFINE_bool(verbose, true, "Verbose output");
 namespace fs = boost::filesystem;
 
 namespace arrow {
-
-class Buffer;
+namespace ipc {
 
 bool file_exists(const char* path) {
   std::ifstream handle(path);
@@ -73,16 +72,15 @@ static Status ConvertJsonToArrow(const std::string& 
json_path,
   std::shared_ptr json_buffer;
   RETURN_NOT_OK(in_file->Read(file_size, &json_buffer));
 
-  std::unique_ptr reader;
-  RETURN_NOT_OK(ipc::JsonReader::Open(json_buffer, &reader));
+  std::unique_ptr reader;
+  RETURN_NOT_OK(internal::json::JsonReader::Open(json_buffer, &reader));
 
   if (FLAGS_verbose) {
 std::cout << "Found schema: " << reader->schema()->ToString() << std::endl;
   }
 
-  std::shared_ptr writer;
-  RETURN_NOT_OK(
-  ipc::RecordBatchFileWriter::Open(out_file.get(), reader->schema(), 
&writer));
+  std::shared_ptr writer;
+  RETURN_NOT_OK(RecordBatchFileWriter::Open(out_file.get(), reader->schema(), 
&writer));
 
   for (int i = 0; i < reader->num_record_batches(); ++i) {
 std::shared_ptr batch;
@@ -101,15 +99,15 @@ static Status ConvertArrowToJson(const std::string& 
arrow_path,
   RETURN_NOT_OK(io::ReadableFile::Open(arrow_path, &in_file));
   RETURN_NOT_OK(io::FileOutputStream::Open(json_path, &out_file));
 
-  std::shared_ptr reader;
-  RETURN_NOT_OK(ipc::RecordBatchFileReader::Open(in_file.get(), &reader));
+  std::shared_ptr reader;
+  RETURN_NOT_OK(RecordBatchFileReader::Open(in_file.get(), &reader));
 
   if (FLAGS_verbose) {
 std::cout << "Found schema: " << reader->schema()->ToString() << std::endl;
   }
 
-  std::unique_ptr writer;
-  RETURN_NOT_OK(ipc::JsonWriter::Open(reader->schema(), &writer));
+  std::unique_ptr writer;
+  RETURN_NOT_OK(internal::json::JsonWriter::Open(reader->schema(), &writer));
 
   for (int i = 0; i < reader->num_record_batches(); ++i) {
 std::shared_ptr batch;
@@ -134,15 +132,15 @@ static Status ValidateArrowVsJson(const std::string& 
arrow_path,
   std::shared_ptr json_buffer;
   RETURN_NOT_OK(json_file->Read(file_size, &json_buffer));
 
-  std::unique_ptr json_reader;
-  RETURN_NOT_OK(ipc::JsonReader::Open(json_buffer, &json_reader));
+  std::unique_ptr json_reader;
+  RETURN_NOT_OK(internal::json::JsonReader::Open(json_buffer, &json_reader));
 

[jira] [Updated] (ARROW-1873) [Python] Segmentation fault when loading total 2GB of parquet files

2017-12-07 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1873:
--
Labels: pull-request-available  (was: )

> [Python] Segmentation fault when loading total 2GB of parquet files
> ---
>
> Key: ARROW-1873
> URL: https://issues.apache.org/jira/browse/ARROW-1873
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: DB Tsai
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> We are trying to load 100 parquet files, each of them around 20MB. 
> Before we port [ARROW-1830] into our pyarrow distribution, we use {{glob}} to 
> list all the files, and then load them as a pandas DataFrame through pyarrow. 
> The schema of the parquet files is as follows: 
> {code:java}
> root
>  |-- dateint: integer (nullable = true)
>  |-- profileid: long (nullable = true)
>  |-- time: long (nullable = true)
>  |-- label: double (nullable = true)
>  |-- weight: double (nullable = true)
>  |-- features: array (nullable = true)
>  ||-- element: double (containsNull = true)
> {code}
> If we only load a couple of them, it works without any issue. However, when 
> loading all 100 of them, we get a segmentation fault, as shown below. FYI, if we 
> flatten {{features: array[double]}} into the top level, the file sizes are around 
> the same and everything works fine too. 
> Is there anything we can try to eliminate this issue? Thanks.
> {code}
> >>> import glob
> >>> files = glob.glob("/home/dbt/data/*")
> >>> data = pq.ParquetDataset(files).read().to_pandas()
> [New Thread 0x7fffe8f84700 (LWP 23769)]
> [New Thread 0x7fffe3b93700 (LWP 23770)]
> [New Thread 0x7fffe3392700 (LWP 23771)]
> [New Thread 0x7fffe2b91700 (LWP 23772)]
> [Thread 0x7fffe2b91700 (LWP 23772) exited]
> [Thread 0x7fffe3b93700 (LWP 23770) exited]
> Thread 4 "python" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffe3392700 (LWP 23771)]
> 0x7270fc94 in arrow::Status 
> arrow::VisitTypeInline(arrow::DataType const&, 
> arrow::py::ArrowDeserializer*) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> (gdb) backtrace
> #0  0x7270fc94 in arrow::Status 
> arrow::VisitTypeInline(arrow::DataType const&, 
> arrow::py::ArrowDeserializer*) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #1  0x72700b5a in 
> arrow::py::ConvertColumnToPandas(arrow::py::PandasOptions, 
> std::shared_ptr const&, _object*, _object**) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #2  0x72714985 in arrow::Status 
> arrow::py::ConvertListsLike(arrow::py::PandasOptions, 
> std::shared_ptr const&, _object**) () from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #3  0x72716b92 in 
> arrow::py::ObjectBlock::Write(std::shared_ptr const&, long, 
> long) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #4  0x7270a489 in 
> arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}::operator()(int)
>  const ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #5  0x7270a67c in std::thread::_Impl arrow::ParallelFor(int,
>  int, 
> arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}&)::{lambda()#1}
>  ()> >::_M_run() ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #6  0x71e30c5c in std::execute_native_thread_routine_compat 
> (__p=)
> at 
> /opt/conda/conda-bld/compilers_linux-64_1505664199673/work/.build/src/gcc-7.2.0/libstdc++-v3/src/c++11/thread.cc:110
> #7  0x77bc16ba in start_thread (arg=0x7fffe3392700) at 
> pthread_create.c:333
> #8  0x778f73dd in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1873) [Python] Segmentation fault when loading total 2GB of parquet files

2017-12-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282516#comment-16282516
 ] 

ASF GitHub Bot commented on ARROW-1873:
---

wesm opened a new pull request #1404: ARROW-1873: [Python] Catch more possible 
Python/OOM errors in to_pandas conversion path
URL: https://github.com/apache/arrow/pull/1404
 
 
   I also ran into a gnarly method-dispatching bug, ARROW-1904, while working on 
this. I will address that deprecation in a separate patch.
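
For users hitting the segfault before this patch lands, one stopgap is to
convert in smaller pieces so a single oversized conversion cannot exhaust
memory in one shot, and so Python-level errors surface per chunk. A rough
sketch (the chunk size is arbitrary):

{code}
import pandas as pd
import pyarrow.parquet as pq

def read_to_pandas_in_chunks(files, chunk=10):
    """Read parquet files a few at a time and concatenate the results."""
    frames = []
    for i in range(0, len(files), chunk):
        table = pq.ParquetDataset(files[i:i + chunk]).read()
        frames.append(table.to_pandas())
    return pd.concat(frames, ignore_index=True)
{code}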


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Segmentation fault when loading total 2GB of parquet files
> ---
>
> Key: ARROW-1873
> URL: https://issues.apache.org/jira/browse/ARROW-1873
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: DB Tsai
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> We are trying to load 100 parquet files, each of them around 20MB. 
> Before we port [ARROW-1830] into our pyarrow distribution, we use {{glob}} to 
> list all the files, and then load them as a pandas DataFrame through pyarrow. 
> The schema of the parquet files is as follows: 
> {code:java}
> root
>  |-- dateint: integer (nullable = true)
>  |-- profileid: long (nullable = true)
>  |-- time: long (nullable = true)
>  |-- label: double (nullable = true)
>  |-- weight: double (nullable = true)
>  |-- features: array (nullable = true)
>  ||-- element: double (containsNull = true)
> {code}
> If we only load a couple of them, it works without any issue. However, when 
> loading all 100 of them, we get a segmentation fault, as shown below. FYI, if we 
> flatten {{features: array[double]}} into the top level, the file sizes are around 
> the same and everything works fine too. 
> Is there anything we can try to eliminate this issue? Thanks.
> {code}
> >>> import glob
> >>> files = glob.glob("/home/dbt/data/*")
> >>> data = pq.ParquetDataset(files).read().to_pandas()
> [New Thread 0x7fffe8f84700 (LWP 23769)]
> [New Thread 0x7fffe3b93700 (LWP 23770)]
> [New Thread 0x7fffe3392700 (LWP 23771)]
> [New Thread 0x7fffe2b91700 (LWP 23772)]
> [Thread 0x7fffe2b91700 (LWP 23772) exited]
> [Thread 0x7fffe3b93700 (LWP 23770) exited]
> Thread 4 "python" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffe3392700 (LWP 23771)]
> 0x7270fc94 in arrow::Status 
> arrow::VisitTypeInline(arrow::DataType const&, 
> arrow::py::ArrowDeserializer*) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> (gdb) backtrace
> #0  0x7270fc94 in arrow::Status 
> arrow::VisitTypeInline(arrow::DataType const&, 
> arrow::py::ArrowDeserializer*) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #1  0x72700b5a in 
> arrow::py::ConvertColumnToPandas(arrow::py::PandasOptions, 
> std::shared_ptr const&, _object*, _object**) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #2  0x72714985 in arrow::Status 
> arrow::py::ConvertListsLike(arrow::py::PandasOptions, 
> std::shared_ptr const&, _object**) () from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #3  0x72716b92 in 
> arrow::py::ObjectBlock::Write(std::shared_ptr const&, long, 
> long) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #4  0x7270a489 in 
> arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}::operator()(int)
>  const ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #5  0x7270a67c in std::thread::_Impl arrow::ParallelFor(int,
>  int, 
> arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}&)::{lambda()#1}
>  ()> >::_M_run() ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #6  0x71e30c5c in std::execute_native_thread_routine_compat 
> (__p=)
> at 
> /opt/conda/conda-bld/compilers_linux-64_1505664199673/work/.build/src/gcc-7.2.0/libstdc++-v3/src/c++11/thread.cc:110
> #7  0x77bc16ba in start_thread (arg=0x7fffe3392700) at 
> pthread_create.c:333
> #8  0x778f73dd in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-1904) [C++] Using the raw_values() method on arrow::PrimitiveArray yields unreliable results on some compilers

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1904:
---

Assignee: Wes McKinney

> [C++] Using the raw_values() method on arrow::PrimitiveArray yields 
> unreliable results on some compilers
> 
>
> Key: ARROW-1904
> URL: https://issues.apache.org/jira/browse/ARROW-1904
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.8.0
>
>
> I ran into an odd issue where, even though I was casting to 
> {{arrow::PrimitiveArray}}, it picked up the {{raw_values}} method from a 
> subclass of {{PrimitiveArray}} (whose version accounts for the slice offset).
> {code}
> (gdb) p reinterpret_cast(reinterpret_cast<const PrimitiveArray&>(arr).raw_values())[0]
> $9 = 25
> (gdb) p reinterpret_cast(reinterpret_cast<const PrimitiveArray&>(arr).raw_values_)[0]
> $10 = 10
> (gdb) p arr.offset()
> $11 = 15
> {code}
> I think the {{raw_values}} method in PrimitiveArray should be deprecated and 
> removed, since it is dangerous to use: it does not account for the slice 
> offset, if any.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1905) [Python] Add more functions for checking exact types in pyarrow.types

2017-12-07 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1905:
---

 Summary: [Python] Add more functions for checking exact types in 
pyarrow.types
 Key: ARROW-1905
 URL: https://issues.apache.org/jira/browse/ARROW-1905
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.8.0


In https://github.com/apache/arrow/blob/master/python/pyarrow/types.py, we can 
check {{pyarrow.is_date}} but not whether something is date32 or date64. See 
discussion in https://github.com/apache/spark/pull/19884#discussion_r155626249
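
Until such helpers land, exact checks can be written against the type objects
directly. A minimal sketch of the kind of predicate this issue asks for:

{code}
import pyarrow as pa

def is_date32(t):
    """Exact match on date32, unlike the coarser is_date check."""
    return t.equals(pa.date32())

def is_date64(t):
    return t.equals(pa.date64())

assert is_date32(pa.date32()) and not is_date32(pa.date64())
{code}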



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1036) [C++] Define abstract API for filtering Arrow streams (e.g. predicate evaluation)

2017-12-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1036:

Fix Version/s: 1.0.0

> [C++] Define abstract API for filtering Arrow streams (e.g. predicate 
> evaluation)
> -
>
> Key: ARROW-1036
> URL: https://issues.apache.org/jira/browse/ARROW-1036
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
> Fix For: 1.0.0
>
>
> It would be useful to be able to apply analytic predicates to an Arrow stream 
> in a composable way. As soon as we are able to compute some simple predicates 
> on in-memory Arrow data, we could define our first version of this.
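
As an illustration of the composability being asked for, here is a sketch in
Python (rather than C++) of predicates as functions from a record batch to a
boolean mask, combinable with a logical AND. Nothing here is an existing Arrow
API; it only shows the intended shape of the abstraction:

{code}
import pyarrow as pa

def greater_than(name, value):
    """Predicate: mask of rows where column `name` exceeds `value`."""
    def pred(batch):
        idx = batch.schema.get_field_index(name)
        return [v > value for v in batch.column(idx).to_pylist()]
    return pred

def logical_and(p, q):
    """Compose two predicates row-wise."""
    return lambda batch: [a and b for a, b in zip(p(batch), q(batch))]

batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 5, 9]), pa.array([2.0, 0.5, 3.0])], ["a", "b"])
print(logical_and(greater_than("a", 2), greater_than("b", 1.0))(batch))
# [False, False, True]
{code}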



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1873) [Python] Segmentation fault when loading total 2GB of parquet files

2017-12-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282585#comment-16282585
 ] 

ASF GitHub Bot commented on ARROW-1873:
---

wesm commented on issue #1404: ARROW-1873: [Python] Catch more possible 
Python/OOM errors in to_pandas conversion path
URL: https://github.com/apache/arrow/pull/1404#issuecomment-350105625
 
 
   Thanks! Here's an AppVeyor build running on my fork; I will merge once it 
has passed: https://ci.appveyor.com/project/wesm/arrow/build/1.0.1566


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Segmentation fault when loading total 2GB of parquet files
> ---
>
> Key: ARROW-1873
> URL: https://issues.apache.org/jira/browse/ARROW-1873
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: DB Tsai
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> We are trying to load 100 parquet files, each of them around 20MB. 
> Before we port [ARROW-1830] into our pyarrow distribution, we use {{glob}} to 
> list all the files, and then load them as a pandas DataFrame through pyarrow. 
> The schema of the parquet files is as follows: 
> {code:java}
> root
>  |-- dateint: integer (nullable = true)
>  |-- profileid: long (nullable = true)
>  |-- time: long (nullable = true)
>  |-- label: double (nullable = true)
>  |-- weight: double (nullable = true)
>  |-- features: array (nullable = true)
>  ||-- element: double (containsNull = true)
> {code}
> If we only load a couple of them, it works without any issue. However, when 
> loading all 100 of them, we get a segmentation fault, as shown below. FYI, if we 
> flatten {{features: array[double]}} into the top level, the file sizes are around 
> the same and everything works fine too. 
> Is there anything we can try to eliminate this issue? Thanks.
> {code}
> >>> import glob
> >>> files = glob.glob("/home/dbt/data/*")
> >>> data = pq.ParquetDataset(files).read().to_pandas()
> [New Thread 0x7fffe8f84700 (LWP 23769)]
> [New Thread 0x7fffe3b93700 (LWP 23770)]
> [New Thread 0x7fffe3392700 (LWP 23771)]
> [New Thread 0x7fffe2b91700 (LWP 23772)]
> [Thread 0x7fffe2b91700 (LWP 23772) exited]
> [Thread 0x7fffe3b93700 (LWP 23770) exited]
> Thread 4 "python" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffe3392700 (LWP 23771)]
> 0x7270fc94 in arrow::Status 
> arrow::VisitTypeInline(arrow::DataType const&, 
> arrow::py::ArrowDeserializer*) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> (gdb) backtrace
> #0  0x7270fc94 in arrow::Status 
> arrow::VisitTypeInline(arrow::DataType const&, 
> arrow::py::ArrowDeserializer*) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #1  0x72700b5a in 
> arrow::py::ConvertColumnToPandas(arrow::py::PandasOptions, 
> std::shared_ptr const&, _object*, _object**) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #2  0x72714985 in arrow::Status 
> arrow::py::ConvertListsLike(arrow::py::PandasOptions, 
> std::shared_ptr const&, _object**) () from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #3  0x72716b92 in 
> arrow::py::ObjectBlock::Write(std::shared_ptr const&, long, 
> long) ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #4  0x7270a489 in 
> arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}::operator()(int)
>  const ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #5  0x7270a67c in std::thread::_Impl arrow::ParallelFor(int,
>  int, 
> arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}&)::{lambda()#1}
>  ()> >::_M_run() ()
>from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #6  0x71e30c5c in std::execute_native_thread_routine_compat 
> (__p=)
> at 
> /opt/conda/conda-bld/compilers_linux-64_1505664199673/work/.build/src/gcc-7.2.0/libstdc++-v3/src/c++11/thread.cc:110
> #7  0x77bc16ba in start_thread (arg=0x7fffe3392700) at 
> pthread_create.c:333
> #8  0x778f73dd in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1816) [Java] Resolve new vector classes structure for timestamp, date and maybe interval

2017-12-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282638#comment-16282638
 ] 

ASF GitHub Bot commented on ARROW-1816:
---

icexelloss closed pull request #1330: ARROW-1816: [Java] Resolve new vector 
classes structure for timestamp, date and maybe interval
URL: https://github.com/apache/arrow/pull/1330
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/java/vector/pom.xml b/java/vector/pom.xml
index 46e06aa1e..b436f5f9c 100644
--- a/java/vector/pom.xml
+++ b/java/vector/pom.xml
@@ -135,6 +135,13 @@
 <groupId>org.apache.drill.tools</groupId>
 <artifactId>drill-fmpp-maven-plugin</artifactId>
 <version>1.5.0</version>
+<dependencies>
+  <dependency>
+    <groupId>org.freemarker</groupId>
+    <artifactId>freemarker</artifactId>
+    <version>2.3.23</version>
+  </dependency>
+</dependencies>
 
   
 generate-fmpp
diff --git a/java/vector/src/main/codegen/data/ValueVectorTypes.tdd 
b/java/vector/src/main/codegen/data/ValueVectorTypes.tdd
index 970d887c7..565174a4d 100644
--- a/java/vector/src/main/codegen/data/ValueVectorTypes.tdd
+++ b/java/vector/src/main/codegen/data/ValueVectorTypes.tdd
@@ -73,26 +73,10 @@
 { class: "UInt8" },
 { class: "Float8",   javaType: "double", boxedType: "Double", 
fields: [{name: "value", type: "double"}] },
 { class: "DateMilli",javaType: "long",  
friendlyType: "LocalDateTime" },
-{ class: "TimeStampSec", javaType: "long",   boxedType: "Long", 
friendlyType: "LocalDateTime" },
-{ class: "TimeStampMilli",   javaType: "long",   boxedType: "Long", 
friendlyType: "LocalDateTime" },
-{ class: "TimeStampMicro",   javaType: "long",   boxedType: "Long", 
friendlyType: "LocalDateTime" },
-{ class: "TimeStampNano",javaType: "long",   boxedType: "Long", 
friendlyType: "LocalDateTime" },
-{ class: "TimeStampSecTZ", javaType: "long",   boxedType: "Long",
- typeParams: [ {name: "timezone", type: 
"String"} ],
- arrowType: 
"org.apache.arrow.vector.types.pojo.ArrowType.Timestamp",
- arrowTypeConstructorParams: 
["org.apache.arrow.vector.types.TimeUnit.SECOND", "timezone"] },
-{ class: "TimeStampMilliTZ", javaType: "long",   boxedType: "Long",
- typeParams: [ {name: "timezone", type: 
"String"} ],
- arrowType: 
"org.apache.arrow.vector.types.pojo.ArrowType.Timestamp",
- arrowTypeConstructorParams: 
["org.apache.arrow.vector.types.TimeUnit.MILLISECOND", "timezone"] },
-{ class: "TimeStampMicroTZ", javaType: "long",   boxedType: "Long",
- typeParams: [ {name: "timezone", type: 
"String"} ],
- arrowType: 
"org.apache.arrow.vector.types.pojo.ArrowType.Timestamp",
- arrowTypeConstructorParams: 
["org.apache.arrow.vector.types.TimeUnit.MICROSECOND", "timezone"] },
-{ class: "TimeStampNanoTZ", javaType: "long",   boxedType: "Long",
- typeParams: [ {name: "timezone", type: 
"String"} ],
- arrowType: 
"org.apache.arrow.vector.types.pojo.ArrowType.Timestamp",
- arrowTypeConstructorParams: 
["org.apache.arrow.vector.types.TimeUnit.NANOSECOND", "timezone"] },
+{ class: "Timestamp",javaType: "long",   boxedType: "Long", 
friendlyType: "LocalDateTime"
+  typeParams: [ {name: "unit", type: "TimeUnit"}, { name: "timezone", 
type: "String"} ],
+  arrowType: "org.apache.arrow.vector.types.pojo.ArrowType.Timestamp",
+},
 { class: "TimeMicro" },
 { class: "TimeNano" }
   ]
@@ -116,7 +100,7 @@
 {
   class: "Decimal",
   maxPrecisionDigits: 38, nDecimalDigits: 4, friendlyType: 
"BigDecimal",
-  typeParams: [ {name: "scale", type: "int"}, { name: "precision", 
type: "int"}],
+  typeParams: [ { name: "precision", type: "int"}, {name: "scale", 
type: "int"} ],
   arrowType: "org.apache.arrow.vector.types.pojo.ArrowType.Decimal",
   fields: [{name: "start", type: "int"}, {name: "buffer", type: 
"ArrowBuf"}]
 }
diff --git a/java/vector/src/main/codegen/includes/vv_imports.ftl 
b/java/vector/src/main/codegen/includes/vv_imports.ftl
index a55304d73..28a8953e2 100644
--- a/java/vector/src/main/codegen/includes/vv_imports.ftl
+++ b/java/vector/src/main/codegen/includes/vv_imports.ftl
@@ -55,6 +55,7 @@ import java.math.BigDecimal;
 import java.math.BigInteger;
