[jira] [Assigned] (ARROW-1644) [Python] Read and write nested Parquet data with a mix of struct and list nesting levels

2018-05-17 Thread Joshua Storck (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Storck reassigned ARROW-1644:


Assignee: Joshua Storck

> [Python] Read and write nested Parquet data with a mix of struct and list 
> nesting levels
> 
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: DB Tsai
>Assignee: Joshua Storck
>Priority: Major
> Fix For: 0.10.0
>
>
> We have many nested parquet files generated from Apache Spark for ranking 
> problems, and we would like to load them in python for other programs to 
> consume. 
> The schema looks like 
> {code}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- show_title_id: integer (nullable = true)
>  |||-- duration: double (nullable = true)
> {code}
> And when I tried to load it with a nightly build of pyarrow from Oct 4, 2017,
> I got the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 823, in read_table
> use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 119, in read
> nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I was under the impression that once
> https://issues.apache.org/jira/browse/PARQUET-911 was merged, we would be
> able to load nested Parquet data in pyarrow.
> Any insight into this?
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2585) Add Decimal128::FromBigEndian

2018-05-17 Thread Joshua Storck (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Storck reassigned ARROW-2585:


Assignee: Joshua Storck

> Add Decimal128::FromBigEndian
> -
>
> Key: ARROW-2585
> URL: https://issues.apache.org/jira/browse/ARROW-2585
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Joshua Storck
>Assignee: Joshua Storck
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This code is being moved from 
> https://github.com/apache/parquet-cpp/blob/8046481235e558344c3aa059c83ee86b9f67/src/parquet/arrow/reader.cc#L1049
 for use in this PR: https://github.com/apache/parquet-cpp/pull/462



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1599) [Python] Unable to read Parquet files with list inside struct

2018-05-17 Thread Joshua Storck (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Storck reassigned ARROW-1599:


Assignee: Joshua Storck

> [Python] Unable to read Parquet files with list inside struct
> -
>
> Key: ARROW-1599
> URL: https://issues.apache.org/jira/browse/ARROW-1599
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Ubuntu
>Reporter: Jovann Kung
>Assignee: Joshua Storck
>Priority: Major
> Fix For: 0.10.0
>
>
> Is PyArrow currently unable to read in Parquet files with a vector as a 
> column? For example, the schema of such a file is below:
> {code}
> mbc: FLOAT
> deltae: FLOAT
> labels: FLOAT
> features.type: INT32 INT_8
> features.size: INT32
> features.indices.list.element: INT32
> features.values.list.element: DOUBLE
> {code}
> Using either pq.read_table() or pq.ParquetDataset('/path/to/parquet').read() 
> yields the following error: ArrowNotImplementedError: Currently only nesting 
> with Lists is supported.
> From the error I assume that this may be implemented in a future release?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2586) Make child builders of ListBuilder and StructBuilder shared_ptr's

2018-05-17 Thread Joshua Storck (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Storck reassigned ARROW-2586:


Assignee: Joshua Storck

> Make child builders of ListBuilder and StructBuilder shared_ptr's
> -
>
> Key: ARROW-2586
> URL: https://issues.apache.org/jira/browse/ARROW-2586
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Joshua Storck
>Assignee: Joshua Storck
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This is needed for changes in this PR that make it possible to deserialize 
> arbitrary nested structures in parquet (ARROW-1644): 
> https://github.com/apache/parquet-cpp/pull/462 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2586) Make child builders of ListBuilder and StructBuilder shared_ptr's

2018-05-15 Thread Joshua Storck (JIRA)
Joshua Storck created ARROW-2586:


 Summary: Make child builders of ListBuilder and StructBuilder 
shared_ptr's
 Key: ARROW-2586
 URL: https://issues.apache.org/jira/browse/ARROW-2586
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Joshua Storck


This is needed for changes in this PR that make it possible to deserialize 
arbitrary nested structures in parquet (ARROW-1644): 
https://github.com/apache/parquet-cpp/pull/462 
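
As a rough sketch of the usage this enables (written against today's Arrow C++ API; the names and error handling here are illustrative, not the PR itself): the parent builder shares ownership of its child, so callers can hold a typed handle to the child while appending through the parent.

{code:c++}
#include <memory>

#include <arrow/api.h>

// The caller keeps a typed shared_ptr to the value builder while the
// ListBuilder shares ownership of it.
arrow::Status BuildListOfInt64(std::shared_ptr<arrow::Array>* out) {
  auto* pool = arrow::default_memory_pool();
  auto values = std::make_shared<arrow::Int64Builder>(pool);
  arrow::ListBuilder list_builder(pool, values);

  ARROW_RETURN_NOT_OK(list_builder.Append());  // start the list slot [1, 2]
  ARROW_RETURN_NOT_OK(values->Append(1));
  ARROW_RETURN_NOT_OK(values->Append(2));
  return list_builder.Finish(out);
}
{code}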



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2585) Add Decimal128::FromBigEndian

2018-05-15 Thread Joshua Storck (JIRA)
Joshua Storck created ARROW-2585:


 Summary: Add Decimal128::FromBigEndian
 Key: ARROW-2585
 URL: https://issues.apache.org/jira/browse/ARROW-2585
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Joshua Storck


This code is being moved from 
https://github.com/apache/parquet-cpp/blob/8046481235e558344c3aa059c83ee86b9f67/src/parquet/arrow/reader.cc#L1049
 for use in this PR: https://github.com/apache/parquet-cpp/pull/462
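
For illustration, a minimal sketch of the call (this uses today's Result-returning signature from arrow/util/decimal.h; the signature at the time of the move may differ):

{code:c++}
#include <cstdint>
#include <iostream>

#include <arrow/util/decimal.h>

// Parquet stores FIXED_LEN_BYTE_ARRAY decimals as big-endian bytes;
// FromBigEndian reconstructs the 128-bit value. 0x04D2 == 1234,
// i.e. "12.34" at scale 2.
int main() {
  const uint8_t bytes[] = {0x04, 0xd2};
  auto result = arrow::Decimal128::FromBigEndian(bytes, sizeof(bytes));
  if (result.ok()) {
    std::cout << result->ToString(/*scale=*/2) << std::endl;
  }
  return 0;
}
{code}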



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1644) [Python] Read and write nested Parquet data with a mix of struct and list nesting levels

2018-05-12 Thread Joshua Storck (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473191#comment-16473191
 ] 

Joshua Storck commented on ARROW-1644:
--

The reading half of this issue is addressed by this: 
https://github.com/apache/parquet-cpp/pull/462. Perhaps we should split this 
into two separate issues?
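
For reference, here is a minimal sketch that builds the same shape (a list of structs, trimmed from the Spark schema quoted below) and round-trips it through Parquet. It assumes a pyarrow recent enough to convert Python dicts to struct arrays; on 0.8.0-era pyarrow the round trip fails with the ArrowNotImplementedError shown in the traceback below.

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# items: list<struct<show_title_id: int32, duration: double>>
items_type = pa.list_(pa.struct([
    pa.field('show_title_id', pa.int32()),
    pa.field('duration', pa.float64()),
]))
table = pa.Table.from_arrays(
    [
        pa.array([1], type=pa.int64()),
        pa.array([[{'show_title_id': 42, 'duration': 1.5}]], type=items_type),
    ],
    ['profile_id', 'items'],
)

pq.write_table(table, 'nested.parquet')
# On 0.8.0-era pyarrow, this nesting raised ArrowNotImplementedError.
table2 = pq.read_table('nested.parquet')
print(table2.schema)
{code}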

> [Python] Read and write nested Parquet data with a mix of struct and list 
> nesting levels
> 
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: DB Tsai
>Priority: Major
> Fix For: 0.10.0
>
>
> We have many nested parquet files generated from Apache Spark for ranking 
> problems, and we would like to load them in python for other programs to 
> consume. 
> The schema looks like 
> {code}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- show_title_id: integer (nullable = true)
>  |||-- duration: double (nullable = true)
> {code}
> And when I tried to load it with a nightly build of pyarrow from Oct 4, 2017,
> I got the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 823, in read_table
> use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 119, in read
> nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I was under the impression that once
> https://issues.apache.org/jira/browse/PARQUET-911 was merged, we would be
> able to load nested Parquet data in pyarrow.
> Any insight into this?
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1599) [Python] Unable to read Parquet files with list inside struct

2018-05-12 Thread Joshua Storck (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473189#comment-16473189
 ] 

Joshua Storck commented on ARROW-1599:
--

This PR should address this: https://github.com/apache/parquet-cpp/pull/462. 
[~JKung], could you possibly test out that version or provide a sample file 
that you are trying to read so that I can add it to the unit tests?
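
In the meantime, here is a rough sketch of generating a file with that shape locally (assumes a recent pyarrow; the field names mirror the schema in the issue description below, and the values are made up):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# features: struct<type: int8, size: int32,
#                  indices: list<int32>, values: list<double>>
features_type = pa.struct([
    pa.field('type', pa.int8()),
    pa.field('size', pa.int32()),
    pa.field('indices', pa.list_(pa.int32())),
    pa.field('values', pa.list_(pa.float64())),
])
table = pa.Table.from_arrays(
    [
        pa.array([1.0], type=pa.float32()),
        pa.array(
            [{'type': 0, 'size': 3, 'indices': [0, 2], 'values': [1.0, 2.0]}],
            type=features_type,
        ),
    ],
    ['labels', 'features'],
)
pq.write_table(table, 'list_inside_struct.parquet')

# On 0.7.0 this read raises the ArrowNotImplementedError quoted below.
pq.read_table('list_inside_struct.parquet')
{code}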

> [Python] Unable to read Parquet files with list inside struct
> -
>
> Key: ARROW-1599
> URL: https://issues.apache.org/jira/browse/ARROW-1599
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Ubuntu
>Reporter: Jovann Kung
>Priority: Major
> Fix For: 0.10.0
>
>
> Is PyArrow currently unable to read in Parquet files with a vector as a 
> column? For example, the schema of such a file is below:
> {code}
> mbc: FLOAT
> deltae: FLOAT
> labels: FLOAT
> features.type: INT32 INT_8
> features.size: INT32
> features.indices.list.element: INT32
> features.values.list.element: DOUBLE
> {code}
> Using either pq.read_table() or pq.ParquetDataset('/path/to/parquet').read() 
> yields the following error: ArrowNotImplementedError: Currently only nesting 
> with Lists is supported.
> From the error I assume that this may be implemented in a future release?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2497) Use ASSERT_NO_FATAL_FAILURE in C++ unit tests

2018-04-23 Thread Joshua Storck (JIRA)
Joshua Storck created ARROW-2497:


 Summary: Use ASSERT_NO_FATAL_FAILURE in C++ unit tests
 Key: ARROW-2497
 URL: https://issues.apache.org/jira/browse/ARROW-2497
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Joshua Storck


A number of unit tests have helper functions that use gtest/arrow ASSERT_ 
macros. Those ASSERT_ macros simply return out of the current function and do 
not throw exceptions or abort. Since these helper functions return void, the 
unit test simply continues when one of the assertions is triggered. This can 
lead to additional failures, such as segfaults, because the test goes on to 
execute code it did not expect to reach. By wrapping the calls to those helper 
functions in gtest's ASSERT_NO_FATAL_FAILURE at the outermost scope of the 
unit test, the test will terminate correctly.
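
A minimal gtest sketch of the pattern (names are illustrative, not from the Arrow codebase):

{code:c++}
#include <vector>

#include <gtest/gtest.h>

// A void helper that asserts internally. On failure, ASSERT_EQ returns from
// this helper only; the calling test would keep running unless it checks.
void CheckHasTwoElements(const std::vector<int>& values) {
  ASSERT_EQ(2u, values.size());
}

TEST(ExampleTest, StopsIfHelperFails) {
  std::vector<int> values{1, 2};
  // Propagates a fatal failure out of the helper into this scope, so the
  // test terminates here instead of running on with broken preconditions.
  ASSERT_NO_FATAL_FAILURE(CheckHasTwoElements(values));
  EXPECT_EQ(1, values.front());  // only reached if the helper passed
}
{code}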



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2459) pyarrow: Segfault with pyarrow.deserialize_pandas

2018-04-17 Thread Joshua Storck (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441371#comment-16441371
 ] 

Joshua Storck commented on ARROW-2459:
--

You are not using symmetric calls in app.py and test_request.py. Here's an 
example that works just fine:

{code:python}
import io

import pandas as pd
import pyarrow as pa

df = pd.DataFrame([dict(a=99, b=100.0), dict(a=5, b=77.77)])
print(df.to_string())

# Serialize: returns a pyarrow Buffer holding the DataFrame.
serialized_df = pa.serialize_pandas(df)

# Stand-in for sending the bytes over the wire and receiving them.
bb = io.BytesIO(serialized_df)

# Deserialize the received bytes symmetrically.
bb = pa.py_buffer(bb.getvalue())
df = pa.deserialize_pandas(bb)
print(df.to_string())
{code}

If that works, can I close this?

> pyarrow: Segfault with pyarrow.deserialize_pandas
> -
>
> Key: ARROW-2459
> URL: https://issues.apache.org/jira/browse/ARROW-2459
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: OS X, Linux
>Reporter: Travis Brady
>Priority: Major
>
> Following up from [https://github.com/apache/arrow/issues/1884] wherein I 
> found that calling deserialize_pandas in the linked app.py script in the repo 
> linked below causes the app.py process to segfault.
> I initially observed this on OS X, but have since confirmed that the behavior 
> exists on Linux as well.
> Repo containing example: [https://github.com/travisbrady/sanic-arrow] 
> And more generally: what is the right way to get a Java-based HTTP 
> microservice to talk to a Python-based HTTP microservice using Arrow as the 
> serialization format? I'm exchanging DataFrame type objects (they are 
> pandas.DataFrame's on the Python side) between the two services for real-time 
> scoring in a few xgboost models implemented in Python.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-2020) [Python] Parquet segfaults if coercing ns timestamps and writing 96-bit timestamps

2018-04-17 Thread Joshua Storck (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Storck closed ARROW-2020.

Resolution: Duplicate

> [Python] Parquet segfaults if coercing ns timestamps and writing 96-bit 
> timestamps
> --
>
> Key: ARROW-2020
> URL: https://issues.apache.org/jira/browse/ARROW-2020
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: OS: Mac OS X 10.13.2
> Python: 3.6.4
> PyArrow: 0.8.0
>Reporter: Diego Argueta
>Assignee: Joshua Storck
>Priority: Major
>  Labels: timestamps
> Fix For: 0.10.0
>
> Attachments: crash-report.txt
>
>
> If you try to write a PyArrow table containing nanosecond-resolution 
> timestamps to Parquet using `coerce_timestamps` and 
> `use_deprecated_int96_timestamps=True`, the Arrow library will segfault.
> The crash doesn't happen if you don't coerce the timestamp resolution or if 
> you don't use 96-bit timestamps.
>  
>  
> *To Reproduce:*
>  
> {code:python}
> import datetime
>
> import pyarrow
> from pyarrow import parquet
>
> schema = pyarrow.schema([
>     pyarrow.field('last_updated', pyarrow.timestamp('ns')),
> ])
> data = [
>     pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('ns')),
> ]
> table = pyarrow.Table.from_arrays(data, ['last_updated'])
> with open('test_file.parquet', 'wb') as fdesc:
>     parquet.write_table(table, fdesc,
>                         coerce_timestamps='us',  # 'ms' works too
>                         use_deprecated_int96_timestamps=True)
> {code}
>  
> See attached file for the crash report.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2020) [Python] Parquet segfaults if coercing ns timestamps and writing 96-bit timestamps

2018-04-17 Thread Joshua Storck (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441365#comment-16441365
 ] 

Joshua Storck commented on ARROW-2020:
--

This is a duplicate of ARROW-2082, which is resolved by [this 
PR|https://github.com/apache/parquet-cpp/pull/456]. [~wesmckinn] or [~cpcloud], 
could you please mark this as a duplicate? I don't seem to have permissions.

> [Python] Parquet segfaults if coercing ns timestamps and writing 96-bit 
> timestamps
> --
>
> Key: ARROW-2020
> URL: https://issues.apache.org/jira/browse/ARROW-2020
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: OS: Mac OS X 10.13.2
> Python: 3.6.4
> PyArrow: 0.8.0
>Reporter: Diego Argueta
>Assignee: Joshua Storck
>Priority: Major
>  Labels: timestamps
> Fix For: 0.10.0
>
> Attachments: crash-report.txt
>
>
> If you try to write a PyArrow table containing nanosecond-resolution 
> timestamps to Parquet using `coerce_timestamps` and 
> `use_deprecated_int96_timestamps=True`, the Arrow library will segfault.
> The crash doesn't happen if you don't coerce the timestamp resolution or if 
> you don't use 96-bit timestamps.
>  
>  
> *To Reproduce:*
>  
> {code:python}
> import datetime
>
> import pyarrow
> from pyarrow import parquet
>
> schema = pyarrow.schema([
>     pyarrow.field('last_updated', pyarrow.timestamp('ns')),
> ])
> data = [
>     pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('ns')),
> ]
> table = pyarrow.Table.from_arrays(data, ['last_updated'])
> with open('test_file.parquet', 'wb') as fdesc:
>     parquet.write_table(table, fdesc,
>                         coerce_timestamps='us',  # 'ms' works too
>                         use_deprecated_int96_timestamps=True)
> {code}
>  
> See attached file for the crash report.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2020) [Python] Parquet segfaults if coercing ns timestamps and writing 96-bit timestamps

2018-04-17 Thread Joshua Storck (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Storck reassigned ARROW-2020:


Assignee: Joshua Storck

> [Python] Parquet segfaults if coercing ns timestamps and writing 96-bit 
> timestamps
> --
>
> Key: ARROW-2020
> URL: https://issues.apache.org/jira/browse/ARROW-2020
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: OS: Mac OS X 10.13.2
> Python: 3.6.4
> PyArrow: 0.8.0
>Reporter: Diego Argueta
>Assignee: Joshua Storck
>Priority: Major
>  Labels: timestamps
> Fix For: 0.10.0
>
> Attachments: crash-report.txt
>
>
> If you try to write a PyArrow table containing nanosecond-resolution 
> timestamps to Parquet using `coerce_timestamps` and 
> `use_deprecated_int96_timestamps=True`, the Arrow library will segfault.
> The crash doesn't happen if you don't coerce the timestamp resolution or if 
> you don't use 96-bit timestamps.
>  
>  
> *To Reproduce:*
>  
> {code:python}
> import datetime
>
> import pyarrow
> from pyarrow import parquet
>
> schema = pyarrow.schema([
>     pyarrow.field('last_updated', pyarrow.timestamp('ns')),
> ])
> data = [
>     pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('ns')),
> ]
> table = pyarrow.Table.from_arrays(data, ['last_updated'])
> with open('test_file.parquet', 'wb') as fdesc:
>     parquet.write_table(table, fdesc,
>                         coerce_timestamps='us',  # 'ms' works too
>                         use_deprecated_int96_timestamps=True)
> {code}
>  
> See attached file for the crash report.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2393) [C++] arrow/status.h does not define ARROW_CHECK needed for ARROW_CHECK_OK

2018-04-16 Thread Joshua Storck (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439950#comment-16439950
 ] 

Joshua Storck commented on ARROW-2393:
--

I don't think the ARROW_CHECK_OK and ARROW_CHECK_OK_PREPEND macros should be in 
status.h. They use the logging facilities and should probably live in logging.h, 
which shouldn't be visible from public headers.

The interesting thing is that the RETURN_NOT_OK macros don't work outside of 
the arrow namespace. I think they need to be updated to use ::arrow::Status in 
their bodies.

[~wesmckinn], [~pitrou], or [~cpcloud], does that make sense? If so, I'll 
submit a PR.
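
To illustrate the namespace problem, a simplified sketch (this is not the actual macro body from status.h): a macro that names an unqualified Status only compiles inside namespace arrow, while a fully qualified ::arrow::Status works from user code in any namespace.

{code:c++}
#include <memory>

#include <arrow/api.h>

// Simplified stand-in for RETURN_NOT_OK with the proposed qualification.
#define MY_RETURN_NOT_OK(expr)       \
  do {                               \
    ::arrow::Status _s = (expr);     \
    if (!_s.ok()) {                  \
      return _s;                     \
    }                                \
  } while (false)

// User code outside namespace arrow can now use the macro.
::arrow::Status BuildInt64Array(std::shared_ptr<arrow::Array>* out) {
  arrow::Int64Builder builder;
  MY_RETURN_NOT_OK(builder.Append(42));
  return builder.Finish(out);
}
{code}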

> [C++] arrow/status.h does not define ARROW_CHECK needed for ARROW_CHECK_OK
> --
>
> Key: ARROW-2393
> URL: https://issues.apache.org/jira/browse/ARROW-2393
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: dennis lucero
>Priority: Trivial
>
> test.cpp
> {code:c++}
> #include <arrow/api.h>
>
> int main(void) {
>   arrow::Int64Builder i64builder;
>   std::shared_ptr<arrow::Array> i64array;
>   ARROW_CHECK_OK(i64builder.Finish());
>   return EXIT_SUCCESS;
> }
> {code}
> Attempt to build:
> {code:bash}
> $CXX test.cpp -std=c++11 -larrow
> {code}
> Error:
> {code}
> test.cpp:6:2: error: use of undeclared identifier 'ARROW_CHECK'
>   ARROW_CHECK_OK(i64builder.Finish());
>   ^
> xxx/include/arrow/status.h:49:27: note: expanded from macro 'ARROW_CHECK_OK'
>   #define ARROW_CHECK_OK(s) ARROW_CHECK_OK_PREPEND(s, "Bad status")
>                             ^
> xxx/include/arrow/status.h:44:5: note: expanded from macro 'ARROW_CHECK_OK_PREPEND'
>     ARROW_CHECK(_s.ok()) << (msg) << ": " << _s.ToString(); \
>     ^
> 1 error generated.
> {code}
> I expect the ARROW_* macros to be part of the public API and to work out of the box.
> A naive attempt to fix it
> {code}
> diff --git a/cpp/src/arrow/status.h b/cpp/src/arrow/status.h
> index 84f55e41..6da4a773 100644
> --- a/cpp/src/arrow/status.h
> +++ b/cpp/src/arrow/status.h
> @@ -25,6 +25,7 @@
>  #include "arrow/util/macros.h"
>  #include "arrow/util/visibility.h"
> +#include "arrow/util/logging.h"
>  // Return the given status if it is not OK.
>  #define ARROW_RETURN_NOT_OK(s)   \
> {code}
> fails with
> {code}
> public-api-test.cc:21:2: error: "DCHECK should not be visible from Arrow 
> public headers."
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2429) [Python] Timestamp unit in schema changes when writing to Parquet file then reading back

2018-04-16 Thread Joshua Storck (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439892#comment-16439892
 ] 

Joshua Storck commented on ARROW-2429:
--

If you invoke the write_table function as follows, the type will not change:

{code:python}
pq.write_table(table, 'foo.parquet', use_deprecated_int96_timestamps=True)
{code}


> [Python] Timestamp unit in schema changes when writing to Parquet file then 
> reading back
> 
>
> Key: ARROW-2429
> URL: https://issues.apache.org/jira/browse/ARROW-2429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
> Environment: Mac OS High Sierra
> PyArrow 0.9.0 (py36_1)
> Python
>Reporter: Dave Challis
>Priority: Minor
>
> When creating an Arrow table from a Pandas DataFrame, the table schema 
> contains a field of type `timestamp[ns]`.
> When serialising that table to a parquet file and then immediately reading it 
> back, the schema of the table read instead contains a field with type 
> `timestamp[us]`.
> Minimal example:
>  
> {code:python}
> #!/usr/bin/env python
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # create DataFrame with a datetime column
> df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
> df['created'] = pd.to_datetime(df['created'])
> # create Arrow table from DataFrame
> table = pa.Table.from_pandas(df, preserve_index=False)
> # write the table as a parquet file, then read it back again
> pq.write_table(table, 'foo.parquet')
> table2 = pq.read_table('foo.parquet')
> print(table.schema[0])   # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
> print(table2.schema[0])  # pyarrow.Field<created: timestamp[us]> (microsecond units)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options

2018-04-12 Thread Joshua Storck (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436439#comment-16436439
 ] 

Joshua Storck commented on ARROW-2082:
--

I did some debugging and isolated the issue. The column writer that is being 
created is int64 
(https://github.com/apache/parquet-cpp/blob/master-after-apache-parquet-cpp-1.4.0-rc1/src/parquet/column_writer.cc#L559),
 but the codepath taken for writing assumes int96 
(https://github.com/apache/parquet-cpp/blob/master-after-apache-parquet-cpp-1.4.0-rc1/src/parquet/arrow/writer.cc#L599).
 Unfortunately, there is a static_cast being made here: 
https://github.com/apache/parquet-cpp/blob/master-after-apache-parquet-cpp-1.4.0-rc1/src/parquet/arrow/writer.cc#L378

That last section of code does a static cast to the wrong type, which means 
it's writing 96 bits at a time when there's only space for 64. That's probably 
corrupting memory, and I have a feeling the location of the segfault depends 
on the input data.

I need to take a closer look and figure out the best course of action. I'm not 
fond of the static_cast. If we were using dynamic_cast here, we could at least 
assert in a debug build and/or check that the C types match between the 
writer_ and the value returned from the cast.

I suspect there is some mismatch between how the column metadata is 
initialized and how it is used in ArrayColumnWriter::WriteTimestamps.
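
Something like the following checked downcast is what I have in mind (a hypothetical helper, not parquet-cpp code; it assumes the writer base class is polymorphic):

{code:c++}
#include <cassert>

// Debug builds verify the downcast with dynamic_cast, catching a writer whose
// dynamic type doesn't match what the caller assumed. Release builds keep the
// cheap static_cast.
template <typename TypedWriter, typename BaseWriter>
TypedWriter* CheckedDowncast(BaseWriter* writer) {
#ifndef NDEBUG
  auto* typed = dynamic_cast<TypedWriter*>(writer);
  assert(typed != nullptr && "column writer has unexpected dynamic type");
  return typed;
#else
  return static_cast<TypedWriter*>(writer);
#endif
}
{code}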

 

> [Python] SegFault in pyarrow.parquet.write_table with specific options
> --
>
> Key: ARROW-2082
> URL: https://issues.apache.org/jira/browse/ARROW-2082
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: tested on MacOS High Sierra with python 3.6 and Ubuntu 
> Xenial (Python 3.5)
>Reporter: Clément Bouscasse
>Priority: Major
> Fix For: 0.10.0
>
>
> I originally filed an issue in the pandas project but we've tracked it down 
> to arrow itself, when called via pandas in specific circumstances:
> [https://github.com/pandas-dev/pandas/issues/19493]
> basically using
> {code:python}
> df.to_parquet('filename.parquet', flavor='spark')
> {code}
> gives a segfault if {{df}} contains a datetime column.
> Under the covers, pandas translates this to the following call:
> {code:python}
> pq.write_table(table, 'output.parquet', flavor='spark', compression='snappy',
>                coerce_timestamps='ms')
> {code}
> which gives me an instant crash.
> There is a repro on the github ticket.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (ARROW-1938) [Python] Error writing to partitioned Parquet dataset

2018-04-10 Thread Joshua Storck (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Storck updated ARROW-1938:
-
Comment: was deleted

(was: Bug fix in this PR: https://github.com/apache/parquet-cpp/pull/453)

> [Python] Error writing to partitioned Parquet dataset
> -
>
> Key: ARROW-1938
> URL: https://issues.apache.org/jira/browse/ARROW-1938
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Linux (Ubuntu 16.04)
>Reporter: Robert Dailey
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
> Attachments: ARROW-1938-test-data.csv.gz, ARROW-1938.py, 
> pyarrow_dataset_error.png
>
>
> I receive the following error after upgrading to pyarrow 0.8.0 when writing 
> to a dataset:
> * ArrowIOError: Column 3 had 187374 while previous column had 1
> The command was:
> {code:python}
> write_table_values = {'row_group_size': 1}
> pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=True),
>                     '/logs/parsed/test',
>                     partition_cols=['Product', 'year', 'month', 'day', 'hour'],
>                     **write_table_values)
> {code}
> I've also tried write_table_values = {'chunk_size': 1} and received the
> same error.
> This same command works in version 0.7.1.  I am trying to troubleshoot the 
> problem but wanted to submit a ticket.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1938) [Python] Error writing to partitioned Parquet dataset

2018-04-10 Thread Joshua Storck (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432773#comment-16432773
 ] 

Joshua Storck commented on ARROW-1938:
--

Bug fix in this PR: https://github.com/apache/parquet-cpp/pull/453

> [Python] Error writing to partitioned Parquet dataset
> -
>
> Key: ARROW-1938
> URL: https://issues.apache.org/jira/browse/ARROW-1938
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Linux (Ubuntu 16.04)
>Reporter: Robert Dailey
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
> Attachments: ARROW-1938-test-data.csv.gz, ARROW-1938.py, 
> pyarrow_dataset_error.png
>
>
> I receive the following error after upgrading to pyarrow 0.8.0 when writing 
> to a dataset:
> * ArrowIOError: Column 3 had 187374 while previous column had 1
> The command was:
> {code:python}
> write_table_values = {'row_group_size': 1}
> pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=True),
>                     '/logs/parsed/test',
>                     partition_cols=['Product', 'year', 'month', 'day', 'hour'],
>                     **write_table_values)
> {code}
> I've also tried write_table_values = {'chunk_size': 1} and received the
> same error.
> This same command works in version 0.7.1.  I am trying to troubleshoot the 
> problem but wanted to submit a ticket.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)