[jira] [Commented] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options

2018-04-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441198#comment-16441198
 ] 

ASF GitHub Bot commented on ARROW-2082:
---

xhochy commented on a change in pull request #456: ARROW-2082: Prevent segfault 
that was occurring when writing a nanosecond timestamp with arrow writer 
properties set to coerce timestamps and support deprecated int96 timestamps.
URL: https://github.com/apache/parquet-cpp/pull/456#discussion_r182158877
 
 

 ##
 File path: src/parquet/arrow/arrow-reader-writer-test.cc
 ##
 @@ -1403,6 +1403,56 @@ TEST(TestArrowReadWrite, ConvertedDateTimeTypes) {
   AssertTablesEqual(*ex_table, *result);
 }
 
+// Regression test for ARROW-2082
+TEST(TestArrowReadWrite, CoerceTimestampsAndSupportDeprecatedInt96) {
+  using namespace ::arrow;
 
 Review comment:
   Please list the types you use explicitly here, e.g. `using 
::arrow::TimestampType`. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] SegFault in pyarrow.parquet.write_table with specific options
> --
>
> Key: ARROW-2082
> URL: https://issues.apache.org/jira/browse/ARROW-2082
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: tested on MacOS High Sierra with python 3.6 and Ubuntu 
> Xenial (Python 3.5)
>Reporter: Clément Bouscasse
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> I originally filed an issue in the pandas project but we've tracked it down 
> to arrow itself, when called via pandas in specific circumstances:
> [https://github.com/pandas-dev/pandas/issues/19493]
> Basically, using
> {code:java}
>  df.to_parquet('filename.parquet', flavor='spark'){code}
> gives a segfault if `df` contains a datetime column.
> Under the covers, pandas translates this to the following call:
> {code:java}
> pq.write_table(table, 'output.parquet', flavor='spark', compression='snappy', 
> coerce_timestamps='ms')
> {code}
> which gives me an instant crash.
> There is a repro on the GitHub ticket.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options

2018-04-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439852#comment-16439852
 ] 

ASF GitHub Bot commented on ARROW-2082:
---

joshuastorck opened a new pull request #456: ARROW-2082: Prevent segfault that 
was occurring when writing a nanosecond timestamp with arrow writer properties 
set to coerce timestamps and support deprecated int96 timestamps.
URL: https://github.com/apache/parquet-cpp/pull/456
 
 
   The bug was due to the fact that the physical type was int64, but the 
WriteTimestamps function was taking a path that assumed the physical type was 
int96. This caused memory corruption because it was writing past the end of the 
array. The bug was fixed by checking that coerce timestamps is disabled before 
taking the int96 write path. 
   
   A unit test was added for the regression.




[jira] [Commented] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options

2018-04-12 Thread Joshua Storck (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436439#comment-16436439
 ] 

Joshua Storck commented on ARROW-2082:
--

I did some debugging and isolated the issue. The column writer that is being 
created is int64 
(https://github.com/apache/parquet-cpp/blob/master-after-apache-parquet-cpp-1.4.0-rc1/src/parquet/column_writer.cc#L559), 
but the code path taken for writing assumes int96 
(https://github.com/apache/parquet-cpp/blob/master-after-apache-parquet-cpp-1.4.0-rc1/src/parquet/arrow/writer.cc#L599). 
Unfortunately, there is a static_cast that is being made here: 
https://github.com/apache/parquet-cpp/blob/master-after-apache-parquet-cpp-1.4.0-rc1/src/parquet/arrow/writer.cc#L378
 
 

That last section of code does a static_cast to the wrong type. That means it's 
writing 96 bits at a time when there's only space for 64 bits at a time. That's 
probably corrupting memory. I have a feeling the location of the segfault would 
be dependent on the input data.

I need to take a closer look and figure out what the best course of action is 
here. I'm not fond of the static_cast. If we were using dynamic_cast here, at 
least we could put an assertion in a debug build and/or check to make sure the 
C types match between the writer_ and the value returned from the static_cast.

I suspect there is some mismatch between how the column metadata is initialized 
and how it is used in ArrayColumnWriter::WriteTimestamps.

 



[jira] [Commented] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options

2018-03-12 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396525#comment-16396525
 ] 

Wes McKinney commented on ARROW-2082:
-

Moved to 0.10.0. My best guess is that this bug lies in parquet-cpp. Let's try 
to fix this soon.



[jira] [Commented] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options

2018-03-12 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396524#comment-16396524
 ] 

Wes McKinney commented on ARROW-2082:
-

I haven't found the bad memory access yet -- a pointer has gotten corrupted and 
the segfault is happening there:

{code}
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x7268f769 in arrow::PoolBuffer::Reserve (this=0xdfad80, capacity=1024) 
at ../src/arrow/buffer.cc:101
101   RETURN_NOT_OK(pool_->Allocate(new_capacity, &new_data));
(gdb) p pool_
$1 = (arrow::MemoryPool *) 0x72d74300 
(gdb) p ::arrow::default_memory_pool()
$2 = (arrow::MemoryPool *) 0x72d74310 

{code}

Those memory addresses should be the same. Here is a minimal script repro'ing 
the error:

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from io import StringIO

content = StringIO(
"""report_time_of_generation
2018-02-01 03:44:29
2018-02-01 03:44:29
2018-02-01 03:44:29
2018-02-01 03:44:29
2018-02-01 03:44:29
""")

dfs = pd.read_csv(content, parse_dates=['report_time_of_generation'])

table = pa.Table.from_pandas(dfs)

pq.write_table(table, 'output.parquet', flavor='spark', compression='snappy',
   coerce_timestamps='ms')
{code}



[jira] [Commented] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options

2018-03-12 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396514#comment-16396514
 ] 

Wes McKinney commented on ARROW-2082:
-

Here's the backtrace for this:

{code}
#0  0x7fffece34769 in arrow::PoolBuffer::Reserve (this=0x139c180, 
capacity=1024) at ../src/arrow/buffer.cc:101
#1  0x7fffece34b2f in arrow::PoolBuffer::Resize (this=0x139c180, 
new_size=1024, shrink_to_fit=true) at ../src/arrow/buffer.cc:112
#2  0x7fffcb5fc506 in parquet::AllocateBuffer (pool=0x7fffed519300 
, size=1024) at ../src/parquet/util/memory.cc:501
#3  0x7fffcb5fc75e in parquet::InMemoryOutputStream::InMemoryOutputStream 
(this=0x1487090, pool=0x7fffed519300 , initial_capacity=1024) at 
../src/parquet/util/memory.cc:423
#4  0x7fffcb5335ca in 
parquet::PlainEncoder >::PlainEncoder 
(this=0x7fff9170, descr=0x1104060, pool=0x7fffed519300 )
at ../src/parquet/encoding-internal.h:188
#5  0x7fffcb5defa2 in 
parquet::TypedRowGroupStatistics 
>::PlainEncode (this=0xbbee60, src=@0xbbeec8: -729020189051312384, 
dst=0x7fff9258)
at ../src/parquet/statistics.cc:228
#6  0x7fffcb5def07 in 
parquet::TypedRowGroupStatistics 
>::EncodeMin (this=0xbbee60) at ../src/parquet/statistics.cc:204
#7  0x7fffcb5df1c3 in 
parquet::TypedRowGroupStatistics 
>::Encode (this=0xbbee60) at ../src/parquet/statistics.cc:219
#8  0x7fffcb5348f7 in 
parquet::TypedColumnWriter 
>::GetPageStatistics (this=0x81d2b0) at ../src/parquet/column_writer.cc:520
#9  0x7fffcb52ca76 in parquet::ColumnWriter::AddDataPage (this=0x81d2b0) at 
../src/parquet/column_writer.cc:386
#10 0x7fffcb52c0eb in parquet::ColumnWriter::FlushBufferedDataPages 
(this=0x81d2b0) at ../src/parquet/column_writer.cc:447
#11 0x7fffcb52ddb0 in parquet::ColumnWriter::Close (this=0x81d2b0) at 
../src/parquet/column_writer.cc:431
#12 0x7fffcb4d6657 in parquet::arrow::(anonymous 
namespace)::ArrowColumnWriter::Close (this=0x7fff9b48) at 
../src/parquet/arrow/writer.cc:347
#13 0x7fffcb4e758e in parquet::arrow::FileWriter::Impl::WriteColumnChunk 
(this=0x15adee0, data=warning: RTTI symbol not found for class 
'std::_Sp_counted_ptr_inplace, (__gnu_cxx::_Lock_policy)2>'
warning: RTTI symbol not found for class 
'std::_Sp_counted_ptr_inplace, (__gnu_cxx::_Lock_policy)2>'
std::shared_ptr (count 2, weak 0) 0x1717cc0, offset=0, size=5)
at ../src/parquet/arrow/writer.cc:982
#14 0x7fffcb4d507b in parquet::arrow::FileWriter::WriteColumnChunk 
(this=0x125bc30, data=warning: RTTI symbol not found for class 
'std::_Sp_counted_ptr_inplace, (__gnu_cxx::_Lock_policy)2>'
warning: RTTI symbol not found for class 
'std::_Sp_counted_ptr_inplace, (__gnu_cxx::_Lock_policy)2>'
std::shared_ptr (count 2, weak 0) 0x1717cc0, offset=0, size=5)
at ../src/parquet/arrow/writer.cc:1011
#15 0x7fffcb4d5ba6 in parquet::arrow::FileWriter::WriteTable 
(this=0x125bc30, table=..., chunk_size=5) at ../src/parquet/arrow/writer.cc:1086
{code}

Not sure what's going wrong yet.
