[jira] [Commented] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options
[ https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441198#comment-16441198 ]

ASF GitHub Bot commented on ARROW-2082:
---

xhochy commented on a change in pull request #456: ARROW-2082: Prevent segfault that was occurring when writing a nanosecond timestamp with arrow writer properties set to coerce timestamps and support deprecated int96 timestamps.
URL: https://github.com/apache/parquet-cpp/pull/456#discussion_r182158877

## File path: src/parquet/arrow/arrow-reader-writer-test.cc

## @@ -1403,6 +1403,56 @@ TEST(TestArrowReadWrite, ConvertedDateTimeTypes) {
   AssertTablesEqual(*ex_table, *result);
 }
 
+// Regression for ARROW-2082
+TEST(TestArrowReadWrite, CoerceTimestampsAndSupportDeprecatedInt96) {
+  using namespace ::arrow;

Review comment: Please list the types you use explicitly here, e.g. `using ::arrow::TimestampType`.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> [Python] SegFault in pyarrow.parquet.write_table with specific options
> ----------------------------------------------------------------------
>
>                 Key: ARROW-2082
>                 URL: https://issues.apache.org/jira/browse/ARROW-2082
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>        Environment: tested on MacOS High Sierra with Python 3.6 and Ubuntu Xenial (Python 3.5)
>            Reporter: Clément Bouscasse
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.10.0
>
>
> I originally filed an issue in the pandas project, but we've tracked it down to Arrow itself, when called via pandas in specific circumstances: [https://github.com/pandas-dev/pandas/issues/19493]
> Basically, using
> {code:java}
> df.to_parquet('filename.parquet', flavor='spark')
> {code}
> gives a segfault if `df` contains a datetime column.
> Under the covers, pandas translates this to the following call:
> {code:java}
> pq.write_table(table, 'output.parquet', flavor='spark', compression='snappy', coerce_timestamps='ms')
> {code}
> which gives me an instant crash. There is a repro on the GitHub ticket.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options
[ https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439852#comment-16439852 ]

ASF GitHub Bot commented on ARROW-2082:
---

joshuastorck opened a new pull request #456: ARROW-2082: Prevent segfault that was occurring when writing a nanosecond timestamp with arrow writer properties set to coerce timestamps and support deprecated int96 timestamps.
URL: https://github.com/apache/parquet-cpp/pull/456

The bug was due to the fact that the physical type was int64, but the WriteTimestamps function was taking a path that assumed the physical type was int96. This caused memory corruption because it wrote past the end of the array. The fix checks that timestamp coercion is disabled before taking the int96 write path. A unit test was added for the regression.
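The guard described in the fix can be sketched as follows. This is a hypothetical Python illustration, not the parquet-cpp change itself; the function and flag names are invented for clarity:

```python
# Hypothetical sketch of the fix's guard: only take the 12-byte int96
# write path when timestamps are NOT being coerced; otherwise fall back
# to the 8-byte int64 path that the column writer actually allocated.
def choose_write_path(support_deprecated_int96: bool,
                      coerce_timestamps: bool) -> str:
    """Pick the physical write path for a nanosecond timestamp column."""
    if support_deprecated_int96 and not coerce_timestamps:
        return "int96"
    return "int64"

# The combination that used to segfault now routes to the int64 path:
assert choose_write_path(support_deprecated_int96=True,
                         coerce_timestamps=True) == "int64"
assert choose_write_path(support_deprecated_int96=True,
                         coerce_timestamps=False) == "int96"
```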
[jira] [Commented] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options
[ https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436439#comment-16436439 ]

Joshua Storck commented on ARROW-2082:
--

I did some debugging and isolated the issue. The column writer that is being created is int64 (https://github.com/apache/parquet-cpp/blob/master-after-apache-parquet-cpp-1.4.0-rc1/src/parquet/column_writer.cc#L559), but the codepath taken for writing assumes int96 (https://github.com/apache/parquet-cpp/blob/master-after-apache-parquet-cpp-1.4.0-rc1/src/parquet/arrow/writer.cc#L599). Unfortunately, there is a static_cast being made here: https://github.com/apache/parquet-cpp/blob/master-after-apache-parquet-cpp-1.4.0-rc1/src/parquet/arrow/writer.cc#L378. That last section of code does a static cast to the wrong type, which means it is writing 96 bits at a time when there is only space for 64 bits at a time. That is probably what corrupts memory, and I have a feeling the location of the segfault depends on the input data.

I need to take a closer look and figure out the best course of action here. I'm not fond of the static_cast. If we were using dynamic_cast, we could at least add an assertion in a debug build and/or check that the C types match between the writer_ and the value returned from the cast. I suspect there is some mismatch between how the column metadata is initialized and how it is used in ArrayColumnWriter::WriteTimestamps.
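The size mismatch described above can be made concrete with simple arithmetic (a hypothetical illustration, not parquet-cpp code): int96 values occupy 12 bytes each, int64 values 8, so an int96 code path writing into an int64-sized buffer runs past its end.

```python
# Hypothetical illustration of the overrun: a buffer sized for int64
# values (8 bytes each) receives int96 encodings (12 bytes each).
n_values = 5                   # the repro table has 5 rows
int64_buffer = n_values * 8    # bytes reserved by the int64 column writer
int96_payload = n_values * 12  # bytes emitted by the int96 code path

overrun = int96_payload - int64_buffer
assert overrun == 20  # 4 extra bytes per value spill past the buffer
```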
[jira] [Commented] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options
[ https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396525#comment-16396525 ]

Wes McKinney commented on ARROW-2082:
-

Moved to 0.10.0. My best guess is that this bug lies in parquet-cpp. Let's try to fix this soon.
[jira] [Commented] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options
[ https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396524#comment-16396524 ]

Wes McKinney commented on ARROW-2082:
-

I haven't found the bad memory access yet -- a pointer has gotten corrupted and the segfault is happening there:

{code}
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x7268f769 in arrow::PoolBuffer::Reserve (this=0xdfad80, capacity=1024) at ../src/arrow/buffer.cc:101
101         RETURN_NOT_OK(pool_->Allocate(new_capacity, &new_data));
(gdb) p pool_
$1 = (arrow::MemoryPool *) 0x72d74300
(gdb) p ::arrow::default_memory_pool()
$2 = (arrow::MemoryPool *) 0x72d74310
{code}

Those memory addresses should be the same. Here is a minimal script repro'ing the error:

{code:language=python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

from io import StringIO

content = StringIO("""report_time_of_generation
2018-02-01 03:44:29
2018-02-01 03:44:29
2018-02-01 03:44:29
2018-02-01 03:44:29
2018-02-01 03:44:29
""")

dfs = pd.read_csv(content, parse_dates=['report_time_of_generation'])

table = pa.Table.from_pandas(dfs)

pq.write_table(table, 'output.parquet', flavor='spark', compression='snappy', coerce_timestamps='ms')
{code}
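For context on the crashing option: `coerce_timestamps='ms'` asks the writer to downscale pandas' nanosecond-precision timestamps to milliseconds before encoding. A minimal sketch of that scaling in plain Python (an illustration, not the parquet-cpp implementation):

```python
from datetime import datetime, timezone

# The repro's timestamp, as nanoseconds since the Unix epoch
# (pandas' native datetime64[ns] representation).
ts = datetime(2018, 2, 1, 3, 44, 29, tzinfo=timezone.utc)
ns_value = int(ts.timestamp()) * 1_000_000_000

# coerce_timestamps='ms' scales the stored integers down to milliseconds.
ms_value = ns_value // 1_000_000
assert ms_value == 1_517_456_669_000
```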
[jira] [Commented] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options
[ https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396514#comment-16396514 ]

Wes McKinney commented on ARROW-2082:
-

Here's the backtrace for this:

{code}
#0  0x7fffece34769 in arrow::PoolBuffer::Reserve (this=0x139c180, capacity=1024) at ../src/arrow/buffer.cc:101
#1  0x7fffece34b2f in arrow::PoolBuffer::Resize (this=0x139c180, new_size=1024, shrink_to_fit=true) at ../src/arrow/buffer.cc:112
#2  0x7fffcb5fc506 in parquet::AllocateBuffer (pool=0x7fffed519300 , size=1024) at ../src/parquet/util/memory.cc:501
#3  0x7fffcb5fc75e in parquet::InMemoryOutputStream::InMemoryOutputStream (this=0x1487090, pool=0x7fffed519300 , initial_capacity=1024) at ../src/parquet/util/memory.cc:423
#4  0x7fffcb5335ca in parquet::PlainEncoder >::PlainEncoder (this=0x7fff9170, descr=0x1104060, pool=0x7fffed519300 ) at ../src/parquet/encoding-internal.h:188
#5  0x7fffcb5defa2 in parquet::TypedRowGroupStatistics >::PlainEncode (this=0xbbee60, src=@0xbbeec8: -729020189051312384, dst=0x7fff9258) at ../src/parquet/statistics.cc:228
#6  0x7fffcb5def07 in parquet::TypedRowGroupStatistics >::EncodeMin (this=0xbbee60) at ../src/parquet/statistics.cc:204
#7  0x7fffcb5df1c3 in parquet::TypedRowGroupStatistics >::Encode (this=0xbbee60) at ../src/parquet/statistics.cc:219
#8  0x7fffcb5348f7 in parquet::TypedColumnWriter >::GetPageStatistics (this=0x81d2b0) at ../src/parquet/column_writer.cc:520
#9  0x7fffcb52ca76 in parquet::ColumnWriter::AddDataPage (this=0x81d2b0) at ../src/parquet/column_writer.cc:386
#10 0x7fffcb52c0eb in parquet::ColumnWriter::FlushBufferedDataPages (this=0x81d2b0) at ../src/parquet/column_writer.cc:447
#11 0x7fffcb52ddb0 in parquet::ColumnWriter::Close (this=0x81d2b0) at ../src/parquet/column_writer.cc:431
#12 0x7fffcb4d6657 in parquet::arrow::(anonymous namespace)::ArrowColumnWriter::Close (this=0x7fff9b48) at ../src/parquet/arrow/writer.cc:347
#13 0x7fffcb4e758e in parquet::arrow::FileWriter::Impl::WriteColumnChunk (this=0x15adee0, data=std::shared_ptr (count 2, weak 0) 0x1717cc0, offset=0, size=5) at ../src/parquet/arrow/writer.cc:982
#14 0x7fffcb4d507b in parquet::arrow::FileWriter::WriteColumnChunk (this=0x125bc30, data=std::shared_ptr (count 2, weak 0) 0x1717cc0, offset=0, size=5) at ../src/parquet/arrow/writer.cc:1011
#15 0x7fffcb4d5ba6 in parquet::arrow::FileWriter::WriteTable (this=0x125bc30, table=..., chunk_size=5) at ../src/parquet/arrow/writer.cc:1086
{code}

Not sure what's going wrong yet.