[ 
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Knupp updated ARROW-1167:
------------------------------
    Description: 
When writing a pyarrow Table (created from a Pandas dataframe that was read from 
a ~5GB CSV file) to a Parquet file, the interpreter dumps core with the following 
stack trace from gdb:

{code}
#0  __memmove_avx_unaligned () at 
../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:181
#1  0x00007fbaa5c779f1 in parquet::InMemoryOutputStream::Write(unsigned char 
const*, long) () from 
/home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#2  0x00007fbaa5c0ce97 in 
parquet::PlainEncoder<parquet::DataType<(parquet::Type::type)6> 
>::Put(parquet::ByteArray const*, int) ()
   from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#3  0x00007fbaa5c18855 in 
parquet::TypedColumnWriter<parquet::DataType<(parquet::Type::type)6> 
>::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray const*) 
()
   from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#4  0x00007fbaa5c189d5 in 
parquet::TypedColumnWriter<parquet::DataType<(parquet::Type::type)6> 
>::WriteBatch(long, short const*, short const*, parquet::ByteArray const*) ()
   from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#5  0x00007fbaa5be0900 in arrow::Status 
parquet::arrow::FileWriter::Impl::TypedWriteBatch<parquet::DataType<(parquet::Type::type)6>,
 arrow::BinaryType>(parquet::ColumnWriter*, std::shared_ptr<arrow::Array> 
const&, long, short const*, short const*) () from 
/home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#6  0x00007fbaa5be171d in 
parquet::arrow::FileWriter::Impl::WriteColumnChunk(arrow::Array const&) () from 
/home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#7  0x00007fbaa5be1dad in 
parquet::arrow::FileWriter::WriteColumnChunk(arrow::Array const&) () from 
/home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#8  0x00007fbaa5be2047 in parquet::arrow::FileWriter::WriteTable(arrow::Table 
const&, long) () from 
/home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#9  0x00007fbaa51e1f53 in 
__pyx_pw_7pyarrow_8_parquet_13ParquetWriter_5write_table(_object*, _object*, 
_object*) ()
   from 
/home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/_parquet.cpython-35m-x86_64-linux-gnu.so
#10 0x00000000004e9bc7 in PyCFunction_Call () at ../Objects/methodobject.c:98
#11 0x0000000000529885 in do_call (nk=<optimized out>, na=<optimized out>, 
pp_stack=0x7ffe6510a6c0, func=<optimized out>) at ../Python/ceval.c:4933
#12 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510a6c0) at 
../Python/ceval.c:4732
#13 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
#14 0x000000000052d2e3 in _PyEval_EvalCodeWithName () at ../Python/ceval.c:4018
#15 0x0000000000528eee in fast_function (nk=<optimized out>, na=<optimized 
out>, n=<optimized out>, pp_stack=0x7ffe6510a8d0, func=<optimized out>) at 
../Python/ceval.c:4813
#16 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510a8d0) at 
../Python/ceval.c:4730
#17 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
#18 0x000000000052d2e3 in _PyEval_EvalCodeWithName () at ../Python/ceval.c:4018
#19 0x0000000000528eee in fast_function (nk=<optimized out>, na=<optimized 
out>, n=<optimized out>, pp_stack=0x7ffe6510aae0, func=<optimized out>) at 
../Python/ceval.c:4813
#20 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510aae0) at 
../Python/ceval.c:4730
#21 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
#22 0x0000000000528814 in fast_function (nk=<optimized out>, na=<optimized 
out>, n=<optimized out>, pp_stack=0x7ffe6510ac10, func=<optimized out>) at 
../Python/ceval.c:4803
#23 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510ac10) at 
../Python/ceval.c:4730
#24 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
#25 0x0000000000528814 in fast_function (nk=<optimized out>, na=<optimized 
out>, n=<optimized out>, pp_stack=0x7ffe6510ad40, func=<optimized out>) at 
../Python/ceval.c:4803
#26 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510ad40) at 
../Python/ceval.c:4730
#27 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
#28 0x000000000052d2e3 in _PyEval_EvalCodeWithName () at ../Python/ceval.c:4018
#29 0x000000000052dfdf in PyEval_EvalCodeEx () at ../Python/ceval.c:4039
#30 PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, 
locals=<optimized out>) at ../Python/ceval.c:777
#31 0x00000000005fd2c2 in run_mod () at ../Python/pythonrun.c:976
#32 0x00000000005ff76a in PyRun_FileExFlags () at ../Python/pythonrun.c:929
#33 0x00000000005ff95c in PyRun_SimpleFileExFlags () at 
../Python/pythonrun.c:396
#34 0x000000000063e7d6 in run_file (p_cf=0x7ffe6510afb0, filename=0x2161260 
L"scripts/parquet_export.py", fp=0x226fde0) at ../Modules/main.c:318
#35 Py_Main () at ../Modules/main.c:768
#36 0x00000000004cfe41 in main () at ../Programs/python.c:65
#37 0x00007fbadf0db830 in __libc_start_main (main=0x4cfd60 <main>, argc=2, 
argv=0x7ffe6510b1c8, init=<optimized out>, fini=<optimized out>, 
rtld_fini=<optimized out>, stack_end=0x7ffe6510b1b8)
    at ../csu/libc-start.c:291
#38 0x00000000005d5f29 in _start ()
{code}

This occurs in a pretty vanilla call to {{pq.write_table(table, output)}}. Before 
the crash I can print the table's schema, and it looks a little odd: every column 
is explicitly specified as a string in {{pandas.read_csv()}}, yet several columns 
show a "mixed" pandas_type in the metadata below.

{code}
_id: string
ref_id: string
ref_no: string
stage: string
stage2_ref_id: string
org_id: string
classification: string
solicitation_no: string
notice_type: string
business_category: string
procurement_mode: string
funding_instrument: string
funding_source: string
approved_budget: string
publish_date: string
closing_date: string
contract_duration: string
calendar_type: string
trade_agreement: string
pre_bid_date: string
pre_bid_venue: string
procuring_entity_org_id: string
procuring_entity_org: string
client_agency_org_id: string
client_agency_org: string
contact_person: string
contact_person_address: string
tender_title: string
description: string
other_info: string
reason: string
created_by: string
creation_date: string
modified_date: string
special_instruction: string
collection_contact: string
tender_status: string
collection_point: string
date_available: string
serialid: string
__index_level_0__: int64
-- metadata --
pandas: {"index_columns": ["__index_level_0__"], "columns": [{"pandas_type": 
"unicode", "numpy_type": "object", "metadata": null, "name": "_id"}, 
{"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
"ref_id"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, 
"name": "ref_no"}, {"pandas_type": "unicode", "numpy_type": "object", 
"metadata": null, "name": "stage"}, {"pandas_type": "mixed", "numpy_type": 
"object", "metadata": null, "name": "stage2_ref_id"}, {"pandas_type": 
"unicode", "numpy_type": "object", "metadata": null, "name": "org_id"}, 
{"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
"classification"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": 
null, "name": "solicitation_no"}, {"pandas_type": "unicode", "numpy_type": 
"object", "metadata": null, "name": "notice_type"}, {"pandas_type": "unicode", 
"numpy_type": "object", "metadata": null, "name": "business_category"}, 
{"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
"procurement_mode"}, {"pandas_type": "mixed", "numpy_type": "object", 
"metadata": null, "name": "funding_instrument"}, {"pandas_type": "unicode", 
"numpy_type": "object", "metadata": null, "name": "funding_source"}, 
{"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
"approved_budget"}, {"pandas_type": "mixed", "numpy_type": "object", 
"metadata": null, "name": "publish_date"}, {"pandas_type": "mixed", 
"numpy_type": "object", "metadata": null, "name": "closing_date"}, 
{"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
"contract_duration"}, {"pandas_type": "mixed", "numpy_type": "object", 
"metadata": null, "name": "calendar_type"}, {"pandas_type": "unicode", 
"numpy_type": "object", "metadata": null, "name": "trade_agreement"}, 
{"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": 
"pre_bid_date"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": 
null, "name": "pre_bid_venue"}, {"pandas_type": "unicode", "numpy_type": 
"object", "metadata": null, "name": "procuring_entity_org_id"}, {"pandas_type": 
"unicode", "numpy_type": "object", "metadata": null, "name": 
"procuring_entity_org"}, {"pandas_type": "unicode", "numpy_type": "object", 
"metadata": null, "name": "client_agency_org_id"}, {"pandas_type": "mixed", 
"numpy_type": "object", "metadata": null, "name": "client_agency_org"}, 
{"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": 
"contact_person"}, {"pandas_type": "unicode", "numpy_type": "object", 
"metadata": null, "name": "contact_person_address"}, {"pandas_type": "mixed", 
"numpy_type": "object", "metadata": null, "name": "tender_title"}, 
{"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": 
"description"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": 
null, "name": "other_info"}, {"pandas_type": "mixed", "numpy_type": "object", 
"metadata": null, "name": "reason"}, {"pandas_type": "unicode", "numpy_type": 
"object", "metadata": null, "name": "created_by"}, {"pandas_type": "unicode", 
"numpy_type": "object", "metadata": null, "name": "creation_date"}, 
{"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
"modified_date"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": 
null, "name": "special_instruction"}, {"pandas_type": "mixed", "numpy_type": 
"object", "metadata": null, "name": "collection_contact"}, {"pandas_type": 
"mixed", "numpy_type": "object", "metadata": null, "name": "tender_status"}, 
{"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": 
"collection_point"}, {"pandas_type": "mixed", "numpy_type": "object", 
"metadata": null, "name": "date_available"}, {"pandas_type": "unicode", 
"numpy_type": "object", "metadata": null, "name": "serialid"}, {"pandas_type": 
"int64", "numpy_type": "int64", "metadata": null, "name": 
"__index_level_0__"}], "pandas_version": "0.19.2"}
Segmentation fault (core dumped)
{code}
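
For reference, the failing path boils down to a script along these lines (file 
names here are placeholders; {{dtype=str}} mirrors the all-strings 
{{pandas.read_csv()}} setup described above):

{code}
# Minimal reproduction sketch -- hypothetical file names, same call sequence
# as described above.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Force every column to string; missing values still come back as float NaN,
# which likely explains the "mixed" pandas_type entries in the schema metadata.
df = pd.read_csv("input.csv", dtype=str)

table = pa.Table.from_pandas(df)   # schema prints fine at this point
print(table.schema)

pq.write_table(table, "output.parquet")   # interpreter segfaults here
{code}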

  was:
When writing a pyarrow Table (instantiated from a Pandas dataframe reading in a 
~5GB CSV file) to a parquet file, the interpreter cores with the following 
stack trace from gdb:

```
#0  __memmove_avx_unaligned () at 
../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:181
#1  0x00007fbaa5c779f1 in parquet::InMemoryOutputStream::Write(unsigned char 
const*, long) () from 
/home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#2  0x00007fbaa5c0ce97 in 
parquet::PlainEncoder<parquet::DataType<(parquet::Type::type)6> 
>::Put(parquet::ByteArray const*, int) ()
   from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#3  0x00007fbaa5c18855 in 
parquet::TypedColumnWriter<parquet::DataType<(parquet::Type::type)6> 
>::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray const*) 
()
   from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#4  0x00007fbaa5c189d5 in 
parquet::TypedColumnWriter<parquet::DataType<(parquet::Type::type)6> 
>::WriteBatch(long, short const*, short const*, parquet::ByteArray const*) ()
   from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#5  0x00007fbaa5be0900 in arrow::Status 
parquet::arrow::FileWriter::Impl::TypedWriteBatch<parquet::DataType<(parquet::Type::type)6>,
 arrow::BinaryType>(parquet::ColumnWriter*, std::shared_ptr<arrow::Array> 
const&, long, short const*, short const*) () from 
/home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#6  0x00007fbaa5be171d in 
parquet::arrow::FileWriter::Impl::WriteColumnChunk(arrow::Array const&) () from 
/home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#7  0x00007fbaa5be1dad in 
parquet::arrow::FileWriter::WriteColumnChunk(arrow::Array const&) () from 
/home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#8  0x00007fbaa5be2047 in parquet::arrow::FileWriter::WriteTable(arrow::Table 
const&, long) () from 
/home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
#9  0x00007fbaa51e1f53 in 
__pyx_pw_7pyarrow_8_parquet_13ParquetWriter_5write_table(_object*, _object*, 
_object*) ()
   from 
/home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/_parquet.cpython-35m-x86_64-linux-gnu.so
#10 0x00000000004e9bc7 in PyCFunction_Call () at ../Objects/methodobject.c:98
#11 0x0000000000529885 in do_call (nk=<optimized out>, na=<optimized out>, 
pp_stack=0x7ffe6510a6c0, func=<optimized out>) at ../Python/ceval.c:4933
#12 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510a6c0) at 
../Python/ceval.c:4732
#13 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
#14 0x000000000052d2e3 in _PyEval_EvalCodeWithName () at ../Python/ceval.c:4018
#15 0x0000000000528eee in fast_function (nk=<optimized out>, na=<optimized 
out>, n=<optimized out>, pp_stack=0x7ffe6510a8d0, func=<optimized out>) at 
../Python/ceval.c:4813
#16 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510a8d0) at 
../Python/ceval.c:4730
#17 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
#18 0x000000000052d2e3 in _PyEval_EvalCodeWithName () at ../Python/ceval.c:4018
#19 0x0000000000528eee in fast_function (nk=<optimized out>, na=<optimized 
out>, n=<optimized out>, pp_stack=0x7ffe6510aae0, func=<optimized out>) at 
../Python/ceval.c:4813
#20 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510aae0) at 
../Python/ceval.c:4730
#21 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
#22 0x0000000000528814 in fast_function (nk=<optimized out>, na=<optimized 
out>, n=<optimized out>, pp_stack=0x7ffe6510ac10, func=<optimized out>) at 
../Python/ceval.c:4803
#23 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510ac10) at 
../Python/ceval.c:4730
#24 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
#25 0x0000000000528814 in fast_function (nk=<optimized out>, na=<optimized 
out>, n=<optimized out>, pp_stack=0x7ffe6510ad40, func=<optimized out>) at 
../Python/ceval.c:4803
#26 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510ad40) at 
../Python/ceval.c:4730
#27 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
#28 0x000000000052d2e3 in _PyEval_EvalCodeWithName () at ../Python/ceval.c:4018
#29 0x000000000052dfdf in PyEval_EvalCodeEx () at ../Python/ceval.c:4039
#30 PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, 
locals=<optimized out>) at ../Python/ceval.c:777
#31 0x00000000005fd2c2 in run_mod () at ../Python/pythonrun.c:976
#32 0x00000000005ff76a in PyRun_FileExFlags () at ../Python/pythonrun.c:929
#33 0x00000000005ff95c in PyRun_SimpleFileExFlags () at 
../Python/pythonrun.c:396
#34 0x000000000063e7d6 in run_file (p_cf=0x7ffe6510afb0, filename=0x2161260 
L"scripts/parquet_export.py", fp=0x226fde0) at ../Modules/main.c:318
#35 Py_Main () at ../Modules/main.c:768
#36 0x00000000004cfe41 in main () at ../Programs/python.c:65
#37 0x00007fbadf0db830 in __libc_start_main (main=0x4cfd60 <main>, argc=2, 
argv=0x7ffe6510b1c8, init=<optimized out>, fini=<optimized out>, 
rtld_fini=<optimized out>, stack_end=0x7ffe6510b1b8)
    at ../csu/libc-start.c:291
#38 0x00000000005d5f29 in _start ()
```

This is occurring in a pretty vanilla call to `pq.write_table(table, output)`. 
Before the crash, I'm able to print out the table's schema and it looks a 
little odd (all columns are explicitly specified in {{pandas.read_csv()}} to be 
strings...

```
_id: string
ref_id: string
ref_no: string
stage: string
stage2_ref_id: string
org_id: string
classification: string
solicitation_no: string
notice_type: string
business_category: string
procurement_mode: string
funding_instrument: string
funding_source: string
approved_budget: string
publish_date: string
closing_date: string
contract_duration: string
calendar_type: string
trade_agreement: string
pre_bid_date: string
pre_bid_venue: string
procuring_entity_org_id: string
procuring_entity_org: string
client_agency_org_id: string
client_agency_org: string
contact_person: string
contact_person_address: string
tender_title: string
description: string
other_info: string
reason: string
created_by: string
creation_date: string
modified_date: string
special_instruction: string
collection_contact: string
tender_status: string
collection_point: string
date_available: string
serialid: string
__index_level_0__: int64
-- metadata --
pandas: {"index_columns": ["__index_level_0__"], "columns": [{"pandas_type": 
"unicode", "numpy_type": "object", "metadata": null, "name": "_id"}, 
{"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
"ref_id"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, 
"name": "ref_no"}, {"pandas_type": "unicode", "numpy_type": "object", 
"metadata": null, "name": "stage"}, {"pandas_type": "mixed", "numpy_type": 
"object", "metadata": null, "name": "stage2_ref_id"}, {"pandas_type": 
"unicode", "numpy_type": "object", "metadata": null, "name": "org_id"}, 
{"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
"classification"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": 
null, "name": "solicitation_no"}, {"pandas_type": "unicode", "numpy_type": 
"object", "metadata": null, "name": "notice_type"}, {"pandas_type": "unicode", 
"numpy_type": "object", "metadata": null, "name": "business_category"}, 
{"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
"procurement_mode"}, {"pandas_type": "mixed", "numpy_type": "object", 
"metadata": null, "name": "funding_instrument"}, {"pandas_type": "unicode", 
"numpy_type": "object", "metadata": null, "name": "funding_source"}, 
{"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
"approved_budget"}, {"pandas_type": "mixed", "numpy_type": "object", 
"metadata": null, "name": "publish_date"}, {"pandas_type": "mixed", 
"numpy_type": "object", "metadata": null, "name": "closing_date"}, 
{"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
"contract_duration"}, {"pandas_type": "mixed", "numpy_type": "object", 
"metadata": null, "name": "calendar_type"}, {"pandas_type": "unicode", 
"numpy_type": "object", "metadata": null, "name": "trade_agreement"}, 
{"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": 
"pre_bid_date"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": 
null, "name": "pre_bid_venue"}, {"pandas_type": "unicode", "numpy_type": 
"object", "metadata": null, "name": "procuring_entity_org_id"}, {"pandas_type": 
"unicode", "numpy_type": "object", "metadata": null, "name": 
"procuring_entity_org"}, {"pandas_type": "unicode", "numpy_type": "object", 
"metadata": null, "name": "client_agency_org_id"}, {"pandas_type": "mixed", 
"numpy_type": "object", "metadata": null, "name": "client_agency_org"}, 
{"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": 
"contact_person"}, {"pandas_type": "unicode", "numpy_type": "object", 
"metadata": null, "name": "contact_person_address"}, {"pandas_type": "mixed", 
"numpy_type": "object", "metadata": null, "name": "tender_title"}, 
{"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": 
"description"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": 
null, "name": "other_info"}, {"pandas_type": "mixed", "numpy_type": "object", 
"metadata": null, "name": "reason"}, {"pandas_type": "unicode", "numpy_type": 
"object", "metadata": null, "name": "created_by"}, {"pandas_type": "unicode", 
"numpy_type": "object", "metadata": null, "name": "creation_date"}, 
{"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
"modified_date"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": 
null, "name": "special_instruction"}, {"pandas_type": "mixed", "numpy_type": 
"object", "metadata": null, "name": "collection_contact"}, {"pandas_type": 
"mixed", "numpy_type": "object", "metadata": null, "name": "tender_status"}, 
{"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": 
"collection_point"}, {"pandas_type": "mixed", "numpy_type": "object", 
"metadata": null, "name": "date_available"}, {"pandas_type": "unicode", 
"numpy_type": "object", "metadata": null, "name": "serialid"}, {"pandas_type": 
"int64", "numpy_type": "int64", "metadata": null, "name": 
"__index_level_0__"}], "pandas_version": "0.19.2"}
Segmentation fault (core dumped)
```


> Writing pyarrow Table to Parquet core dumps
> -------------------------------------------
>
>                 Key: ARROW-1167
>                 URL: https://issues.apache.org/jira/browse/ARROW-1167
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Jeff Knupp
>
> When writing a pyarrow Table (created from a Pandas dataframe that was read 
> from a ~5GB CSV file) to a Parquet file, the interpreter dumps core with the 
> following stack trace from gdb:
> {code}
> #0  __memmove_avx_unaligned () at 
> ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:181
> #1  0x00007fbaa5c779f1 in parquet::InMemoryOutputStream::Write(unsigned char 
> const*, long) () from 
> /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #2  0x00007fbaa5c0ce97 in 
> parquet::PlainEncoder<parquet::DataType<(parquet::Type::type)6> 
> >::Put(parquet::ByteArray const*, int) ()
>    from 
> /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #3  0x00007fbaa5c18855 in 
> parquet::TypedColumnWriter<parquet::DataType<(parquet::Type::type)6> 
> >::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray 
> const*) ()
>    from 
> /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #4  0x00007fbaa5c189d5 in 
> parquet::TypedColumnWriter<parquet::DataType<(parquet::Type::type)6> 
> >::WriteBatch(long, short const*, short const*, parquet::ByteArray const*) ()
>    from 
> /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #5  0x00007fbaa5be0900 in arrow::Status 
> parquet::arrow::FileWriter::Impl::TypedWriteBatch<parquet::DataType<(parquet::Type::type)6>,
>  arrow::BinaryType>(parquet::ColumnWriter*, std::shared_ptr<arrow::Array> 
> const&, long, short const*, short const*) () from 
> /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #6  0x00007fbaa5be171d in 
> parquet::arrow::FileWriter::Impl::WriteColumnChunk(arrow::Array const&) () 
> from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #7  0x00007fbaa5be1dad in 
> parquet::arrow::FileWriter::WriteColumnChunk(arrow::Array const&) () from 
> /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #8  0x00007fbaa5be2047 in parquet::arrow::FileWriter::WriteTable(arrow::Table 
> const&, long) () from 
> /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #9  0x00007fbaa51e1f53 in 
> __pyx_pw_7pyarrow_8_parquet_13ParquetWriter_5write_table(_object*, _object*, 
> _object*) ()
>    from 
> /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/_parquet.cpython-35m-x86_64-linux-gnu.so
> #10 0x00000000004e9bc7 in PyCFunction_Call () at ../Objects/methodobject.c:98
> #11 0x0000000000529885 in do_call (nk=<optimized out>, na=<optimized out>, 
> pp_stack=0x7ffe6510a6c0, func=<optimized out>) at ../Python/ceval.c:4933
> #12 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510a6c0) at 
> ../Python/ceval.c:4732
> #13 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #14 0x000000000052d2e3 in _PyEval_EvalCodeWithName () at 
> ../Python/ceval.c:4018
> #15 0x0000000000528eee in fast_function (nk=<optimized out>, na=<optimized 
> out>, n=<optimized out>, pp_stack=0x7ffe6510a8d0, func=<optimized out>) at 
> ../Python/ceval.c:4813
> #16 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510a8d0) at 
> ../Python/ceval.c:4730
> #17 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #18 0x000000000052d2e3 in _PyEval_EvalCodeWithName () at 
> ../Python/ceval.c:4018
> #19 0x0000000000528eee in fast_function (nk=<optimized out>, na=<optimized 
> out>, n=<optimized out>, pp_stack=0x7ffe6510aae0, func=<optimized out>) at 
> ../Python/ceval.c:4813
> #20 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510aae0) at 
> ../Python/ceval.c:4730
> #21 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #22 0x0000000000528814 in fast_function (nk=<optimized out>, na=<optimized 
> out>, n=<optimized out>, pp_stack=0x7ffe6510ac10, func=<optimized out>) at 
> ../Python/ceval.c:4803
> #23 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510ac10) at 
> ../Python/ceval.c:4730
> #24 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #25 0x0000000000528814 in fast_function (nk=<optimized out>, na=<optimized 
> out>, n=<optimized out>, pp_stack=0x7ffe6510ad40, func=<optimized out>) at 
> ../Python/ceval.c:4803
> #26 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510ad40) at 
> ../Python/ceval.c:4730
> #27 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #28 0x000000000052d2e3 in _PyEval_EvalCodeWithName () at 
> ../Python/ceval.c:4018
> #29 0x000000000052dfdf in PyEval_EvalCodeEx () at ../Python/ceval.c:4039
> #30 PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, 
> locals=<optimized out>) at ../Python/ceval.c:777
> #31 0x00000000005fd2c2 in run_mod () at ../Python/pythonrun.c:976
> #32 0x00000000005ff76a in PyRun_FileExFlags () at ../Python/pythonrun.c:929
> #33 0x00000000005ff95c in PyRun_SimpleFileExFlags () at 
> ../Python/pythonrun.c:396
> #34 0x000000000063e7d6 in run_file (p_cf=0x7ffe6510afb0, filename=0x2161260 
> L"scripts/parquet_export.py", fp=0x226fde0) at ../Modules/main.c:318
> #35 Py_Main () at ../Modules/main.c:768
> #36 0x00000000004cfe41 in main () at ../Programs/python.c:65
> #37 0x00007fbadf0db830 in __libc_start_main (main=0x4cfd60 <main>, argc=2, 
> argv=0x7ffe6510b1c8, init=<optimized out>, fini=<optimized out>, 
> rtld_fini=<optimized out>, stack_end=0x7ffe6510b1b8)
>     at ../csu/libc-start.c:291
> #38 0x00000000005d5f29 in _start ()
> {code}
> This occurs in a pretty vanilla call to {{pq.write_table(table, output)}}. 
> Before the crash I can print the table's schema, and it looks a little odd: 
> every column is explicitly specified as a string in {{pandas.read_csv()}}, yet 
> several columns show a "mixed" pandas_type in the metadata below.
> {code}
> _id: string
> ref_id: string
> ref_no: string
> stage: string
> stage2_ref_id: string
> org_id: string
> classification: string
> solicitation_no: string
> notice_type: string
> business_category: string
> procurement_mode: string
> funding_instrument: string
> funding_source: string
> approved_budget: string
> publish_date: string
> closing_date: string
> contract_duration: string
> calendar_type: string
> trade_agreement: string
> pre_bid_date: string
> pre_bid_venue: string
> procuring_entity_org_id: string
> procuring_entity_org: string
> client_agency_org_id: string
> client_agency_org: string
> contact_person: string
> contact_person_address: string
> tender_title: string
> description: string
> other_info: string
> reason: string
> created_by: string
> creation_date: string
> modified_date: string
> special_instruction: string
> collection_contact: string
> tender_status: string
> collection_point: string
> date_available: string
> serialid: string
> __index_level_0__: int64
> -- metadata --
> pandas: {"index_columns": ["__index_level_0__"], "columns": [{"pandas_type": 
> "unicode", "numpy_type": "object", "metadata": null, "name": "_id"}, 
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
> "ref_id"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": 
> null, "name": "ref_no"}, {"pandas_type": "unicode", "numpy_type": "object", 
> "metadata": null, "name": "stage"}, {"pandas_type": "mixed", "numpy_type": 
> "object", "metadata": null, "name": "stage2_ref_id"}, {"pandas_type": 
> "unicode", "numpy_type": "object", "metadata": null, "name": "org_id"}, 
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
> "classification"}, {"pandas_type": "mixed", "numpy_type": "object", 
> "metadata": null, "name": "solicitation_no"}, {"pandas_type": "unicode", 
> "numpy_type": "object", "metadata": null, "name": "notice_type"}, 
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
> "business_category"}, {"pandas_type": "unicode", "numpy_type": "object", 
> "metadata": null, "name": "procurement_mode"}, {"pandas_type": "mixed", 
> "numpy_type": "object", "metadata": null, "name": "funding_instrument"}, 
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
> "funding_source"}, {"pandas_type": "unicode", "numpy_type": "object", 
> "metadata": null, "name": "approved_budget"}, {"pandas_type": "mixed", 
> "numpy_type": "object", "metadata": null, "name": "publish_date"}, 
> {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": 
> "closing_date"}, {"pandas_type": "unicode", "numpy_type": "object", 
> "metadata": null, "name": "contract_duration"}, {"pandas_type": "mixed", 
> "numpy_type": "object", "metadata": null, "name": "calendar_type"}, 
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
> "trade_agreement"}, {"pandas_type": "mixed", "numpy_type": "object", 
> "metadata": null, "name": "pre_bid_date"}, {"pandas_type": "mixed", 
> "numpy_type": "object", "metadata": null, "name": "pre_bid_venue"}, 
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
> "procuring_entity_org_id"}, {"pandas_type": "unicode", "numpy_type": 
> "object", "metadata": null, "name": "procuring_entity_org"}, {"pandas_type": 
> "unicode", "numpy_type": "object", "metadata": null, "name": 
> "client_agency_org_id"}, {"pandas_type": "mixed", "numpy_type": "object", 
> "metadata": null, "name": "client_agency_org"}, {"pandas_type": "mixed", 
> "numpy_type": "object", "metadata": null, "name": "contact_person"}, 
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
> "contact_person_address"}, {"pandas_type": "mixed", "numpy_type": "object", 
> "metadata": null, "name": "tender_title"}, {"pandas_type": "mixed", 
> "numpy_type": "object", "metadata": null, "name": "description"}, 
> {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": 
> "other_info"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": 
> null, "name": "reason"}, {"pandas_type": "unicode", "numpy_type": "object", 
> "metadata": null, "name": "created_by"}, {"pandas_type": "unicode", 
> "numpy_type": "object", "metadata": null, "name": "creation_date"}, 
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
> "modified_date"}, {"pandas_type": "mixed", "numpy_type": "object", 
> "metadata": null, "name": "special_instruction"}, {"pandas_type": "mixed", 
> "numpy_type": "object", "metadata": null, "name": "collection_contact"}, 
> {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": 
> "tender_status"}, {"pandas_type": "mixed", "numpy_type": "object", 
> "metadata": null, "name": "collection_point"}, {"pandas_type": "mixed", 
> "numpy_type": "object", "metadata": null, "name": "date_available"}, 
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
> "serialid"}, {"pandas_type": "int64", "numpy_type": "int64", "metadata": 
> null, "name": "__index_level_0__"}], "pandas_version": "0.19.2"}
> Segmentation fault (core dumped)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
