[jira] [Commented] (ARROW-1167) Writing pyarrow Table to Parquet core dumps

2017-07-16 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089168#comment-16089168
 ] 

Wes McKinney commented on ARROW-1167:
-------------------------------------

Moving this to 0.5.0. Added the overflow checks in ARROW-1177 
(https://github.com/apache/arrow/pull/853), but I think we can resolve this 
temporarily by chunking the binary column when it gets too large in 
{{Table.from_pandas}}.
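
To illustrate the workaround, here is a rough user-side sketch (illustrative 
only, not the eventual {{Table.from_pandas}} change; it assumes a pyarrow 
recent enough to expose {{pa.chunked_array}}, and the helper name is made up):

{code}
import pyarrow as pa

def chunked_string_array(values, max_chunk_bytes=2**31 - 1):
    # Start a new chunk before the cumulative UTF-8 size of the current
    # chunk would overflow the int32 offsets of a single array.
    chunks, current, size = [], [], 0
    for v in values:
        n = 0 if v is None else len(v.encode('utf-8'))
        if current and size + n > max_chunk_bytes:
            chunks.append(pa.array(current, type=pa.string()))
            current, size = [], 0
        current.append(v)
        size += n
    chunks.append(pa.array(current, type=pa.string()))
    return pa.chunked_array(chunks, type=pa.string())
{code}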

> Writing pyarrow Table to Parquet core dumps
> -------------------------------------------
>
>                 Key: ARROW-1167
>                 URL: https://issues.apache.org/jira/browse/ARROW-1167
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Jeff Knupp
>             Fix For: 0.5.0
>
>
> When writing a pyarrow Table (instantiated from a Pandas dataframe reading in
> a ~5GB CSV file) to a parquet file, the interpreter cores with the following
> stack trace from gdb:
> {code}
> #0  __memmove_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:181
> #1  0x7fbaa5c779f1 in parquet::InMemoryOutputStream::Write(unsigned char const*, long) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #2  0x7fbaa5c0ce97 in parquet::PlainEncoder<parquet::ByteArrayType>::Put(parquet::ByteArray const*, int) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #3  0x7fbaa5c18855 in parquet::TypedColumnWriter<parquet::ByteArrayType>::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray const*) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #4  0x7fbaa5c189d5 in parquet::TypedColumnWriter<parquet::ByteArrayType>::WriteBatch(long, short const*, short const*, parquet::ByteArray const*) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #5  0x7fbaa5be0900 in arrow::Status parquet::arrow::FileWriter::Impl::TypedWriteBatch<parquet::ByteArrayType, arrow::BinaryType>(parquet::ColumnWriter*, std::shared_ptr<arrow::Array> const&, long, short const*, short const*) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #6  0x7fbaa5be171d in parquet::arrow::FileWriter::Impl::WriteColumnChunk(arrow::Array const&) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #7  0x7fbaa5be1dad in parquet::arrow::FileWriter::WriteColumnChunk(arrow::Array const&) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #8  0x7fbaa5be2047 in parquet::arrow::FileWriter::WriteTable(arrow::Table const&, long) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #9  0x7fbaa51e1f53 in __pyx_pw_7pyarrow_8_parquet_13ParquetWriter_5write_table(_object*, _object*, _object*) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/_parquet.cpython-35m-x86_64-linux-gnu.so
> #10 0x004e9bc7 in PyCFunction_Call () at ../Objects/methodobject.c:98
> #11 0x00529885 in do_call (nk=<optimized out>, na=<optimized out>, pp_stack=0x7ffe6510a6c0, func=<optimized out>) at ../Python/ceval.c:4933
> #12 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510a6c0) at ../Python/ceval.c:4732
> #13 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #14 0x0052d2e3 in _PyEval_EvalCodeWithName () at ../Python/ceval.c:4018
> #15 0x00528eee in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffe6510a8d0, func=<optimized out>) at ../Python/ceval.c:4813
> #16 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510a8d0) at ../Python/ceval.c:4730
> #17 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #18 0x0052d2e3 in _PyEval_EvalCodeWithName () at ../Python/ceval.c:4018
> #19 0x00528eee in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffe6510aae0, func=<optimized out>) at ../Python/ceval.c:4813
> #20 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510aae0) at ../Python/ceval.c:4730
> #21 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #22 0x00528814 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffe6510ac10, func=<optimized out>) at ../Python/ceval.c:4803
> #23 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510ac10) at ../Python/ceval.c:4730
> #24 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #25 0x00528814 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffe6510ad40, func=<optimized out>) at ../Python/ceval.c:4803
> #26 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510ad40) at ../Python/ceval.c:4730
> #27 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #28 0x0052d2e3 in _PyEval_EvalCodeWithName () at ../Python/ceval.c:4018
> #29 0x0052dfdf in PyEval_EvalCodeEx () at ../Python/ceval.c:4039
> #30 PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at ../Python/ceval.c:777
> #31 0x005fd2c2 in run_mod () at ../Python/pythonrun.c:976
> #32 0x005ff76a in PyRun_FileExFlags () at ../Python/pythonrun.c:929
> #33 0x005ff95c in PyRun_SimpleFileExFlags () at ../Python/pythonrun.c:396
> #34 0x0063e7d6 in run_file (p_cf=0x7ffe6510afb0, filename=0x2161260 L"scripts/parquet_export.py", fp=0x226fde0) at ../Modules/main.c:318
> #35 Py_Main () at ../Modules/main.c:768
> {code}

[jira] [Commented] (ARROW-1167) Writing pyarrow Table to Parquet core dumps

2017-07-06 Thread Jeff Knupp (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16076847#comment-16076847
 ] 

Jeff Knupp commented on ARROW-1167:
-----------------------------------

Ah, OK. I misunderstood. The pandas problem is somewhat similar, but that one 
_is_ caused by the size of the type used to calculate memory (re)allocations. 
Never mind! I'll ask this question on the pandas PR. I'll also look into 
implementing a chunked array in {{Table.from_pandas}}.


[jira] [Commented] (ARROW-1167) Writing pyarrow Table to Parquet core dumps

2017-07-06 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16076837#comment-16076837
 ] 

Wes McKinney commented on ARROW-1167:
-------------------------------------

What do you mean by "Does it make sense to move to int64 to track buffer 
sizes?" The problem in Arrow is different, I think -- the variable-length 
offsets are overflowing; the underlying memory buffers all use 64-bit 
integers. There is ARROW-750 to add string/binary types with 64-bit offsets, 
but in the meantime the easier route is to create a chunked array in 
{{Table.from_pandas}} rather than one huge array.
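
To make the overflow concrete, some back-of-the-envelope arithmetic (the row 
count is the one reported in this issue; the average value length is a made-up 
assumption):

{code}
INT32_MAX = 2**31 - 1       # int32 offsets address at most this many bytes

rows = 18800000             # ~18.8M lines, per the reports below
avg_bytes = 120             # hypothetical average string length per row
total = rows * avg_bytes    # 2,256,000,000 bytes in one column
print(total > INT32_MAX)    # True: the later offsets wrap negative
{code}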


[jira] [Commented] (ARROW-1167) Writing pyarrow Table to Parquet core dumps

2017-07-06 Thread Jeff Knupp (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16076822#comment-16076822
 ] 

Jeff Knupp commented on ARROW-1167:
-----------------------------------

So [~wesmckinn], pandas has the exact same bug (a bit easier to trigger), as 
reported here: https://github.com/pandas-dev/pandas/issues/16798. I tracked 
down where the allocation that triggers the issue occurs, and unsurprisingly 
it's when growing the buffer to accommodate the size of the data. I've 
confirmed that this also results in an integer overflow of the size to be 
allocated.

Now, that's all well and good, but I'd actually like to fix all of these issues 
in the two projects. *Does it make sense to move to int64 to track buffer 
sizes?* We can still check for overflow, but this solves the underlying issue 
as well.

Let me know what you think.
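
To illustrate the difference (a toy sketch, not the actual pandas or Arrow 
allocator code): with 32-bit bookkeeping the doubling step itself can wrap 
negative, while 64-bit bookkeeping grows cleanly past 2 GB or at least fails 
loudly.

{code}
def wrap_int32(x):
    # Emulate C two's-complement int32 wraparound.
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

def grow_capacity_int32(capacity, needed):
    # Doubling with int32 arithmetic: past 2**31 - 1 the size wraps
    # negative, and a later memcpy runs off the end of the allocation.
    while 0 < capacity < needed:
        capacity = wrap_int32(capacity * 2)
    return capacity

def grow_capacity_int64(capacity, needed):
    # 64-bit bookkeeping: the same doubling grows past 2 GB, and an
    # explicit check can still reject absurd sizes instead of wrapping.
    while capacity < needed:
        capacity *= 2
        if capacity > 2**63 - 1:
            raise OverflowError('allocation size overflow')
    return capacity

print(grow_capacity_int32(1 << 30, 3 * (1 << 30)))  # -2147483648
print(grow_capacity_int64(1 << 30, 3 * (1 << 30)))  # 4294967296
{code}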


[jira] [Commented] (ARROW-1167) Writing pyarrow Table to Parquet core dumps

2017-07-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071805#comment-16071805
 ] 

Wes McKinney commented on ARROW-1167:
-------------------------------------

OK, I believe the root cause is that one of the columns in this dataset has 
over 2GB of string data in it, which is causing an undetected overflow of the 
int32 offsets in the underlying {{BinaryArray}} object. So there are a few 
things that need to happen:

* Detecting int32 overflow in BinaryBuilder, so that constructing a malformed 
BinaryArray like this isn't possible (see the sketch after this list)
* Making sure such overflows are raised properly out of {{Table.from_pandas}}
* Providing for chunked table construction in {{Table.from_pandas}} (which will 
help you fix this problem)
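
Here is what the first check amounts to, sketched in Python pseudocode rather 
than the actual C++ {{arrow::BinaryBuilder}}:

{code}
INT32_MAX = 2**31 - 1

class BinaryBuilderSketch:
    # Toy model: int32 offsets index into one contiguous value buffer,
    # so a single array can hold at most 2**31 - 1 bytes of value data.
    def __init__(self):
        self.offsets = [0]
        self.data = bytearray()

    def append(self, value):
        if len(self.data) + len(value) > INT32_MAX:
            # The missing check: fail here instead of letting the next
            # offset silently wrap to a negative int32.
            raise OverflowError('BinaryArray value data would exceed 2GB')
        self.data.extend(value)
        self.offsets.append(len(self.data))
{code}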

cc [~xhochy]


[jira] [Commented] (ARROW-1167) Writing pyarrow Table to Parquet core dumps

2017-07-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071802#comment-16071802
 ] 

Wes McKinney commented on ARROW-1167:
-------------------------------------

Thanks. I'm able to reproduce; I will dig in and try to figure out the root 
cause.


[jira] [Commented] (ARROW-1167) Writing pyarrow Table to Parquet core dumps

2017-07-02 Thread Jeff Knupp (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071612#comment-16071612
 ] 

Jeff Knupp commented on ARROW-1167:
-----------------------------------

[~wesmckinn] I've attached a link to a bzip2-compressed version of the source 
file that reliably reproduces the issue. I tried to see if I could reproduce it 
with a subset of the file's data, but I was only able to get it to crash 
(stepping in increments of 1,000,000 lines) at 18,000,000 out of ~18,800,000 
lines, so I am just posting the original file in its entirety. Let me know if 
you have issues reproducing.

Link: 
[test_data.csv.bz2|https://www.dropbox.com/s/hguhamz0gdv2uzv/test_data.csv.bz2?dl=0]

MD5 for the uncompressed file:

> MD5 (./test_data.csv) = 9f92942dab60d1fde04773d57759fce2
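
If it helps, here is a quick streamed check of the uncompressed MD5 that 
avoids materializing the multi-GB CSV on disk (the local filename is assumed):

{code}
import bz2
import hashlib

h = hashlib.md5()
with bz2.open('test_data.csv.bz2', 'rb') as f:
    for block in iter(lambda: f.read(1 << 20), b''):
        h.update(block)
print(h.hexdigest())  # expect 9f92942dab60d1fde04773d57759fce2
{code}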


[jira] [Commented] (ARROW-1167) Writing pyarrow Table to Parquet core dumps

2017-07-01 Thread Jeff Knupp (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071476#comment-16071476
 ] 

Jeff Knupp commented on ARROW-1167:
-----------------------------------

Ah, it may be that the file isn't hitting whatever limit/line was causing the 
error, since I was actually building the file to aid in recreating 
https://github.com/pandas-dev/pandas/issues/16798. I'll post a link to the 
smallest file I can create that reproduces the error (though it may require 
most of the 3+ GB file).


[jira] [Commented] (ARROW-1167) Writing pyarrow Table to Parquet core dumps

2017-06-30 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070891#comment-16070891
 ] 

Wes McKinney commented on ARROW-1167:
-------------------------------------

With that data file, I'm running the following code against the master 
branches in a debug build, and I get no core dump:

{code}
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

DATA_PATH = os.path.expanduser('~/Downloads/test_data.csv.bz2')

# Read every column as a plain Python string.
dtypes = {'_id': str,
  'approved_budget': str,
  'business_category': str,
  'calendar_type': str,
  'classification': str,
  'client_agency_org': str,
  'client_agency_org_id': str,
  'closing_date': str,
  'collection_contact': str,
  'collection_point': str,
  'contact_person': str,
  'contact_person_address': str,
  'contract_duration': str,
  'created_by': str,
  'creation_date': str,
  'date_available': str,
  'description': str,
  'funding_instrument': str,
  'funding_source': str,
  'modified_date': str,
  'notice_type': str,
  'org_id': str,
  'other_info': str,
  'pre_bid_date': str,
  'pre_bid_venue': str,
  'procurement_mode': str,
  'procuring_entity_org': str,
  'procuring_entity_org_id': str,
  'publish_date': str,
  'reason': str,
  'ref_id': str,
  'ref_no': str,
  'serialid': str,
  'solicitation_no': str,
  'special_instruction': str,
  'stage': str,
  'stage2_ref_id': str,
  'tender_status': str,
  'tender_title': str,
  'trade_agreement': str}


df = pd.read_csv(DATA_PATH, dtype=dtypes)  # compression inferred from .bz2
table = pa.Table.from_pandas(df)

pq.write_table(table, 'test.parquet')
{code}

I'm using pandas 0.18.1; I will try the latest pandas version / release builds 
/ pyarrow 0.4.1 later. Let me know if I should use different code.


[jira] [Commented] (ARROW-1167) Writing pyarrow Table to Parquet core dumps

2017-06-30 Thread Jeff Knupp (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070576#comment-16070576
 ] 

Jeff Knupp commented on ARROW-1167:
-----------------------------------

The smallest amount I could get it to core reliably on is 500,000 lines (1 GB 
uncompressed). Here is a link to a bzip2-compressed version: 
https://www.dropbox.com/s/hguhamz0gdv2uzv/test_data.csv.bz2?dl=0

MD5 of the uncompressed file is listed below:

MD5 (test_data.csv) = 9a66139195677008b4fcb56468e19234


[jira] [Commented] (ARROW-1167) Writing pyarrow Table to Parquet core dumps

2017-06-29 Thread Jeff Knupp (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068952#comment-16068952
 ] 

Jeff Knupp commented on ARROW-1167:
-----------------------------------

I can't upload the whole thing (it's > 5GB) but I can certainly upload a
portion of it. Let me grab it and see if I can reproduce on a small portion
of the file.



[jira] [Commented] (ARROW-1167) Writing pyarrow Table to Parquet core dumps

2017-06-29 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068944#comment-16068944
 ] 

Phillip Cloud commented on ARROW-1167:
--------------------------------------

[~jeffknupp] Can you upload all or part of that CSV file?


[jira] [Commented] (ARROW-1167) Writing pyarrow Table to Parquet core dumps

2017-06-29 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068821#comment-16068821
 ] 

Wes McKinney commented on ARROW-1167:
-

There's another bug here ("mixed" is not a valid pandas_type in the metadata); it's reported separately as ARROW-1168.
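
As an aside for anyone hitting ARROW-1168: the pandas_type values live in the JSON blob that pyarrow stores under the 'pandas' key of the Parquet file's key-value metadata. A minimal sketch for inspecting them, assuming a pyarrow build that exposes FileMetaData.metadata (the file name is a placeholder):

{code}
# Sketch: inspect the 'pandas' metadata block written into a Parquet file.
# Assumes FileMetaData.metadata is exposed; "example.parquet" is a placeholder.
import json

import pyarrow.parquet as pq

meta = pq.ParquetFile("example.parquet").metadata.metadata
pandas_meta = json.loads(meta[b"pandas"].decode("utf-8"))
for col in pandas_meta["columns"]:
    # A value like "mixed" here is what ARROW-1168 is about.
    print(col["name"], col["pandas_type"])
{code}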

> Writing pyarrow Table to Parquet core dumps
> [Quoted issue description and gdb stack trace omitted; identical to the report above.]

[jira] [Commented] (ARROW-1167) Writing pyarrow Table to Parquet core dumps

2017-06-29 Thread Jeff Knupp (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068742#comment-16068742
 ] 

Jeff Knupp commented on ARROW-1167:
---

Yeah, I thought the same thing re: 
https://github.com/apache/parquet-cpp/pull/195. The schema is "odd" because 
there shouldn't be, and as far as I can tell aren't, any bytes objects in the 
data. It's a plain-text CSV, and I'd be shocked if it contained even any 
non-ASCII values. This is with the stock pyarrow 0.4.1 install from PyPI. I can 
do a debug build of the latest version and report what I find.
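
In the meantime, here's a quick sanity check that a column really holds no stray bytes objects; just a sketch with toy placeholder data:

{code}
# Sketch: verify that an object column holds only str values and no stray
# bytes objects. `df` and the column name "payload" are placeholders.
import pandas as pd

df = pd.DataFrame({"payload": ["abc", "def", b"ghi"]})  # toy data

type_counts = df["payload"].map(type).value_counts()
print(type_counts)  # a bytes entry here would explain a "mixed" pandas_type
{code}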

> Writing pyarrow Table to Parquet core dumps
> [Quoted issue description and gdb stack trace omitted; identical to the report above.]

[jira] [Commented] (ARROW-1167) Writing pyarrow Table to Parquet core dumps

2017-06-29 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068718#comment-16068718
 ] 

Wes McKinney commented on ARROW-1167:
-

It seems like this could be a manifestation of 
https://github.com/apache/parquet-cpp/pull/195; if so, one of the DCHECKs will 
be triggered in a debug build, and then we can try to figure out the underlying 
cause.
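
For anyone without the original ~5GB CSV, here is a sketch of a repro attempt under the assumption that sheer column size is the trigger; it needs several GB of RAM and is written against the current pyarrow API spelling:

{code}
# Sketch of a repro attempt: build a single string column whose total data
# size is large (multiple GB) and write it through the same code path.
# Assumes the crash is tied to column size; needs a machine with enough RAM.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

n_rows = 50 * 1000 * 1000
df = pd.DataFrame({"s": ["x" * 64] * n_rows})  # ~3.2 GB of character data

table = pa.Table.from_pandas(df)
pq.write_table(table, "repro.parquet")  # expect the crash here if affected
{code}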

> Writing pyarrow Table to Parquet core dumps
> [Quoted issue description and gdb stack trace omitted; identical to the report above.]

[jira] [Commented] (ARROW-1167) Writing pyarrow Table to Parquet core dumps

2017-06-29 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068674#comment-16068674
 ] 

Wes McKinney commented on ARROW-1167:
-

Also, could you clarify how the schema is "odd"? It looks like some columns 
have both unicode and bytes objects.
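
For illustration, this is roughly how pandas classifies such columns, assuming pandas.api.types.infer_dtype is available:

{code}
# Sketch: how pandas classifies a column holding both unicode and bytes.
# Assumes pandas.api.types.infer_dtype is available; toy data only.
from pandas.api.types import infer_dtype

print(infer_dtype(["abc", "def"]))    # typically 'string'
print(infer_dtype([b"abc", b"def"]))  # typically 'bytes'
print(infer_dtype(["abc", b"def"]))   # typically 'mixed'
{code}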

> Writing pyarrow Table to Parquet core dumps
> [Quoted issue description and gdb stack trace omitted; identical to the report above.]

[jira] [Commented] (ARROW-1167) Writing pyarrow Table to Parquet core dumps

2017-06-29 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068672#comment-16068672
 ] 

Wes McKinney commented on ARROW-1167:
-

What version of the software is this? Can you see if you can reproduce the 
failure with a debug build? A backtrace with debug symbols enabled would be 
helpful, and if there's any way for one of us to reproduce the issue ourselves, 
that would help even more.
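
A small sketch of the environment details worth pasting into the report; only the standard __version__ attributes are assumed to exist:

{code}
# Sketch: gather the version details worth including in a bug report.
# Only the standard __version__ attributes are assumed on these modules.
import sys

import numpy as np
import pandas as pd
import pyarrow as pa

print("python :", sys.version)
print("numpy  :", np.__version__)
print("pandas :", pd.__version__)
print("pyarrow:", pa.__version__)
{code}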

> Writing pyarrow Table to Parquet core dumps
> [Quoted issue description and gdb stack trace omitted; identical to the report above.]