[ https://issues.apache.org/jira/browse/ARROW-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932691#comment-16932691 ]
Wes McKinney commented on ARROW-6573:
-------------------------------------

This raises an exception in master:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

data = dict()
data["key"] = [0, 1, 2, 3]  # segfault
#data["key"] = ["0", "1", "2", "3"]  # no segfault

schema = pa.schema({"key" : pa.string()})

table = pa.Table.from_pydict(data, schema = schema)
print("now writing out test file")
pq.write_table(table, "test.parquet")
{code}

{noformat}
---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
<ipython-input-1-1ff07de63b32> in <module>
      8 schema = pa.schema({"key" : pa.string()})
      9
---> 10 table = pa.Table.from_pydict(data, schema = schema)
     11 print("now writing out test file")
     12 pq.write_table(table, "test.parquet")

~/code/arrow/python/pyarrow/types.pxi in __iter__()
~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowTypeError: Expected a string or bytes object, got a 'int' object
In ../src/arrow/python/common.h, line 241, code: FromBinary(obj, "a string or bytes object")
In ../src/arrow/python/python_to_arrow.cc, line 549, code: string_view_.FromString(obj, &is_utf8)
In ../src/arrow/python/python_to_arrow.cc, line 570, code: Append(obj, &is_full)
In ../src/arrow/python/iterators.h, line 70, code: func(value, static_cast<int64_t>(i), &keep_going)
In ../src/arrow/python/python_to_arrow.cc, line 1097, code: converter->AppendMultiple(seq, size)
{noformat}

We might want to add a unit test for this, though.

> Segfault when writing to parquet
> --------------------------------
>
>                 Key: ARROW-6573
>                 URL: https://issues.apache.org/jira/browse/ARROW-6573
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.14.1
>        Environment: Ubuntu 16.04. Pyarrow 0.14.1 installed through pip. Using Anaconda distribution of Python 3.7.
>            Reporter: Josh Weinstock
>            Priority: Minor
>
> When attempting to write out a pyarrow table to Parquet, I am observing a segfault when there is a mismatch between the schema and the data types. Here is a reproducible example:
>
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> data = dict()
> data["key"] = [0, 1, 2, 3]  # segfault
> #data["key"] = ["0", "1", "2", "3"]  # no segfault
>
> schema = pa.schema({"key" : pa.string()})
>
> table = pa.Table.from_pydict(data, schema = schema)
> print("now writing out test file")
> pq.write_table(table, "test.parquet")
> {code}
>
> This results in a segfault when writing the table. Running
>
> {code:java}
> gdb -ex r --args python test.py
> {code}
>
> yields
>
> {noformat}
> Program received signal SIGSEGV, Segmentation fault.
> 0x00007fffe8173917 in virtual thunk to parquet::DictEncoderImpl<parquet::DataType<(parquet::Type::type)6> >::Put(parquet::ByteArray const*, int) ()
>    from /net/fantasia/home/jweinstk/anaconda3/lib/python3.7/site-packages/pyarrow/libparquet.so.14
> {noformat}
>
> Thanks for all of your arrow work,
> Josh
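A minimal regression test along the lines suggested above could look like the following sketch. This is only an illustration, not a patch: the test name and placement are hypothetical, and it assumes pytest is available and that the mismatch surfaces as {{pyarrow.lib.ArrowTypeError}}, as in the traceback above.

{code:python}
import pytest
import pyarrow as pa


def test_from_pydict_schema_type_mismatch_raises():
    # Integer values paired with a string() schema used to crash later during
    # the Parquet write (ARROW-6573); on master the mismatch is reported up
    # front as an ArrowTypeError when the table is built.
    data = {"key": [0, 1, 2, 3]}
    schema = pa.schema({"key": pa.string()})
    with pytest.raises(pa.lib.ArrowTypeError):
        pa.Table.from_pydict(data, schema=schema)
{code}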