[ https://issues.apache.org/jira/browse/ARROW-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383377#comment-17383377 ]
Alessandro Molina commented on ARROW-13314:
-------------------------------------------

I was able to reproduce the issue locally. By the way, I seem to get the abort/segfault only when Arrow is built in debug mode; otherwise it seems to freeze waiting for some thread.

This is the mentioned exception:

{code}
Traceback (most recent call last):
  File "/home/amol/ARROW/tries/read.py", line 5, in <module>
    json.read_json('temp_file_arrow_3.ndjson', read_options=ro)
  File "pyarrow/_json.pyx", line 247, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?)
{code}

In debug mode I also get these two extra errors:

{code}
pure virtual method called
terminate called without an active exception
{code}

and the traceback I could get from gdb looks like:

{code}
#4  0x00007ffff39a5567 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff39a62e5 in __cxa_pure_virtual () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff5ed13f0 in arrow::json::ChunkedStructArrayBuilder::InsertChildren (this=0xb89ae0, block_index=0, unconverted=...)
    at src/arrow/json/chunked_builder.cc:396
#7  0x00007ffff5ed0321 in arrow::json::ChunkedStructArrayBuilder::Insert (this=0xb89ae0, block_index=0,
    unconverted=std::shared_ptr<arrow::Array> (use count 1, weak count 0) = {...}) at src/arrow/json/chunked_builder.cc:320
#8  0x00007ffff5f2ba61 in arrow::json::TableReaderImpl::ParseAndInsert (this=0xc489b0,
    partial=std::shared_ptr<arrow::Buffer> (use count 1, weak count 0) = {...},
    completion=std::shared_ptr<arrow::Buffer> (use count 1, weak count 0) = {...},
    whole=std::shared_ptr<arrow::Buffer> (use count 1, weak count 0) = {...}, block_index=0) at src/arrow/json/reader.cc:158
#9  0x00007ffff5f2a331 in arrow::json::TableReaderImpl::Read()::{lambda()#1}::operator()() const (__closure=0xca6cb8)
    at src/arrow/json/reader.cc:104
...
{code}
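The error message above hints at the mechanism: a single JSON record that spans more than two parser blocks cannot be re-assembled, so read_json fails regardless of how large the rest of the file is. Below is a minimal sketch illustrating this; it is not taken from the original report, and the file name and sizes are illustrative assumptions. On affected builds the small-block read may abort or segfault instead of raising cleanly, as described above.

{code:python}
# Minimal sketch (assumed file name and sizes, not from the original report):
# one record larger than block_size cannot fit within two parser blocks,
# which triggers the "straddling object" error (or, on affected builds, the
# crash described above).
import json as stdlib_json

import pyarrow as pa
from pyarrow import json as pa_json

path = 'straddling_example.ndjson'  # hypothetical file name
with open(path, 'w') as f:
    # one record of roughly 2 MiB, larger than the 1 MiB block_size used below
    f.write(stdlib_json.dumps({'text': 'x' * (2 * 2**20)}) + '\n')

try:
    pa_json.read_json(path, read_options=pa_json.ReadOptions(block_size=2**20))
except pa.ArrowInvalid as exc:
    print('block_size=2**20:', exc)

# with a block_size larger than the longest record, the read succeeds
table = pa_json.read_json(path, read_options=pa_json.ReadOptions(block_size=2**22))
print('block_size=2**22:', table.num_rows, 'row(s)')
{code}

This mirrors the reporter's observation below that a block_size larger than the whole file avoids the problem, since no record can then cross a block boundary.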
> JSON parsing segment fault on long records (block_size) dependent
> -----------------------------------------------------------------
>
>                 Key: ARROW-13314
>                 URL: https://issues.apache.org/jira/browse/ARROW-13314
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Guido Muscioni
>            Priority: Major
>
> Hello,
>
> I have a big JSON file (~300 MB) with complex records (nested JSON, nested lists of JSON objects). When I try to read it with pyarrow I get a segmentation fault. I then tried a couple of things with the read options; please see the code below (I developed it against an example file attached to https://issues.apache.org/jira/browse/ARROW-9612):
>
> {code:python}
> from pyarrow import json
> from pyarrow.json import ReadOptions
> import tqdm
>
> if __name__ == '__main__':
>     source = 'wiki_04.jsonl'
>     ro = ReadOptions(block_size=2**20)
>     with open(source, 'r') as file:
>         for i, line in tqdm.tqdm(enumerate(file)):
>             with open('temp_file_arrow_3.ndjson', 'a') as file2:
>                 file2.write(line)
>             json.read_json('temp_file_arrow_3.ndjson', read_options=ro)
> {code}
> For both the example file and my file, this code raises the straddling object exception (or seg faults) once the temporary file reaches the block_size. Increasing the block_size only makes it fail later.
> Then I tried, on my file, to provide an explicit schema:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> from pyarrow import json
> from pyarrow.json import ParseOptions
>
> if __name__ == '__main__':
>     source = 'my_file.jsonl'
>     df = pd.read_json(source, lines=True)
>     table_schema = pa.Table.from_pandas(df).schema
>
>     po = ParseOptions(explicit_schema=table_schema)
>     table = json.read_json(source, parse_options=po)
> {code}
> This works, which may suggest that this issue, and the one in the linked JIRA ticket, only appear when an explicit schema is not provided.
> Additionally, the following code works as well:
> {code:python}
> from pyarrow import json
> from pyarrow.json import ReadOptions
>
> if __name__ == '__main__':
>     source = 'my_file.jsonl'
>
>     ro = ReadOptions(block_size=2**30)
>     table = json.read_json(source, read_options=ro)
> {code}
> In this case the block_size is bigger than my file. Is it possible that the schema is inferred from the first block, and that I get a seg fault when the schema changes in a later block?
> I cannot share my JSON file; however, I hope someone can shed some light on what I am seeing and maybe suggest a workaround.
>
> Thank you,
> Guido



--
This message was sent by Atlassian Jira
(v8.3.4#803005)