[ https://issues.apache.org/jira/browse/ARROW-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383377#comment-17383377 ]

Alessandro Molina commented on ARROW-13314:
-------------------------------------------

I was able to reproduce the issue locally. Notably, I only get the
abort/segfault when Arrow is built in debug mode; otherwise the process seems
to hang, waiting on some thread.

This is the mentioned exception:
{code}
Traceback (most recent call last):
  File "/home/amol/ARROW/tries/read.py", line 5, in <module>
    json.read_json('temp_file_arrow_3.ndjson', read_options=ro)
  File "pyarrow/_json.pyx", line 247, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?)
{code}
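
For reference, the read.py in the traceback is presumably something like this
minimal sketch, reconstructed from the traceback and the reporter's snippet
(the exact block_size value is an assumption; anything smaller than the file
should do):
{code:python}
from pyarrow import json
from pyarrow.json import ReadOptions

# Assumed repro: a block size smaller than the file triggers the error.
ro = ReadOptions(block_size=2**20)
json.read_json('temp_file_arrow_3.ndjson', read_options=ro)
{code}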

In debug mode I also get these two extra errors:
{code}
pure virtual method called
terminate called without an active exception
{code}

("pure virtual method called" usually means a virtual method was invoked on an
object while it was still being constructed or already being destroyed, which
would fit a thread lifetime race.) The backtrace I could get from gdb looks
like:
{code}
#4  0x00007ffff39a5567 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff39a62e5 in __cxa_pure_virtual () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff5ed13f0 in arrow::json::ChunkedStructArrayBuilder::InsertChildren (this=0xb89ae0, block_index=0, unconverted=...) at src/arrow/json/chunked_builder.cc:396
#7  0x00007ffff5ed0321 in arrow::json::ChunkedStructArrayBuilder::Insert (this=0xb89ae0, block_index=0, unconverted=std::shared_ptr<arrow::Array> (use count 1, weak count 0) = {...}) at src/arrow/json/chunked_builder.cc:320
#8  0x00007ffff5f2ba61 in arrow::json::TableReaderImpl::ParseAndInsert (this=0xc489b0, partial=std::shared_ptr<arrow::Buffer> (use count 1, weak count 0) = {...}, completion=std::shared_ptr<arrow::Buffer> (use count 1, weak count 0) = {...}, whole=std::shared_ptr<arrow::Buffer> (use count 1, weak count 0) = {...}, block_index=0) at src/arrow/json/reader.cc:158
#9  0x00007ffff5f2a331 in arrow::json::TableReaderImpl::Read()::{lambda()#1}::operator()() const (__closure=0xca6cb8) at src/arrow/json/reader.cc:104
...
{code}
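
As a side note (a sketch I have not run, not something from the original
report): the Python-level crash location can also be confirmed without gdb by
enabling the standard-library faulthandler module before the read, which
prints a traceback on SIGSEGV/SIGABRT:
{code:python}
import faulthandler
faulthandler.enable()  # dump a Python traceback if the process crashes

from pyarrow import json
from pyarrow.json import ReadOptions

ro = ReadOptions(block_size=2**20)
json.read_json('temp_file_arrow_3.ndjson', read_options=ro)
{code}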

> JSON parsing segment fault on long records (block_size) dependent
> -----------------------------------------------------------------
>
>                 Key: ARROW-13314
>                 URL: https://issues.apache.org/jira/browse/ARROW-13314
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Guido Muscioni
>            Priority: Major
>
> Hello,
> 
> I have a big JSON file (~300 MB) with complex records (nested JSON objects 
> and nested lists of JSON objects). When I try to read it with pyarrow I get 
> a segmentation fault. I then tried a couple of things with the read options; 
> please see the code below (I developed it against the example file attached 
> to https://issues.apache.org/jira/browse/ARROW-9612):
>  
> {code:python}
> from pyarrow import json
> from pyarrow.json import ReadOptions
> import tqdm
> 
> if __name__ == '__main__':
>     source = 'wiki_04.jsonl'
>     ro = ReadOptions(block_size=2**20)
>     with open(source, 'r') as file:
>         for i, line in tqdm.tqdm(enumerate(file)):
>             # Grow the temp file one line at a time and re-read it each time.
>             with open('temp_file_arrow_3.ndjson', 'a') as file2:
>                 file2.write(line)
>             json.read_json('temp_file_arrow_3.ndjson', read_options=ro)
> {code}
> For both the example file and my file, this code raises the straddling-object 
> exception (or segfaults) once the temp file reaches the block_size. 
> Increasing the block_size only makes the code fail later.
> I then tried providing an explicit schema for my file:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> from pyarrow import json
> from pyarrow.json import ParseOptions
> 
> if __name__ == '__main__':
>     source = 'my_file.jsonl'
>     # Infer the schema by loading the whole file with pandas first.
>     df = pd.read_json(source, lines=True)
>     table_schema = pa.Table.from_pandas(df).schema
> 
>     # explicit_schema lives on ParseOptions, not ReadOptions.
>     po = ParseOptions(explicit_schema=table_schema)
>     table = json.read_json(source, parse_options=po)
> {code}
> This works, which may suggest that both this issue and the one in the linked 
> JIRA ticket only appear when an explicit schema is not provided.
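> A cheaper variant of the same idea (an untested sketch; the 100-line sample 
> size is arbitrary and assumes those lines are representative of the whole 
> file) would be to infer the schema from just the first lines instead of 
> loading everything with pandas:
> {code:python}
> import io
> from pyarrow import json
> from pyarrow.json import ParseOptions
> 
> source = 'my_file.jsonl'
> 
> # Let read_json infer the schema from a small sample of the file.
> with open(source, 'rb') as f:
>     sample = b''.join(f.readline() for _ in range(100))
> sample_schema = json.read_json(io.BytesIO(sample)).schema
> 
> po = ParseOptions(explicit_schema=sample_schema)
> table = json.read_json(source, parse_options=po)
> {code}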
> Additionally, the following code works as well:
> {code:python}
> from pyarrow import json
> from pyarrow.json import ReadOptions
> 
> if __name__ == '__main__':
>     source = 'my_file.jsonl'
> 
>     # 2**30 bytes (1 GiB) is larger than the whole file here.
>     ro = ReadOptions(block_size=2**30)
>     table = json.read_json(source, read_options=ro)
> {code}
> In this case the block_size is larger than my file. Is it possible that the 
> schema is inferred from the first block, and that I get a segfault if the 
> schema changes in a later block?
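> If that is the cause, a workaround sketch (untested; it just generalizes the 
> hard-coded 2**30 above) would be to size the block to the file itself, so 
> that no record can straddle a block boundary:
> {code:python}
> import os
> from pyarrow import json
> from pyarrow.json import ReadOptions
> 
> source = 'my_file.jsonl'
> 
> # Make one block cover the entire file, at the cost of memory usage.
> ro = ReadOptions(block_size=os.path.getsize(source) + 1)
> table = json.read_json(source, read_options=ro)
> {code}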
> I cannot share my JSON file; however, I hope someone can shed some light on 
> what I am seeing and maybe suggest a workaround.
> Thank you,
>  Guido



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
