[ 
https://issues.apache.org/jira/browse/ARROW-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guido Muscioni updated ARROW-13314:
-----------------------------------
    Attachment: 2020-2022 NY LG SG IND Gatekeeper contract codes.csv

> JSON parsing segment fault on long records (block_size) dependent
> -----------------------------------------------------------------
>
>                 Key: ARROW-13314
>                 URL: https://issues.apache.org/jira/browse/ARROW-13314
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Guido Muscioni
>            Priority: Major
>
> Hello,
>  
> I have a large JSON file (~300 MB) with complex records (nested JSON 
> objects, nested lists of JSON objects). When I try to read it with pyarrow 
> I get a segmentation fault. I then tried a couple of the read options; 
> please see the code below (I developed it against the example file 
> attached to https://issues.apache.org/jira/browse/ARROW-9612):
>  
> {code:python}
> from pyarrow import json
> from pyarrow.json import ReadOptions
> import tqdm
> 
> if __name__ == '__main__':
>     source = 'wiki_04.jsonl'
>     ro = ReadOptions(block_size=2**20)
>     with open(source, 'r') as file:
>         for i, line in tqdm.tqdm(enumerate(file)):
>             # Append one more record, then re-read the growing file.
>             with open('temp_file_arrow_3.ndjson', 'a') as file2:
>                 file2.write(line)
>             json.read_json('temp_file_arrow_3.ndjson', read_options=ro)
> {code}
> For both the example file and my file, this code raises the straddling-object 
> exception (or a seg fault) once the file reaches the block_size; increasing 
> block_size only makes the failure happen later.
> I then tried supplying an explicit schema for my file:
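> A minimal mitigation sketch (my own addition, assuming the failure is a 
> single record straddling a block boundary): size block_size off the longest 
> line so that no record can straddle. The helper name safe_block_size is 
> hypothetical:
> {code:python}
> def safe_block_size(path):
>     """Return a power-of-two block_size no single line can straddle."""
>     with open(path, 'rb') as f:
>         max_line = max(len(line) for line in f)
>     # Round up to the next power of two >= the longest line.
>     return 1 << (max_line - 1).bit_length()
> {code}
> This only helps if the longest record fits in memory comfortably; it scans 
> the file once up front to find it.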
> {code:python}
> import pyarrow as pa
> from pyarrow import json
> from pyarrow.json import ReadOptions
> import pandas as pd
> 
> if __name__ == '__main__':
>     source = 'my_file.jsonl'
>     # Infer the schema once with pandas, then hand it to pyarrow.
>     df = pd.read_json(source, lines=True)
>     table_schema = pa.Table.from_pandas(df).schema
> 
>     ro = ReadOptions(explicit_schema=table_schema)
>     table = json.read_json(source, read_options=ro)
> {code}
> This works, which suggests that this issue, and the linked JIRA issue, only 
> appear when no explicit schema is provided. Additionally, the following 
> code works as well:
> {code:python}
> from pyarrow import json
> from pyarrow.json import ReadOptions
> 
> if __name__ == '__main__':
>     source = 'my_file.jsonl'
> 
>     # A block_size larger than the whole file, so it all fits in one block.
>     ro = ReadOptions(block_size=2**30)
>     table = json.read_json(source, read_options=ro)
> {code}
> In this case block_size is larger than my file. Is it possible that the 
> schema is inferred from the first block, and that a seg fault occurs when 
> later blocks deviate from it?
> I cannot share my JSON file; however, I hope someone can shed light on what 
> I am seeing and perhaps suggest a workaround.
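> One workaround sketch that avoids loading the whole file through pandas: 
> infer the schema from a small prefix of the file and pass it as 
> explicit_schema for the full read. The helper name schema_from_sample and 
> the n_lines parameter are my own, and this assumes the schema is already 
> stable within the sampled lines:
> {code:python}
> import io
> from itertools import islice
> 
> from pyarrow import json
> 
> def schema_from_sample(path, n_lines=100):
>     """Infer a schema from the first n_lines, for use as explicit_schema."""
>     with open(path, 'rb') as f:
>         sample = b''.join(islice(f, n_lines))
>     # read_json also accepts file-like objects, so feed it the prefix only.
>     return json.read_json(io.BytesIO(sample)).schema
> 
> # usage (hypothetical file name):
> #   ro = ReadOptions(explicit_schema=schema_from_sample('my_file.jsonl'))
> #   table = json.read_json('my_file.jsonl', read_options=ro)
> {code}
> If the schema drifts after the sampled lines, this presumably hits the same 
> failure, so a fix in the reader itself would still be needed.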
> Thank you,
>  Guido



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
