[ https://issues.apache.org/jira/browse/ARROW-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Guido Muscioni updated ARROW-13314:
-----------------------------------
    Attachment: 2020-2022 NY LG SG IND Gatekeeper contract codes.csv

> JSON parsing segment fault on long records (block_size) dependent
> -----------------------------------------------------------------
>
>                 Key: ARROW-13314
>                 URL: https://issues.apache.org/jira/browse/ARROW-13314
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Guido Muscioni
>            Priority: Major
>
> Hello,
>
> I have a large JSON file (~300MB) with complex records (nested JSON objects
> and nested lists of JSON objects). When I try to read it with pyarrow, I get
> a segmentation fault. I then tried a couple of the read options; please
> see the code below (I developed this code on an example file that was
> attached here: https://issues.apache.org/jira/browse/ARROW-9612):
>
> {code:python}
> from pyarrow import json
> from pyarrow.json import ReadOptions
> import tqdm
>
> if __name__ == '__main__':
>     source = 'wiki_04.jsonl'
>     ro = ReadOptions(block_size=2**20)
>     with open(source, 'r') as file:
>         for i, line in tqdm.tqdm(enumerate(file)):
>             # Grow the temporary file by one record ...
>             with open('temp_file_arrow_3.ndjson', 'a') as file2:
>                 file2.write(line)
>             # ... then re-read the whole temporary file with pyarrow.
>             json.read_json('temp_file_arrow_3.ndjson', read_options=ro)
> {code}
> For both the example file and my file, this code raises the straddling
> object exception (or segfaults) once the file reaches the block_size.
> Increasing the block_size makes the code fail later.
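
One way to narrow this down is to check whether any single record is larger than the configured block_size. The sketch below is a diagnostic only and uses just the standard library; the helper names `longest_record_bytes` and `records_exceeding` are hypothetical, and the idea that an oversized record is what triggers the failure is an assumption, not something the report confirms.

```python
import os
import tempfile


def longest_record_bytes(path):
    """Return the size in bytes of the longest newline-delimited record."""
    longest = 0
    with open(path, "rb") as fh:  # binary mode: measure bytes, not characters
        for raw_line in fh:
            longest = max(longest, len(raw_line))
    return longest


def records_exceeding(path, block_size):
    """Return (line_number, size_in_bytes) for each record larger than block_size."""
    oversized = []
    with open(path, "rb") as fh:
        for number, raw_line in enumerate(fh, start=1):
            if len(raw_line) > block_size:
                oversized.append((number, len(raw_line)))
    return oversized


if __name__ == "__main__":
    # Small synthetic file standing in for the real data (illustrative only).
    with tempfile.NamedTemporaryFile("wb", suffix=".ndjson", delete=False) as tmp:
        tmp.write(b'{"a": 1}\n')
        tmp.write(b'{"a": "' + b"x" * 100 + b'"}\n')
    print(longest_record_bytes(tmp.name))
    print(records_exceeding(tmp.name, block_size=64))
    os.remove(tmp.name)
```

If `records_exceeding(source, 2**20)` comes back empty for the real file, record length alone probably does not explain the failure, and the cause more likely lies in how consecutive blocks are handled.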
> Then I tried, on my file, providing an explicit schema:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> from pyarrow import json
> from pyarrow.json import ParseOptions
>
> if __name__ == '__main__':
>     source = 'my_file.jsonl'
>     # Infer the schema with pandas, then hand it to pyarrow.
>     df = pd.read_json(source, lines=True)
>     table_schema = pa.Table.from_pandas(df).schema
>
>     # explicit_schema is passed through ParseOptions.
>     po = ParseOptions(explicit_schema=table_schema)
>     table = json.read_json(source, parse_options=po)
> {code}
> This works, which may suggest that this issue, and the one in the linked
> JIRA issue, appear only when an explicit schema is not provided.
> Additionally, the following code works as well:
> {code:python}
> from pyarrow import json
> from pyarrow.json import ReadOptions
>
> if __name__ == '__main__':
>     source = 'my_file.jsonl'
>
>     # 2**30 bytes (1 GiB) is larger than the ~300MB input file.
>     ro = ReadOptions(block_size=2**30)
>     table = json.read_json(source, read_options=ro)
> {code}
> In this case the block_size is larger than my file. Is it possible that the
> schema is inferred from the first block, and that I get a segfault if the
> schema changes in a later block?
> I cannot share my JSON file; however, I hope that someone can add some
> clarity on what I am seeing and maybe suggest a workaround.
> Thank you,
> Guido



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
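
The last snippet in the report works with a block_size larger than the file. A minimal sketch of deriving such a value from the file size, rather than hard-coding 2**30, could look like the following. The helper name `block_size_covering_file` and the 1 MiB default are hypothetical, and this only mirrors the workaround described in the report; it does not address the underlying bug.

```python
import os


def block_size_covering_file(path, minimum=2**20):
    """Double `minimum` until it is strictly larger than the file size."""
    size = os.path.getsize(path)
    block_size = minimum
    while block_size <= size:
        block_size *= 2
    return block_size
```

The result could then be passed as `ReadOptions(block_size=...)`, the same parameter the report uses. The likely trade-off is that the whole file is parsed as a single block, so peak memory use grows with the file.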