mikix opened a new issue, #45394:
URL: https://github.com/apache/arrow/issues/45394
### Describe the bug, including details regarding any error messages, version, and platform.
### Description
If a file has one line of JSON, and there is no line ending at the end of
the file, `pyarrow.dataset.dataset` will correctly infer a schema, but then
fail to load the values from that row.
### Reproduction
Bug case:
```shell
echo -n '{"field": 1}' > test.json
python3 -c 'import pyarrow.dataset;
print(pyarrow.dataset.dataset("test.json",
format="json").to_table().to_pandas())'
field
0 NaN
```
With a newline it works as I'd expect:
```shell
echo '{"field": 1}' > test.json
python3 -c 'import pyarrow.dataset;
print(pyarrow.dataset.dataset("test.json",
format="json").to_table().to_pandas())'
field
0 1
```
With multiple lines but no final trailing newline it _also_ works as I'd
expect:
```shell
echo -en '{"field": 1}\n{"field": 2}' > test.json
python3 -c 'import pyarrow.dataset;
print(pyarrow.dataset.dataset("test.json",
format="json").to_table().to_pandas())'
field
0 1
1 2
```
You'll notice it's inferring the schema correctly in all cases (it knows
there's one field named `field`). If I change the data type to string, the
null placeholder changes to match the inferred type (`None` instead of `NaN`),
even though the value shouldn't be null at all:
```shell
echo -n '{"field": "value"}' > test.json
python3 -c 'import pyarrow.dataset;
print(pyarrow.dataset.dataset("test.json",
format="json").to_table().to_pandas())'
field
0 None
```
AFAICT, this only affects datasets. `pyarrow.json.read_json()` works just
fine, for example:
```shell
echo -n '{"field": 1}' > test.json
python3 -c 'import pyarrow.json;
print(pyarrow.json.read_json("test.json").to_pandas())'
field
0 1
```
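For convenience, the cases above can be folded into one Python script that compares the two readers side by side. This is only a sketch: `write_cases`, `compare_readers`, and `main` are names made up for illustration, and the only PyArrow calls used are `pyarrow.dataset.dataset(...).to_table()`, `pyarrow.json.read_json()`, and `Table.to_pylist()`. Based on the results above, I would expect only the single-line, no-newline case to be flagged on 19.0.0, but I have not checked other versions.

```python
import os
import tempfile


def write_cases(directory):
    """Write the three inputs from this report and return {case name: path}."""
    cases = {
        "one_line_no_newline": '{"field": 1}',
        "one_line_newline": '{"field": 1}\n',
        "two_lines_no_newline": '{"field": 1}\n{"field": 2}',
    }
    paths = {}
    for name, text in cases.items():
        path = os.path.join(directory, name + ".json")
        # newline="" keeps "\n" as-is on every platform.
        with open(path, "w", newline="") as f:
            f.write(text)
        paths[name] = path
    return paths


def compare_readers(path):
    """Return (rows via the dataset API, rows via read_json) for one file."""
    # Imported lazily so write_cases() can be used without PyArrow installed.
    import pyarrow.dataset as ds
    import pyarrow.json as pj

    via_dataset = ds.dataset(path, format="json").to_table().to_pylist()
    via_read_json = pj.read_json(path).to_pylist()
    return via_dataset, via_read_json


def main():
    """Print, for each case, whether the two readers agree."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, path in write_cases(tmp).items():
            via_dataset, via_read_json = compare_readers(path)
            status = "ok" if via_dataset == via_read_json else "MISMATCH"
            print(name, status, via_dataset, via_read_json)
```

Calling `main()` prints one line per case; any line marked `MISMATCH` is a file where the dataset reader and `read_json()` disagree.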
### Debug info
Python: 3.12.3
PyArrow: 19.0.0
OS: Linux - Ubuntu 24.04.1
(I'm reporting this against the Python component because that's what's easy
for me to test locally and where I saw the issue, but I assume that the actual
bug is lower in the stack.)
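Since the single-line file loads correctly once it ends with a newline, one possible stopgap is to terminate the file before handing it to the dataset API. The helper below is a minimal sketch using only the standard library. `ensure_trailing_newline` is a hypothetical name, not a PyArrow function, and it modifies the file in place, which may not be acceptable for every pipeline.

```python
import os


def ensure_trailing_newline(path):
    """Append a newline to the file at `path` if its last byte is not one.

    Returns True if the file was modified, False otherwise.
    """
    if os.path.getsize(path) == 0:
        return False  # empty file: there is no last line to terminate
    with open(path, "rb+") as f:
        # Look at the final byte only, so large files are not read in full.
        f.seek(-1, os.SEEK_END)
        if f.read(1) == b"\n":
            return False
        # Explicitly reposition at end of file before writing.
        f.seek(0, os.SEEK_END)
        f.write(b"\n")
        return True
```

Calling `ensure_trailing_newline("test.json")` before `pyarrow.dataset.dataset("test.json", format="json")` turns the failing input above into the working one. It only sidesteps the problem and does not address the underlying cause.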
### Component(s)
Python