mikix opened a new issue, #45394:
URL: https://github.com/apache/arrow/issues/45394

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   ### Description
   If a file has one line of JSON, and there is no line ending at the end of 
the file, `pyarrow.dataset.dataset` will correctly infer a schema, but then 
fail to load the values from that row.
   
   ### Reproduction
   Bug case:
   ```shell
   echo -n '{"field": 1}' > test.json
   python3 -c 'import pyarrow.dataset; print(pyarrow.dataset.dataset("test.json", format="json").to_table().to_pandas())'
      field
   0    NaN
   ```
   
   With a newline it works as I'd expect:
   ```shell
   echo '{"field": 1}' > test.json
   python3 -c 'import pyarrow.dataset; print(pyarrow.dataset.dataset("test.json", format="json").to_table().to_pandas())'
      field
   0    1
   ```
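Since the newline-terminated file loads correctly, one possible stopgap is to make sure each file ends with a newline before handing it to `pyarrow.dataset.dataset`. The sketch below uses only the standard library; `ensure_trailing_newline` is a hypothetical helper name, not a PyArrow API, and it modifies the file in place.

```python
import os


def ensure_trailing_newline(path):
    """Append a newline to the file if it is non-empty and lacks one.

    Hypothetical helper (not part of PyArrow). Returns True if the file
    was modified, False if it was left untouched.
    """
    with open(path, "rb+") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            # Empty file: nothing to terminate.
            return False
        # Inspect only the last byte instead of reading the whole file.
        f.seek(-1, os.SEEK_END)
        if f.read(1) == b"\n":
            return False
        # Reposition explicitly before switching from reading to writing.
        f.seek(0, os.SEEK_END)
        f.write(b"\n")
        return True
```

Calling `ensure_trailing_newline("test.json")` before creating the dataset should turn the bug case into the working case shown above, assuming the missing newline is the only trigger.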
   
   With multiple lines but no final trailing newline it _also_ works as I'd 
expect:
   ```shell
   echo -en '{"field": 1}\n{"field": 2}' > test.json
   python3 -c 'import pyarrow.dataset; print(pyarrow.dataset.dataset("test.json", format="json").to_table().to_pandas())'
      field
   0    1
   1    2
   ```
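Taken together, the three cases suggest that only a file consisting of a single record with no line ending is affected. A rough check for that condition, assuming this pattern generalises (it has not been confirmed against the Arrow C++ reader), could look like the following; `is_single_unterminated_record` is a hypothetical name.

```python
def is_single_unterminated_record(path):
    """Return True if the file is non-empty and contains no newline at all.

    Hypothetical check (not part of PyArrow): this matches the one case
    in this report where the dataset reader returned a null value.
    """
    with open(path, "rb") as f:
        data = f.read()
    return bool(data) and b"\n" not in data
```

This reads the whole file into memory, which is fine for small test inputs like the ones here but would need a streaming variant for large files.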
   
   You'll notice it's inferring the schema correctly in all cases (it knows 
there's one field named `field`). If I change the data type to string, it also 
changes the default null value correctly, even though the value shouldn't be 
null:
   ```shell
   echo -n '{"field": "value"}' > test.json
   python3 -c 'import pyarrow.dataset; print(pyarrow.dataset.dataset("test.json", format="json").to_table().to_pandas())'
      field
   0    None
   ```
   
   AFAICT, this only affects datasets. `pyarrow.json.read_json()` works just 
fine, for example:
   ```shell
   echo -n '{"field": 1}' > test.json
   python3 -c 'import pyarrow.json; print(pyarrow.json.read_json("test.json").to_pandas())'
      field
   0      1
   ```
   
   ### Debug info
   Python: 3.12.3
   PyArrow: 19.0.0
   OS: Linux - Ubuntu 24.04.1
   
   (I'm reporting this against the Python component because that's what's easy 
for me to test locally and where I saw the issue, but I assume that the actual 
bug is lower in the stack.)
   
   ### Component(s)
   
   Python

