davidlghellin opened a new pull request, #21264:
URL: https://github.com/apache/datafusion/pull/21264
## Which issue does this PR close?
- Part of #15914
## Rationale for this change
`json_tuple` parsed JSON numbers into `serde_json::Value` and re-serialized
them via `Number::to_string()`. This lost the original number text in two
cases verified against Spark 4.1.1:
| Input | Spark | DataFusion (before) |
|---|---|---|
| `1.5e10` | `1.5E10` | `15000000000.0` |
| `99999999999999999999` | `99999999999999999999` | `1e+20` |
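The root cause is the round-trip through a binary float: once a number is stored as an `f64`, the original digits and notation are gone. A minimal std-only illustration using the values from the table above (serde_json's exact output formatting is not reproduced here):

```rust
fn main() {
    // "99999999999999999999" has 20 significant digits; the nearest f64
    // is exactly 1e20, so parse-then-serialize cannot recover the digits.
    let big: f64 = "99999999999999999999".parse().unwrap();
    assert_eq!(big, 1e20);

    // "1.5e10" parses fine, but the exponent notation itself is lost:
    // Rust's Display renders the value in plain decimal form.
    let sci: f64 = "1.5e10".parse().unwrap();
    assert_eq!(format!("{sci}"), "15000000000");

    println!("round-trip demonstrably lossy");
}
```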
## What changes are included in this PR?
- Switched from `serde_json::Value` to `HashMap<String, Box<RawValue>>` in
`json_tuple_inner` to preserve original JSON text for numbers
- Added `raw_value` feature to `serde_json` dependency in `datafusion-spark`
(lightweight, no behavior change for other code)
- Spark uppercases exponent notation (`1.5e10` → `1.5E10`), handled with a
simple `replace('e', "E")`
- Added 8 new SLT tests: scientific notation, large integers, normal
int/float, trailing comma, empty key, `"null"` as key, and interleaved
existing/missing fields
- Added 5 unit tests for number precision edge cases
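Keeping the raw number text via `RawValue` requires serde_json's `raw_value` feature and is not reproduced here; the Spark-style exponent uppercasing, however, can be sketched with the standard library alone. The helper name `uppercase_exponent` is illustrative, not the PR's actual API; the PR applies the equivalent `replace('e', "E")` inline:

```rust
// Illustrative helper: uppercase the exponent marker of a JSON number
// literal, mirroring Spark's rendering (`1.5e10` -> `1.5E10`). A plain
// replace is safe because 'e' can only appear in a JSON number as the
// exponent marker.
fn uppercase_exponent(num_text: &str) -> String {
    num_text.replace('e', "E")
}

fn main() {
    assert_eq!(uppercase_exponent("1.5e10"), "1.5E10");
    // Numbers without an exponent pass through unchanged.
    assert_eq!(uppercase_exponent("99999999999999999999"),
               "99999999999999999999");
    println!("exponent normalization ok");
}
```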
## Are these changes tested?
Yes.
- 7 unit tests in `json_tuple.rs` (5 new for number precision + 2 existing)
- 27 SLT tests in `spark/json/json_tuple.slt` (8 new + 19 existing)
- All results validated against Spark 4.1.1
## Are there any user-facing changes?
Yes — `json_tuple` now returns the original JSON number text instead of a
re-serialized float. This is a correctness fix aligning with Spark behavior.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]