Here are some findings that I believe violate the JSON standard. With the newly introduced configuration, we expect the JSON parsing behavior to align with the standard. There may still be additional uncovered cases.
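
As a reference point for what "aligned with the standard" could look like, here is a minimal sketch using Python's built-in `json` module rather than Spark. It is only an illustration: the helper `is_valid_json` and the sample inputs are hypothetical and not part of the proposal. Python's parser happens to be strict about single-quoted strings and raw control characters inside strings, though it is not a fully RFC 8259-conformant validator in every respect.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if Python's json module accepts `text` (hypothetical helper)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Single-quoted strings are not part of the JSON grammar.
single_quoted = "{'a': 'b'}"

# A raw line feed (U+000A) inside a string must be escaped.
raw_newline = '{"a": "line1\nline2"}'

# The same content with the line feed written as the two-character escape \n.
escaped_newline = '{"a": "line1\\nline2"}'

print(is_valid_json(single_quoted))    # False
print(is_valid_json(raw_newline))      # False
print(is_valid_json(escaped_newline))  # True
```

Under the proposed configuration, I would expect Spark to treat the first two inputs as malformed in the same way.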

1. Single Quotes for String Data: Spark allows the use of single quotes to represent string data through the `allowSingleQuotes` option (default: true). When enabled, single quotes are permitted. However, according to the JSON standard, single quotes are not valid and should be treated as malformed JSON.
2. JSON Data Containing Non-escaped Characters: As per https://www.rfc-editor.org/rfc/rfc8259, certain characters must be escaped in JSON, including quotation marks, reverse solidus, and control characters (U+0000 to U+001F). For instance, Spark allows non-escaped "\n" in JSON string data, but this violates the JSON standard and should be considered broken data.

For more detailed discussion, see this reference: https://lemire.me/blog/2025/07/04/just-say-no-to-broken-json/.

Thanks,
Philo

On Thu, Sep 4, 2025 at 6:33 PM Wenchen Fan <[email protected]> wrote:

> Do we have a list of behaviors we want to change after enabling the new
> config?
>
> On Thu, Sep 4, 2025 at 5:38 PM Philo <[email protected]> wrote:
>
>> Hi all,
>>
>> I am writing to initiate a discussion on enhancing Spark JSON parsing to
>> support standard compliance.
>>
>> ## Motivation
>> In the current version of Spark, the JSON parser is designed to be
>> compatible with Hive, which means that some behaviors may not adhere
>> strictly to standard JSON compliance. For instance, it permits the use
>> of single quotes in JSON parsing for functions such as get_json_object
>> and json_tuple. Additionally, there are other instances where the
>> behavior diverges from standard JSON practices.
>>
>> ## Proposal
>> To address these inconsistencies, we propose introducing a configuration
>> option that allows users to choose between legacy behavior and standard
>> compliance. This approach ensures that existing workflows remain unaffected
>> while providing the flexibility to adopt standard JSON practices if desired.
>>
>> See JIRA: https://issues.apache.org/jira/browse/SPARK-53281
>>
>> Thanks,
>> Philo
>>
>
