hi all, i noticed a weird behavior when spark parses nested json with a schema conflict.
i also just noticed that spark "fixed" this in the most recent release, 3.5.0, but since i'm working with AWS services, namely:

* EMR 6: spark 3.3.* / 3.4.*
* Glue 3: spark 3.1.1
* Glue 4: spark 3.3.0

https://docs.aws.amazon.com/glue/latest/dg/release-notes.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-app-versions-6.x.html

..we're still facing this issue company-wide.

the problem is that spark silently drops records (or even whole files) when there is a schema conflict or an empty string value (https://kb.databricks.com/notebooks/json-reader-parses-value-as-null). my whole concern here is that spark does not even emit a warning, error, or exception when such cases occur.

To reproduce (i'm using Amazon Linux 2 with Python 3.7.16 and pyspark==3.4.1):

echo '{"Records":[{"evtid":"1","requestParameters":{"DescribeHostsRequest":{"MaxResults":500}}}]}' > one.json
echo '{"Records":[{"evtid":"2","requestParameters":{"lol":{},"lol2":{}}},{"evtid":"3","requestParameters":{"DescribeHostsRequest":""}}]}' > two.json

Output of:

spark.read.json(["one.json", "two.json"]).show()

using Spark 3.1.0 and 3.3.0:

+--------------+
|       Records|
+--------------+
|          null|
|[{1, {{500}}}]|
+--------------+

drops the second (two.json) file.

using Spark 3.4.0:

+--------------+
|       Records|
+--------------+
|          null|
|[{1, {{500}}}]|
+--------------+

it completely drops the second (two.json) file.

using Spark 3.5.0:

+--------------------+
|             Records|
+--------------------+
|[{2, {NULL}}, {3,...|
|      [{1, {{500}}}]|
+--------------------+

it reads both files but completely drops the "requestParameters" content of all the records in the second (two.json) file:

{"evtid":"2","requestParameters":{}} <-- not good
{"evtid":"3","requestParameters":{}} <-- not good
{"evtid":"1","requestParameters":{"DescribeHostsRequest":{"MaxResults":500}}}

enabling

spark.conf.set("spark.sql.legacy.json.allowEmptyString.enabled", True)

as suggested by https://kb.databricks.com/notebooks/json-reader-parses-value-as-null in spark 3.1 and 3.3 yields the same result seen in spark 3.5, which is not ideal if one wants to later fetch the records as-is.

so far, the only solution i found was to explicitly enforce the schema when reading (rough sketch at the bottom of this message).

that said, does anyone know the exact thread or changelog entry where this issue was handled? i've checked the links below but they were inconclusive:

https://spark.apache.org/docs/latest/sql-migration-guide.html
https://spark.apache.org/releases/spark-release-3-5-0.html

another question: how would you handle this scenario? i could not find a clue even with full verbose logging enabled. i could maybe force spark to raise an exception when such a case is encountered, or send those bad/broken records to another file or bucket (DLQ-ish); sketches of both ideas are also at the bottom of this message.

best regards,
c.
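P.S. a rough sketch of the explicit-schema workaround i mentioned above. the field names just follow the toy events from the repro; in a real pipeline you'd widen the schema to cover every field you care about, since anything not listed is simply not read.

from pyspark.sql import SparkSession
from pyspark.sql.types import (ArrayType, LongType, StringType,
                               StructField, StructType)

spark = SparkSession.builder.getOrCreate()

# Schema matching the toy events above; fields not listed here are ignored,
# and nothing is left to (inconsistent) schema inference.
schema = StructType([
    StructField("Records", ArrayType(StructType([
        StructField("evtid", StringType()),
        StructField("requestParameters", StructType([
            StructField("DescribeHostsRequest", StructType([
                StructField("MaxResults", LongType()),
            ])),
        ])),
    ]))),
])

df = spark.read.schema(schema).json(["one.json", "two.json"])
df.show(truncate=False)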
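for the "force an exception" idea, the json reader has a FAILFAST mode that throws instead of silently nulling what it considers malformed. whether the empty-string-vs-struct conflict above is actually classified as malformed seems to depend on the spark version, so treat this as something to verify on your EMR/Glue runtime rather than a guaranteed fix (reuses `spark` and `schema` from the sketch above):

# FAILFAST raises a parse error instead of silently nulling/dropping records
# it considers malformed. Behaviour is version-dependent: verify on your runtime.
df = (spark.read
      .schema(schema)
      .option("mode", "FAILFAST")
      .json(["one.json", "two.json"]))
df.show()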
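and for the DLQ-ish idea, a sketch using the default PERMISSIVE mode plus a corrupt-record column, so whatever the reader could not parse is kept as raw text and can be routed somewhere else. the s3 paths are just placeholders, and again whether these particular records end up in the corrupt column is version-dependent, so this is a pattern to test, not a drop-in fix (reuses `schema` from the first sketch):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

# Add a column that will receive the raw text of any record spark fails to parse.
dlq_schema = StructType(schema.fields + [StructField("_corrupt_record", StringType())])

df = (spark.read
      .schema(dlq_schema)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json(["one.json", "two.json"])
      .cache())  # spark disallows filtering on only the corrupt column without caching first

bad = df.filter(F.col("_corrupt_record").isNotNull())
good = df.filter(F.col("_corrupt_record").isNull()).drop("_corrupt_record")

# Placeholder destinations: raw broken records to a DLQ prefix, the rest as parquet.
bad.select("_corrupt_record").write.mode("append").text("s3://my-bucket/json-dlq/")
good.write.mode("append").parquet("s3://my-bucket/json-clean/")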