[ https://issues.apache.org/jira/browse/ARROW-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621277#comment-17621277 ]
Ben Harkins commented on ARROW-18106:
-------------------------------------

That is indeed unexpected... especially since it comes back as a plain string in the first case. I suspect it's an issue with timestamps specifically (or potentially any non-string type with a JSON string representation). Test coverage seems to be lacking in this area. I'll take a look at it.

> [C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer"
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-18106
>                 URL: https://issues.apache.org/jira/browse/ARROW-18106
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Ben Harkins
>            Priority: Major
>              Labels: json
>
> Not 100% sure this is a "bug", but at least I find it an unexpected interplay
> between two options.
> By default, when reading JSON, we _infer_ the data type of columns, and when
> specifying an explicit schema, we _also_ by default infer the type of columns
> that are not specified in the explicit schema. The docs for
> {{unexpected_field_behavior}}:
> > How JSON fields outside of explicit_schema (if given) are treated
> But it seems that if you specify a schema, and the parsing of one of the
> columns fails according to that schema, we still fall back to this default of
> inferring the data type (while I would have expected an error, since we
> should only infer for columns _not_ in the schema).
> Example code using pyarrow:
> {code:python}
> import io
> import pyarrow as pa
> from pyarrow import json
> s_json = """{"column":"2022-09-05T08:08:46.000"}"""
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", pa.timestamp("s"))]))
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> The parsing fails here because the value has milliseconds and the type is "s",
> but the explicit schema is ignored, and we get a string column as the result:
> {code}
> pyarrow.Table
> column: string
> ----
> column: [["2022-09-05T08:08:46.000"]]
> {code}
> But when adding {{unexpected_field_behavior="ignore"}}, we actually get the
> expected parse error:
> {code:python}
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", pa.timestamp("s"))]), unexpected_field_behavior="ignore")
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> gives
> {code}
> ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't parse:2022-09-05T08:08:46.000
> {code}
> This might be specific to timestamps; I don't directly see a similar
> issue with e.g. {{"column": "A"}} and setting the schema to "column" being
> int64.
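
The semantics the description expects can be modeled in a few lines of plain Python. This is only a sketch under stated assumptions, not Arrow's implementation: `read_record`, `CONVERTERS`, `parse_timestamp_s`, and the string type name `"timestamp[s]"` are hypothetical stand-ins for the reader, its converters, and `pa.timestamp("s")`. The point it illustrates is that `unexpected_field_behavior` should only affect fields outside the explicit schema, while a conversion failure on a field inside the schema should always raise.

```python
import json
from datetime import datetime


def parse_timestamp_s(value):
    # Hypothetical stand-in for timestamp("s"): accept only
    # second-resolution ISO 8601 strings, so fractional seconds fail.
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")


# Hypothetical converter table keyed by a type name (not an Arrow API).
CONVERTERS = {"timestamp[s]": parse_timestamp_s}


def read_record(line, explicit_schema, unexpected_field_behavior="infer"):
    if unexpected_field_behavior not in ("infer", "ignore", "error"):
        raise ValueError(f"unknown behavior: {unexpected_field_behavior!r}")
    out = {}
    for name, value in json.loads(line).items():
        if name in explicit_schema:
            # Field is in the explicit schema: convert or fail, never infer.
            type_name = explicit_schema[name]
            try:
                out[name] = CONVERTERS[type_name](value)
            except (ValueError, TypeError) as exc:
                raise ValueError(
                    f"cannot convert field {name!r} to {type_name}: {value!r}"
                ) from exc
        elif unexpected_field_behavior == "infer":
            # Field is outside the schema: keep the type the JSON parser produced.
            out[name] = value
        elif unexpected_field_behavior == "error":
            raise ValueError(f"unexpected field {name!r}")
        # "ignore": drop fields that are outside the schema.
    return out
```

Under this model, the record from the report, `{"column":"2022-09-05T08:08:46.000"}`, raises for every value of `unexpected_field_behavior`, because `column` is in the schema and its value carries fractional seconds.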