[ 
https://issues.apache.org/jira/browse/ARROW-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621277#comment-17621277
 ] 

Ben Harkins commented on ARROW-18106:
-------------------------------------

That is indeed unexpected... especially since it comes back as a plain string 
in the first case. I suspect it's an issue with timestamps specifically (or 
potentially any non-string type with a json string representation). Test 
coverage seems to be lacking in this area.

I'll take a look at it.

> [C++] JSON reader ignores explicit schema with default 
> unexpected_field_behavior="infer"
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-18106
>                 URL: https://issues.apache.org/jira/browse/ARROW-18106
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Ben Harkins
>            Priority: Major
>              Labels: json
>
> Not 100% sure this is a "bug", but at least I find it an unexpected interplay 
> between two options.
> By default, when reading json, we _infer_ the data type of columns, and when 
> specifying an explicit schema, we _also_ by default infer the type of columns 
> that are not specified in the explicit schema. The docs for 
> {{unexpected_field_behavior}}:
> > How JSON fields outside of explicit_schema (if given) are treated
> But it seems that if you specify a schema, and the parsing of one of the 
> columns fails according to that schema, we still fall back to this default of 
> inferring the data type (while I would have expected an error, since we 
> should only infer for columns _not_ in the schema).
> Example code using pyarrow:
> {code:python}
> import io
> import pyarrow as pa
> from pyarrow import json
> s_json = """{"column":"2022-09-05T08:08:46.000"}"""
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
> pa.timestamp("s"))]))
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> The parsing fails here because the value has milliseconds and the type is "s", 
> but the explicit schema is ignored and we get back a table with a string 
> column:
> {code}
> pyarrow.Table
> column: string
> ----
> column: [["2022-09-05T08:08:46.000"]]
> {code}
> But when adding {{unexpected_field_behavior="ignore"}}, we actually get the 
> expected parse error:
> {code:python}
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
> pa.timestamp("s"))]), unexpected_field_behavior="ignore")
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> gives
> {code}
> ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't 
> parse:2022-09-05T08:08:46.000
> {code}
> It might be that this is specific to timestamps; I don't immediately see a 
> similar issue with e.g. {{"column": "A"}} and a schema that declares "column" 
> as int64.
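
The semantics the reporter expected can be sketched in plain Python. This is only a toy model using the standard library, not Arrow's reader: the names read_record and parse_timestamp_s are hypothetical, and the schema is assumed to be a simple mapping from field name to a converter function. The point it illustrates is that unexpected_field_behavior should only affect fields outside the explicit schema, while a conversion failure on a schema field should always be an error.

```python
import json
from datetime import datetime


def parse_timestamp_s(value):
    # Hypothetical converter for second-resolution timestamps.
    # Fractional seconds are not covered by the format, so strptime
    # raises ValueError for input such as "2022-09-05T08:08:46.000".
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")


def read_record(line, explicit_schema, unexpected_field_behavior="infer"):
    """Toy model of the expected behavior (not Arrow's implementation).

    explicit_schema maps field names to converter callables.
    """
    if unexpected_field_behavior not in ("infer", "ignore", "error"):
        raise ValueError(f"unknown behavior: {unexpected_field_behavior!r}")

    record = {}
    for name, value in json.loads(line).items():
        if name in explicit_schema:
            # Schema field: a failed conversion propagates as an error,
            # regardless of unexpected_field_behavior.
            record[name] = explicit_schema[name](value)
        elif unexpected_field_behavior == "infer":
            # Field outside the schema: keep the value as parsed from JSON.
            record[name] = value
        elif unexpected_field_behavior == "error":
            raise ValueError(f"unexpected field: {name!r}")
        # "ignore": drop the field silently.
    return record
```

Under this model, the input from the report fails for every value of unexpected_field_behavior, because "column" is part of the schema and its value does not convert to a second-resolution timestamp.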



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
