[jira] [Commented] (ARROW-5568) [Python] Allow parsing more general JSON formats

Joris Van den Bossche (JIRA) Tue, 11 Jun 2019 23:29:03 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861790#comment-16861790
 ]


Joris Van den Bossche commented on ARROW-5568:
----------------------------------------------

{quote}I have JSON data where the columnar (line-delimited) part is in a `data` 
subkey:{quote}

Note that the {{data}} subpart is not line delimited, but a comma-delimited 
JSON array. So that's a first thing that would be good to support.
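
As a rough illustration of the difference, here is a minimal sketch (standard library only) that rewrites the {{data}} array as line-delimited JSON, one record per line. The helper name {{data_array_to_ndjson}} and its {{key}} parameter are made up for this example and are not part of pyarrow.

```python
import json

# The example document from the issue description.
DOC = """{
  "metadata": {"name": "block1"},
  "data": [
    {"a": 1, "b": 2.0, "c": "foo", "d": false},
    {"a": 4, "b": -5.5, "c": null, "d": true}
  ]
}"""


def data_array_to_ndjson(document, key="data"):
    """Hypothetical helper: emit the JSON array stored under `key`
    as line-delimited JSON (one record per line)."""
    records = json.loads(document)[key]
    return "".join(json.dumps(record) + "\n" for record in records)


ndjson = data_array_to_ndjson(DOC)
print(ndjson, end="")
```

The resulting text is in the one-record-per-line shape that a line-delimited JSON reader expects; the obvious cost is that the whole document is parsed in Python first.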

Some additional resources that might be useful: in pandas there are many 
formats supported, called "orients", see the overview table at 
http://pandas.pydata.org/pandas-docs/version/0.24/user_guide/io.html#reading-json
 (disclaimer: I don't know how common the different formats are, so it doesn't 
necessarily make sense to copy them all from pandas).
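
To give an idea of how these layouts differ, here is a small sketch (standard library only, no pandas) that reshapes the same two records into the "split" and "columns" layouts. The helper names are made up for this example, and it assumes every record has the same keys and that the row labels are simply 0, 1, 2, ...

```python
import json

# "records" layout: a list of row dicts.
records = [
    {"a": 1, "b": 2.0},
    {"a": 4, "b": -5.5},
]


def to_split(records):
    """Hypothetical helper: column names, row labels and row values
    stored under separate top-level keys."""
    columns = list(records[0]) if records else []
    return {
        "columns": columns,
        "index": list(range(len(records))),
        "data": [[rec[col] for col in columns] for rec in records],
    }


def to_columns(records):
    """Hypothetical helper: one mapping per column, keyed by row label."""
    columns = list(records[0]) if records else []
    return {
        col: {str(i): rec[col] for i, rec in enumerate(records)}
        for col in columns
    }


print(json.dumps(records))
print(json.dumps(to_split(records)))
print(json.dumps(to_columns(records)))
```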

One of the formats is the JSON Table Schema 
(https://frictionlessdata.io/specs/table-schema/), which is a JSON file with 
{{'metadata'}} and {{'data'}} top-level keys, where {{'data'}} 
consists of comma-delimited records (so very similar in structure to what 
[~dhirschfeld] showed above).
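
For a document of that shape, the handling could look roughly like the sketch below: the listed keys are pivoted from records into columns, and everything else is passed through as plain Python objects. This uses only the standard library; {{split_document}} and {{table_keys}} are names invented for this example, and the dict of lists merely stands in for an Arrow table (it is the kind of input {{pyarrow.Table.from_pydict}} accepts).

```python
import json

DOC = """{
  "metadata": {"name": "block1"},
  "data": [
    {"a": 1, "b": 2.0, "c": "foo", "d": false},
    {"a": 4, "b": -5.5, "c": null, "d": true}
  ]
}"""


def split_document(text, table_keys=("data",)):
    """Hypothetical helper: pivot the record arrays under `table_keys`
    into {column: [values]} and keep all other keys untouched."""
    result = {}
    for key, value in json.loads(text).items():
        if key not in table_keys:
            result[key] = value
            continue
        # Collect field names in first-seen order across all records.
        names = []
        for record in value:
            for name in record:
                if name not in names:
                    names.append(name)
        # Fields missing from a record become None (JSON null).
        result[key] = {n: [rec.get(n) for rec in value] for n in names}
    return result


block1 = split_document(DOC)
print(block1["metadata"])
print(block1["data"])
```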

> [Python] Allow parsing more general JSON formats
> ------------------------------------------------
>
>                 Key: ARROW-5568
>                 URL: https://issues.apache.org/jira/browse/ARROW-5568
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Dave Hirschfeld
>            Priority: Minor
>
> I have JSON data where the columnar (line-delimited) part is in a `data` 
> subkey:
> {code:java}
> {
>   "metadata": {"name": "block1"},
>   "data" : [
>     {"a": 1, "b": 2.0, "c": "foo", "d": false},
>     {"a": 4, "b": -5.5, "c": null, "d": true}
>   ]
> }
> {code}
>  
>  
> It would be good if the arrow JSON parser could allow specifying where the 
> columnar data is stored.
> Since the `metadata` is also important to me it would be even better if the 
> rest of the JSON could be returned as a Python dict with only the 
> specified keys parsed as arrow tables - e.g.
>  
> {code:java}
> >>> block1 = json.read_json(fn, tables=['data'])
> >>> block1['data']
> pyarrow.Table
> a: int64
> b: double
> c: string
> d: bool
> >>> block1['metadata']
> {'name': 'block1'}
> >>> block1
> {
>   "metadata": {"name": "block1"},
>   "data" : pyarrow.Table
> }{code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
