[jira] [Commented] (ARROW-12588) Expose JSON schema inference to Python API

Jira Tue, 04 May 2021 12:57:07 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-12588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339272#comment-17339272
 ]


Piotr Żelasko commented on ARROW-12588:
---------------------------------------

Thank you, that solved my issue :) I had read in the documentation that 
pa.array() works in "simple cases", so I assumed it wouldn't work for mine. 
But it did!

In case it's interesting: I have lists of JSON manifests representing different 
objects (in Lhotse [https://github.com/lhotse-speech/lhotse], a library for 
speech data pipelines that I'm developing, where Arrow currently helps me 
handle cases where the metadata itself can be massive, e.g. for terabyte-sized 
speech datasets). Some of the objects can be nested inside others, so the 
schema is quite complex; e.g. these two items can be held in the same manifest:

Item #1

{'id': 'cut-1', 'start': 0.0, 'duration': 10.0, 'channel': 0, 'supervisions': 
[{'id': 'sup-1', 'recording_id': 'irrelevant', 'start': 0.5, 'duration': 6.0, 
'channel': 0}, {'id': 'sup-2', 'recording_id': 'irrelevant', 'start': 7.0, 
'duration': 2.0, 'channel': 0}], 'features': {'type': 'fbank', 'num_frames': 
100, 'num_features': 40, 'frame_shift': 0.01, 'sampling_rate': 16000, 'start': 
0.0, 'duration': 10.0, 'storage_type': 'lilcom', 'storage_path': 'irrelevant', 
'storage_key': 'irrelevant'}, 'recording': {'id': 'rec-1', 'sources': 
[{'type': 'file', 'channels': [0], 'source': 'irrelevant'}], 'sampling_rate': 
16000, 'num_samples': 160000, 'duration': 10.0}, 'type': 'Cut'}

Item #2

{'id': '3693dee0-1ac8-4f5a-a8c1-d6b4f6f80fbb', 'tracks': [{'cut': {'id': 
'cut-1', 'start': 0.0, 'duration': 10.0, 'channel': 0, 'supervisions': [{'id': 
'sup-1', 'recording_id': 'irrelevant', 'start': 0.5, 'duration': 6.0, 
'channel': 0}, {'id': 'sup-2', 'recording_id': 'irrelevant', 'start': 7.0, 
'duration': 2.0, 'channel': 0}], 'features': {'type': 'fbank', 'num_frames': 
100, 'num_features': 40, 'frame_shift': 0.01, 'sampling_rate': 16000, 'start': 
0.0, 'duration': 10.0, 'storage_type': 'lilcom', 'storage_path': 'irrelevant', 
'storage_key': 'irrelevant'}, 'recording': {'id': 'rec-1', 'sources': 
[{'type': 'file', 'channels': [0], 'source': 'irrelevant'}], 'sampling_rate': 
16000, 'num_samples': 160000, 'duration': 10.0}}, 'offset': 0.0}, {'cut': 
{'id': 'cut-1', 'start': 0.0, 'duration': 10.0, 'channel': 0, 'supervisions': 
[{'id': 'sup-1', 'recording_id': 'irrelevant', 'start': 0.5, 'duration': 6.0, 
'channel': 0}, {'id': 'sup-2', 'recording_id': 'irrelevant', 'start': 7.0, 
'duration': 2.0, 'channel': 0}], 'features': {'type': 'fbank', 'num_frames': 
100, 'num_features': 40, 'frame_shift': 0.01, 'sampling_rate': 16000, 'start': 
0.0, 'duration': 10.0, 'storage_type': 'lilcom', 'storage_path': 'irrelevant', 
'storage_key': 'irrelevant'}, 'recording': {'id': 'rec-1', 'sources': 
[{'type': 'file', 'channels': [0], 'source': 'irrelevant'}], 'sampling_rate': 
16000, 'num_samples': 160000, 'duration': 10.0}}, 'offset': 5.0, 'snr': 8}], 
'type': 'MixedCut'}

It turns out that pa.array works just fine with a list of those.

> Expose JSON schema inference to Python API
> ------------------------------------------
>
>                 Key: ARROW-12588
>                 URL: https://issues.apache.org/jira/browse/ARROW-12588
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Piotr Żelasko
>            Priority: Minor
>
> When using `pyarrow.json.read_json()`, the schema is automatically inferred. 
> It would be useful to infer the schema from a json that is already loaded in 
> memory (i.e. possibly a list of dicts in Python).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
