[
https://issues.apache.org/jira/browse/ARROW-12588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339272#comment-17339272
]
Piotr Żelasko commented on ARROW-12588:
---------------------------------------
Thank you, you solved my issue :) Previously, I read in the documentation that
pa.array() works in "simple cases", so I assumed it wouldn't work for
mine. But it did!
In case it's of interest: I have lists of JSON manifests representing different
objects (in Lhotse [https://github.com/lhotse-speech/lhotse] -- a library for
speech data pipelines that I'm developing, where Arrow currently helps me deal
with cases where the metadata itself can be massive, e.g. for terabyte-sized
speech datasets). Some objects are composed of others, so the schema is quite
complex, e.g. these two items can be held in the same manifest:
Item #1
{'id': 'cut-1', 'start': 0.0, 'duration': 10.0, 'channel': 0, 'supervisions':
[{'id': 'sup-1', 'recording_id': 'irrelevant', 'start': 0.5, 'duration': 6.0,
'channel': 0}, {'id': 'sup-2', 'recording_id': 'irrelevant', 'start': 7.0,
'duration': 2.0, 'channel': 0}], 'features': {'type': 'fbank', 'num_frames':
100, 'num_features': 40, 'frame_shift': 0.01, 'sampling_rate': 16000, 'start':
0.0, 'duration': 10.0, 'storage_type': 'lilcom', 'storage_path': 'irrelevant',
'storage_key': 'irrelevant'}, 'recording': {'id': 'rec-1', 'sources':
[{'type': 'file', 'channels': [0], 'source': 'irrelevant'}], 'sampling_rate':
16000, 'num_samples': 160000, 'duration': 10.0}, 'type': 'Cut'}
Item #2
{'id': '3693dee0-1ac8-4f5a-a8c1-d6b4f6f80fbb', 'tracks': [{'cut': {'id':
'cut-1', 'start': 0.0, 'duration': 10.0, 'channel': 0, 'supervisions': [{'id':
'sup-1', 'recording_id': 'irrelevant', 'start': 0.5, 'duration': 6.0,
'channel': 0}, {'id': 'sup-2', 'recording_id': 'irrelevant', 'start': 7.0,
'duration': 2.0, 'channel': 0}], 'features': {'type': 'fbank', 'num_frames':
100, 'num_features': 40, 'frame_shift': 0.01, 'sampling_rate': 16000, 'start':
0.0, 'duration': 10.0, 'storage_type': 'lilcom', 'storage_path': 'irrelevant',
'storage_key': 'irrelevant'}, 'recording': {'id': 'rec-1', 'sources':
[{'type': 'file', 'channels': [0], 'source': 'irrelevant'}], 'sampling_rate':
16000, 'num_samples': 160000, 'duration': 10.0}}, 'offset': 0.0}, {'cut':
{'id': 'cut-1', 'start': 0.0, 'duration': 10.0, 'channel': 0, 'supervisions':
[{'id': 'sup-1', 'recording_id': 'irrelevant', 'start': 0.5, 'duration': 6.0,
'channel': 0}, {'id': 'sup-2', 'recording_id': 'irrelevant', 'start': 7.0,
'duration': 2.0, 'channel': 0}], 'features': {'type': 'fbank', 'num_frames':
100, 'num_features': 40, 'frame_shift': 0.01, 'sampling_rate': 16000, 'start':
0.0, 'duration': 10.0, 'storage_type': 'lilcom', 'storage_path': 'irrelevant',
'storage_key': 'irrelevant'}, 'recording': {'id': 'rec-1', 'sources':
[{'type': 'file', 'channels': [0], 'source': 'irrelevant'}], 'sampling_rate':
16000, 'num_samples': 160000, 'duration': 10.0}}, 'offset': 5.0, 'snr': 8}],
'type': 'MixedCut'}
It turns out that pa.array works just fine with a list of those.
> Expose JSON schema inference to Python API
> ------------------------------------------
>
> Key: ARROW-12588
> URL: https://issues.apache.org/jira/browse/ARROW-12588
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Piotr Żelasko
> Priority: Minor
>
> When using `pyarrow.json.read_json()`, the schema is automatically inferred.
> It would be useful to infer the schema from a json that is already loaded in
> memory (i.e. possibly a list of dicts in Python).