[ https://issues.apache.org/jira/browse/ARROW-10344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17230091#comment-17230091 ]
Joris Van den Bossche commented on ARROW-10344:
-----------------------------------------------
bq. .. or are there plans to fix that soon?
See my new comment on that issue. TL;DR: I don't think it is solvable in
general.
bq. Could reading the metadata schema for Feather v1 also be supported
That is certainly technically possible. The C++ Reader interface already
exposes a {{schema()}} function, but it is not exposed in Python. I suppose
this would also be nice to have for V2 in the {{pyarrow.feather}} module.
bq. We need the final data to be readable from Python and R, so Feather looked
like a good choice.
That's indeed one of the selling points of Feather, and I also couldn't find
an up-to-date R interface for zarr.
I think it might still be worth looking at other options (given the inherent
limitation for V2 mentioned above). I don't have any experience with it myself,
but it might also be worth taking a look at TileDB.
If you want to stay with arrow/feather files, one alternative is the "trick"
of putting all columns (of the same type) in a single FixedSizeList column
(the data under the hood is then stored in a contiguous array, which can
easily be "viewed" as a 2D array). However, you can then no longer read only a
subset of the columns, which might be an important use case.
> [Python] Get all columns names (or schema) from Feather file, before loading
> whole Feather file
> ------------------------------------------------------------------------------------------------
>
> Key: ARROW-10344
> URL: https://issues.apache.org/jira/browse/ARROW-10344
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Affects Versions: 1.0.1
> Reporter: Gert Hulselmans
> Priority: Major
>
> Is there a way to get all column names (or schema) from a Feather file before
> loading the full Feather file?
> My Feather files are big (like 100GB) and the names of the columns are
> different per analysis and can't be hard coded.
> {code:python}
> import pyarrow.feather as feather
> # Code here to check which columns are in the feather file.
> ...
> my_columns = ...
> # Result is pandas.DataFrame
> read_df = feather.read_feather('/path/to/file', columns=my_columns)
> # Result is pyarrow.Table
> read_arrow = feather.read_table('/path/to/file', columns=my_columns)
> {code}