[jira] [Commented] (ARROW-10344) [Python] Get all columns names (or schema) from Feather file, before loading whole Feather file

Joris Van den Bossche (Jira) Wed, 11 Nov 2020 08:52:59 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-10344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17230091#comment-17230091
 ]


Joris Van den Bossche commented on ARROW-10344:
-----------------------------------------------

bq. .. or are there plans to fix that soon?

See my new comment on that issue. TL;DR: I don't think it is solvable in general.

bq. Could reading the metadata schema for Feather v1 also be supported

I think that is technically possible. The C++ Reader interface 
already exposes a {{schema()}} function, but it is not exposed in Python. I 
suppose this would also be nice to have for V2 in the {{pyarrow.feather}} 
module.

bq. We need the final data to be readable from Python and R, so Feather looked 
like a good choice.

That's indeed one of the selling points of Feather, and I also didn't find any 
up-to-date R interface for zarr.
I think it might still be worth looking at other options (given the inherent 
limitation for V2 mentioned above). I don't have any experience with it myself, 
but it might also be worth taking a look at TileDB.

If you want to stay with arrow/feather files, one other alternative is to use a 
"trick" of putting all columns (of the same type) in a FixedSizeList column 
(the data under the hood is then stored in a contiguous array, which can be 
easily "viewed" as a 2D array). However, then you can no longer read only a 
subset of the columns, which might be an important use case.






> [Python]  Get all columns names (or schema) from Feather file, before loading 
> whole Feather file
> ------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10344
>                 URL: https://issues.apache.org/jira/browse/ARROW-10344
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Gert Hulselmans
>            Priority: Major
>
> Is there a way to get all column names (or schema) from a Feather file before 
> loading the full Feather file?
> My Feather files are big (like 100GB) and the names of the columns are 
> different per analysis and can't be hard coded.
> {code:python}
> import pyarrow.feather as feather
> # Code here to check which columns are in the feather file.
> ...
> my_columns = ...
> # Result is pandas.DataFrame
> read_df = feather.read_feather('/path/to/file', columns=my_columns)
> # Result is pyarrow.Table
> read_arrow = feather.read_table('/path/to/file', columns=my_columns)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
