[
https://issues.apache.org/jira/browse/ARROW-10344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217350#comment-17217350
]
Joris Van den Bossche commented on ARROW-10344:
-----------------------------------------------
[~ghuls] good question; this is not well documented.
It is possible (at least since pyarrow 1.0), but not directly through
the {{pyarrow.feather}} module.
Two options:
1) Since a feather file is basically the IPC serialization format written to a
file, you can use the {{pyarrow.ipc}} functionality to interact with it (see
http://arrow.apache.org/docs/python/ipc.html#writing-and-reading-random-access-files).
Small example:
{code:python}
# writing a small file
import pyarrow as pa
from pyarrow import feather
table = pa.table({'a': [1, 2, 3], 'b': [.1, .2, .3]})
feather.write_feather(table, "data.feather")
In [13]: import pyarrow.ipc
In [14]: reader = pa.ipc.open_file("data.feather")
In [15]: reader
Out[15]: <pyarrow.ipc.RecordBatchFileReader at 0x7fe6d3d51798>
In [16]: reader.schema
Out[16]:
a: int64
b: double
{code}
2) Use the new Datasets API (http://arrow.apache.org/docs/python/dataset.html):
{code:python}
In [17]: import pyarrow.dataset as ds
In [18]: dataset = ds.dataset("data.feather", format="feather")
In [19]: dataset.schema
Out[19]:
a: int64
b: double
In [20]: dataset.to_table().to_pandas()
Out[20]:
   a    b
0  1  0.1
1  2  0.2
2  3  0.3
{code}
In addition, the Datasets API allows you to filter rows directly with an
expression while reading
(http://arrow.apache.org/docs/python/dataset.html#filtering-data), and it can
also read from a collection of (partitioned) files at once.
---
Both options require Feather version 2 files
(https://ursalabs.org/blog/2020-feather-v2/), so if you have been using
Feather for a while (and have version 1 files), it might be worth
converting those.
> [Python] Get all columns names (or schema) from Feather file, before loading
> whole Feather file
> ------------------------------------------------------------------------------------------------
>
> Key: ARROW-10344
> URL: https://issues.apache.org/jira/browse/ARROW-10344
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Affects Versions: 1.0.1
> Reporter: Gert Hulselmans
> Priority: Major
>
> Is there a way to get all column names (or schema) from a Feather file before
> loading the full Feather file?
> My Feather files are big (like 100GB) and the names of the columns are
> different per analysis and can't be hard coded.
> {code:python}
> import pyarrow.feather as feather
> # Code here to check which columns are in the feather file.
> ...
> my_columns = ...
> # Result is pandas.DataFrame
> read_df = feather.read_feather('/path/to/file', columns=my_columns)
> # Result is pyarrow.Table
> read_arrow = feather.read_table('/path/to/file', columns=my_columns)
> {code}