[ https://issues.apache.org/jira/browse/ARROW-10344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217350#comment-17217350 ]
Joris Van den Bossche commented on ARROW-10344:
-----------------------------------------------

[~ghuls] good question, this is not really well documented. This is possible (at least starting with pyarrow 1.0), but not directly with the {{pyarrow.feather}} module. Two options:

1) Since a Feather file is basically the IPC serialization format written to a file, you can use the {{pyarrow.ipc}} functionality to interact with it (see http://arrow.apache.org/docs/python/ipc.html#writing-and-reading-random-access-files). Small example:

{code:python}
# writing a small file
import pyarrow as pa
from pyarrow import feather
table = pa.table({'a': [1, 2, 3], 'b': [.1, .2, .3]})
feather.write_feather(table, "data.feather")

In [13]: import pyarrow.ipc

In [14]: reader = pa.ipc.open_file("data.feather")

In [15]: reader
Out[15]: <pyarrow.ipc.RecordBatchFileReader at 0x7fe6d3d51798>

In [16]: reader.schema
Out[16]:
a: int64
b: double
{code}

2) Use the new Datasets API (http://arrow.apache.org/docs/python/dataset.html):

{code:python}
In [17]: import pyarrow.dataset as ds

In [18]: dataset = ds.dataset("data.feather", format="feather")

In [19]: dataset.schema
Out[19]:
a: int64
b: double

In [20]: dataset.to_table().to_pandas()
Out[20]:
   a    b
0  1  0.1
1  2  0.2
2  3  0.3
{code}

In addition, the Datasets API also allows you to filter rows directly using an expression while reading (http://arrow.apache.org/docs/python/dataset.html#filtering-data), and it can also read from a collection of (partitioned) files at once.

---

For both options you need Feather version 2 files (https://ursalabs.org/blog/2020-feather-v2/), so if you have been using Feather for a longer time (and have version 1 files), it might be worth converting those.
> [Python] Get all columns names (or schema) from Feather file, before loading
> whole Feather file
> ------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10344
>                 URL: https://issues.apache.org/jira/browse/ARROW-10344
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Gert Hulselmans
>            Priority: Major
>
> Is there a way to get all column names (or schema) from a Feather file before
> loading the full Feather file?
> My Feather files are big (like 100GB) and the names of the columns are
> different per analysis and can't be hard coded.
> {code:python}
> import pyarrow.feather as feather
> # Code here to check which columns are in the feather file.
> ...
> my_columns = ...
> # Result is pandas.DataFrame
> read_df = feather.read_feather('/path/to/file', columns=my_columns)
> # Result is pyarrow.Table
> read_arrow = feather.read_table('/path/to/file', columns=my_columns)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)