[jira] [Commented] (ARROW-10344) [Python] Get all columns names (or schema) from Feather file, before loading whole Feather file

Joris Van den Bossche (Jira) Mon, 19 Oct 2020 23:56:13 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-10344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217350#comment-17217350
 ]


Joris Van den Bossche commented on ARROW-10344:
-----------------------------------------------

[~ghuls] good question, this is not really well documented. 

This is possible (at least starting with pyarrow 1.0), but not directly with 
the {{pyarrow.feather}} module. 

Two options:

1) Since a feather file is basically the IPC serialization format written to a 
file, you can use the {{pyarrow.ipc}} functionality to interact with it (see 
http://arrow.apache.org/docs/python/ipc.html#writing-and-reading-random-access-files).
  
Small example:

{code:python} 
# writing a small file
import pyarrow as pa
from pyarrow import feather
table = pa.table({'a': [1, 2, 3], 'b': [.1, .2, .3]})
feather.write_feather(table, "data.feather")

In [13]: import pyarrow.ipc

In [14]: reader = pa.ipc.open_file("data.feather")

In [15]: reader
Out[15]: <pyarrow.ipc.RecordBatchFileReader at 0x7fe6d3d51798>

In [16]: reader.schema
Out[16]: 
a: int64
b: double
{code}

2) Use the new Datasets API (http://arrow.apache.org/docs/python/dataset.html):

{code:python}
In [17]: import pyarrow.dataset as ds

In [18]: dataset = ds.dataset("data.feather", format="feather")

In [19]: dataset.schema
Out[19]: 
a: int64
b: double

In [20]: dataset.to_table().to_pandas()
Out[20]: 
   a    b
0  1  0.1
1  2  0.2
2  3  0.3
{code}

In addition, the datasets API also lets you directly filter rows using an 
expression while reading 
(http://arrow.apache.org/docs/python/dataset.html#filtering-data), and it can 
also read from a collection of (partitioned) files at once.

---

Both options require Feather version 2 files 
(https://ursalabs.org/blog/2020-feather-v2/), so if you have been using 
Feather for a while (and still have version 1 files), it might be worth 
converting those.

> [Python]  Get all columns names (or schema) from Feather file, before loading 
> whole Feather file
> ------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10344
>                 URL: https://issues.apache.org/jira/browse/ARROW-10344
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Gert Hulselmans
>            Priority: Major
>
> Is there a way to get all column names (or schema) from a Feather file before 
> loading the full Feather file?
> My Feather files are big (like 100GB) and the names of the columns are 
> different per analysis and can't be hard coded.
> {code:python}
> import pyarrow.feather as feather
> # Code here to check which columns are in the feather file.
> ...
> my_columns = ...
> # Result is pandas.DataFrame
> read_df = feather.read_feather('/path/to/file', columns=my_columns)
> # Result is pyarrow.Table
> read_arrow = feather.read_table('/path/to/file', columns=my_columns)
> {code}



