[ https://issues.apache.org/jira/browse/ARROW-10344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218187#comment-17218187 ]
Gert Hulselmans commented on ARROW-10344:
-----------------------------------------

We need the final data to be readable from Python and R, so Feather looked like a good choice.

To create the dataset, the data is generated as follows:
 - 30k columns with 1M entries (data is generated separately for each of the 30k columns):
   - I had previously split this first part up so that each of the 10 Feather files had 3k columns (I can transpose those so that I can use the dataset API).
 - transpose data ==> 1M columns with 30k entries

The transposed data needs to be usable from Python and R: each analysis extracts around 20k of the 1M columns (with all 30k values of each).

> [Python] Get all columns names (or schema) from Feather file, before loading
> whole Feather file
> ------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10344
>                 URL: https://issues.apache.org/jira/browse/ARROW-10344
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Gert Hulselmans
>            Priority: Major
>
> Is there a way to get all column names (or schema) from a Feather file before
> loading the full Feather file?
> My Feather files are big (like 100GB) and the names of the columns are
> different per analysis and can't be hard coded.
> {code:python}
> import pyarrow.feather as feather
> # Code here to check which columns are in the feather file.
> ...
> my_columns = ...
> # Result is pandas.DataFrame
> read_df = feather.read_feather('/path/to/file', columns=my_columns)
> # Result is pyarrow.Table
> read_arrow = feather.read_table('/path/to/file', columns=my_columns)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)