[ https://issues.apache.org/jira/browse/ARROW-9455 ]
Joris Van den Bossche closed ARROW-9455.
----------------------------------------
    Resolution: Duplicate

> [Python] add option for taking all columns from all files in pa.dataset
> -----------------------------------------------------------------------
>
>                 Key: ARROW-9455
>                 URL: https://issues.apache.org/jira/browse/ARROW-9455
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>            Reporter: David Cortes
>            Priority: Minor
>
> In PyArrow's dataset class, if given a list of Parquet files whose columns
> potentially differ, the schema is always taken from the first file in the
> list, so columns that the first file lacks are ignored. Getting all columns
> from all the files into the same dataset requires passing a schema manually,
> or constructing one by iterating over the files and inspecting their columns.
>
> It would be nicer if PyArrow's dataset class had an option to automatically
> take all columns from the files it is constructed from.
>
> {code:python}
> import numpy as np, pandas as pd
>
> df1 = pd.DataFrame({
>     "col1": np.arange(10),
>     "col2": np.random.choice(["a", "b"], size=10)
> })
> df2 = pd.DataFrame({
>     "col1": np.arange(10, 20),
>     "col3": np.random.random(size=10)
> })
> df1.to_parquet("df1.parquet")
> df2.to_parquet("df2.parquet")
> {code}
> {code:python}
> import pyarrow.dataset as pds
>
> ff = ["df1.parquet", "df2.parquet"]
>
> # The line below will generate a DF with col1 and col2, but no col3
> pds.dataset(ff, format="parquet").to_table().to_pandas()
> {code}
>
>

-- This message was sent by Atlassian Jira (v8.3.4#803005)