[ https://issues.apache.org/jira/browse/ARROW-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche closed ARROW-9455.
----------------------------------------
    Resolution: Duplicate

> [Python] add option for taking all columns from all files in pa.dataset
> -----------------------------------------------------------------------
>
>                 Key: ARROW-9455
>                 URL: https://issues.apache.org/jira/browse/ARROW-9455
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>            Reporter: David Cortes
>            Priority: Minor
>
> In PyArrow's dataset class, if I give it a list of multiple parquet files 
> that potentially have different columns, it always takes the schema from 
> the first parquet file in the list, thus ignoring columns that the first 
> file doesn't have. Getting all columns from all files into the same dataset 
> requires passing a manual schema or constructing one by iterating over the 
> files and checking their columns.
>  
> It would be nicer if PyArrow's dataset class had an option to 
> automatically take all columns from the files from which it is constructed.
> {code:python}
> import numpy as np
> import pandas as pd
>
> df1 = pd.DataFrame({
>     "col1": np.arange(10),
>     "col2": np.random.choice(["a", "b"], size=10),
> })
> df2 = pd.DataFrame({
>     "col1": np.arange(10, 20),
>     "col3": np.random.random(size=10),
> })
> df1.to_parquet("df1.parquet")
> df2.to_parquet("df2.parquet")
> {code}
> {code:python}
> import pyarrow.dataset as pds
>
> ff = ["df1.parquet", "df2.parquet"]
> # The call below generates a DataFrame with col1 and col2, but no col3
> pds.dataset(ff, format="parquet").to_table().to_pandas()
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
