David Cortes created ARROW-9455:
-----------------------------------

             Summary: Request: add option for taking all columns from all files 
in pa.dataset
                 Key: ARROW-9455
                 URL: https://issues.apache.org/jira/browse/ARROW-9455
             Project: Apache Arrow
          Issue Type: Wish
          Components: Python
            Reporter: David Cortes


In PyArrow's dataset class, if I give it a list of parquet files whose columns potentially differ, it will always take the schema from the first file in the list, thus ignoring columns that the first file doesn't have. Getting all columns across the files into the same dataset requires passing a schema manually, or constructing one by iterating over the files and inspecting their columns.

 

It would be nicer if PyArrow's dataset class had an option to automatically take the union of the columns across all files from which it is constructed.
{code:python}
import numpy as np, pandas as pd
df1 = pd.DataFrame({
    "col1" : np.arange(10),
    "col2" : np.random.choice(["a", "b"], size=10)
})
df2 = pd.DataFrame({
    "col1" : np.arange(10, 20),
    "col3" : np.random.random(size=10)
})
df1.to_parquet("df1.parquet")
df2.to_parquet("df2.parquet"){code}
{code:python}
import pyarrow.dataset as pds

ff = ["df1.parquet", "df2.parquet"]

### Code below will generate a DF with col1 and col2, but no col3
pds.dataset(ff, format="parquet").to_table().to_pandas()
{code}
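Until such an option exists, here is a minimal sketch of the manual workaround described above. It assumes {{pa.unify_schemas}} and {{pq.read_schema}} are available (recent pyarrow versions); the file names reuse the example above.

{code:python}
import numpy as np, pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pds

# Re-create the two example files from above
pd.DataFrame({"col1": np.arange(10),
              "col2": np.random.choice(["a", "b"], size=10)}).to_parquet("df1.parquet")
pd.DataFrame({"col1": np.arange(10, 20),
              "col3": np.random.random(size=10)}).to_parquet("df2.parquet")

ff = ["df1.parquet", "df2.parquet"]

# Workaround: read each file's schema, unify them, and pass the result explicitly
schema = pa.unify_schemas([pq.read_schema(f) for f in ff])
df = pds.dataset(ff, schema=schema, format="parquet").to_table().to_pandas()
# df now has col1, col2 and col3; rows from a file lacking a column get nulls
{code}

This is exactly the "iterate over the files" workaround the issue asks to automate.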
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
