[ https://issues.apache.org/jira/browse/ARROW-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157213#comment-17157213 ]

Joris Van den Bossche commented on ARROW-9455:
----------------------------------------------

[~david-cortes] thanks for the report!

Yes, currently the schema is inferred from the 'first' file, and you can only 
override this by providing the full schema manually. 

There are, however, options to specify the number of files/fragments to infer 
from, or to infer from all files. This functionality already exists in C++, but 
is not yet exposed in Python; see ARROW-8221 for that. 

Closing this as a duplicate of ARROW-8221; feedback there on a good API 
for this is very welcome!

> [Python] add option for taking all columns from all files in pa.dataset
> -----------------------------------------------------------------------
>
>                 Key: ARROW-9455
>                 URL: https://issues.apache.org/jira/browse/ARROW-9455
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>            Reporter: David Cortes
>            Priority: Minor
>
> In PyArrow's dataset class, if I give it multiple parquet files in a list and 
> these parquet files have potentially different columns, it will always take 
> the schema from the first parquet file in the list, thus ignoring columns 
> that the first file doesn't have. Getting all columns within the files into 
> the same dataset implies passing a manual schema or constructing one by 
> iterating over the files and checking for their columns.
>  
> It would be nicer if PyArrow's dataset class had an option to 
> automatically take all columns from the files from which it is constructed.
> {code:python}
> import numpy as np, pandas as pd
> df1 = pd.DataFrame({
>     "col1" : np.arange(10),
>     "col2" : np.random.choice(["a", "b"], size=10)
> })
> df2 = pd.DataFrame({
>     "col1" : np.arange(10, 20),
>     "col3" : np.random.random(size=10)
> })
> df1.to_parquet("df1.parquet")
> df2.to_parquet("df2.parquet")
> {code}
> {code:python}
> import pyarrow.dataset as pds
> ff = ["df1.parquet", "df2.parquet"]
>
> # The call below will generate a DF with col1 and col2, but no col3
> pds.dataset(ff, format="parquet").to_table().to_pandas()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
