jason khadka created ARROW-11473:
------------------------------------

             Summary: Needs a handling for missing columns while reading 
parquet file 
                 Key: ARROW-11473
                 URL: https://issues.apache.org/jira/browse/ARROW-11473
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Python
            Reporter: jason khadka


Currently there is no clean way to handle the error raised when a requested 
column is missing from a parquet file.

If a requested column is missing, the read simply raises an ArrowInvalid error:
{code:java}
columns = ['item1', 'item2', 'item3']  # 'item3' is not in the parquet file

pd.read_parquet(file_name, columns=columns)

> ArrowInvalid: Field named 'item3' not found or not unique in the schema.{code}
There is no clean way to handle this. The ArrowInvalid exception also does not 
carry any structured information that identifies the field name, so the field 
cannot simply be dropped on the next attempt.

Example:

{code:java}
import pandas as pd
from pyarrow.lib import ArrowInvalid

read_columns = ['a', 'b', 'X']
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']})

file_name = '/tmp/my_df.pq'
df.to_parquet(file_name)

# 'X' is not in the file, so the read raises ArrowInvalid
try:
    df = pd.read_parquet(file_name, columns=read_columns)
except ArrowInvalid as e:
    inval = e

print(inval.args)
>("Field named 'X' not found or not unique in the schema.",){code}

You could parse the message above to extract 'X', but that is a fragile 
workaround. It would be great if the exception exposed the field name directly, 
so that you could do, for example:

{code:java}
inval.field 
> 'X'{code}
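
Until something like that exists, the message-parsing workaround mentioned above could look roughly like the sketch below. The helper name missing_field_from_error is hypothetical, and the pattern assumes the message wording shown in the example output, which may change between pyarrow versions.

```python
import re

# Assumption: the message keeps the wording shown in the example above.
_MISSING_FIELD = re.compile(
    r"Field named '(?P<name>.+)' not found or not unique in the schema"
)


def missing_field_from_error(exc):
    """Hypothetical helper: return the field name from the error message, or None."""
    message = str(exc.args[0]) if exc.args else str(exc)
    match = _MISSING_FIELD.search(message)
    return match.group("name") if match else None
```

With the example above, missing_field_from_error(inval) should return 'X'. The result depends entirely on the message text, which is why a dedicated attribute would be more robust.
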
Alternatively, a better feature would be error handling in pyarrow's 
read_table, where something like {{error='ignore'}} could be passed. This would 
then skip the missing columns by checking the schema.
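
As a rough user-side sketch of that behaviour, the schema can be read first and the requested columns filtered against it. The function names split_columns and read_parquet_ignoring_missing are hypothetical and not part of pyarrow; only pyarrow.parquet.read_schema and pyarrow.parquet.read_table are existing calls.

```python
def split_columns(requested, available):
    """Partition requested column names into (present, missing), keeping order."""
    available = set(available)
    present = [name for name in requested if name in available]
    missing = [name for name in requested if name not in available]
    return present, missing


def read_parquet_ignoring_missing(path, columns):
    """Hypothetical helper: read only the requested columns that exist in the file."""
    # Imported here so split_columns stays usable without pyarrow installed.
    import pyarrow.parquet as pq

    # read_schema reads the file metadata rather than the column data.
    schema = pq.read_schema(path)
    present, missing = split_columns(columns, schema.names)
    table = pq.read_table(path, columns=present)
    # Return the missing names too, so the caller can log or act on them.
    return table.to_pandas(), missing
```

For the example file above, read_parquet_ignoring_missing(file_name, read_columns) should give back a DataFrame with columns 'a' and 'b' and report ['X'] as missing. This costs an extra metadata read per file, which a built-in option could avoid.
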



--
This message was sent by Atlassian Jira
(v8.3.4#803005)