jason khadka created ARROW-11473:
------------------------------------

             Summary: Needs a handling for missing columns while reading parquet file
                 Key: ARROW-11473
                 URL: https://issues.apache.org/jira/browse/ARROW-11473
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Python
            Reporter: jason khadka

Currently there is no clean way to handle the error raised when a requested column is missing from a parquet file. If any column passed in {{columns}} is missing, the read simply raises an ArrowInvalid error:

{code:python}
columns = ['item1', 'item2', 'item3']  # 'item3' is not in the parquet file
pd.read_parquet(file_name, columns=columns)
# ArrowInvalid: Field named 'item3' not found or not unique in the schema.
{code}

The ArrowInvalid exception also does not expose the field name in any structured form, so the caller cannot easily drop that field and retry.

Example:

{code:python}
import pandas as pd
from pyarrow.lib import ArrowInvalid

read_columns = ['a', 'b', 'X']  # 'X' is not in the file
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']})
file_name = '/tmp/my_df.pq'
df.to_parquet(file_name)

try:
    df = pd.read_parquet(file_name, columns=read_columns)
except ArrowInvalid as e:
    inval = e

print(inval.args)
# ("Field named 'X' not found or not unique in the schema.",)
{code}

You could parse the message above to extract 'X', but that is a fragile workaround. It would be great if the exception carried the field name itself, so that you could do, for example:

{code:python}
# proposed attribute, does not exist today
inval.field
# 'X'
{code}

An even better feature would be error handling in pyarrow's read_table, where something like {{error='ignore'}} could be passed. The reader would then check the schema and skip any requested column that is missing.
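
Until something like this exists, the two workarounds mentioned in this issue can be sketched roughly as follows. This is only an illustration: the helper names {{missing_field_from_message}} and {{split_columns}} are hypothetical, and the regular expression assumes the exact message wording quoted in this issue, which may differ between pyarrow versions.

```python
import re

# Assumption: the message wording matches the one quoted in this issue.
_MISSING_FIELD = re.compile(
    r"Field named '(?P<name>.+?)' not found or not unique"
)


def missing_field_from_message(message):
    """Extract the field name from an ArrowInvalid-style message.

    Returns None when the message does not match the expected wording.
    """
    match = _MISSING_FIELD.search(message)
    return match.group("name") if match else None


def split_columns(requested, available):
    """Split requested column names into (present, missing).

    The order of `requested` is preserved in both returned lists.
    """
    available_set = set(available)
    present = [name for name in requested if name in available_set]
    missing = [name for name in requested if name not in available_set]
    return present, missing


# Workaround 1: parse the name out of the error message (fragile).
message = "Field named 'X' not found or not unique in the schema."
print(missing_field_from_message(message))

# Workaround 2: compare against the column names before reading.
present, missing = split_columns(['a', 'b', 'X'], ['a', 'b'])
print(present, missing)
```

In practice, the {{available}} list for the second workaround could come from the file's schema, for instance {{pyarrow.parquet.read_schema(file_name).names}}, after which {{pd.read_parquet(file_name, columns=present)}} reads only the columns that exist. This costs one extra schema read but avoids depending on the error text.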