[GitHub] [arrow] jasonkhadka edited a comment on issue #9194: Needs a handling for missing columns in parquet file

GitBox Fri, 15 Jan 2021 03:27:02 -0800


jasonkhadka edited a comment on issue #9194:
URL: https://github.com/apache/arrow/issues/9194#issuecomment-760857808



   > Like a keyword to indicate that missing columns in this list can be ignored instead of raising an error?
   
   Yes, an `error='ignore'` keyword would be a perfect solution.
   
   
   > The name of the missing field is in the error message?
   
   The name of the missing field is in the error message. But getting the field name out of the error, so that you can drop it from the list of columns and retry reading the parquet file, is difficult.
   The error only carries the message string. It would be great if the error also exposed the field name as a property, so that error handling could be built on it.
   
   
   Example:
   
   
   ```
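   import pandas as pd
   from pyarrow.lib import ArrowInvalid
   
   # 'X' is requested but does not exist in the file written below
   read_columns = ['a', 'b', 'X']
   
   df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'jar']})
   file_name = '/tmp/my_df.pq'
   df.to_parquet(file_name)
   
   
   try:
       df = pd.read_parquet(file_name, columns=read_columns)
   except ArrowInvalid as e:
       # keep the exception around so it can be inspected afterwards
       inval = e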
   ```
   ```
   inval.args
   >("Field named 'X' not found or not unique in the schema.",)
   ```
   
   You could parse the message above to get 'X', but that is a bit of a hacky solution. It would be great if the error object itself exposed the field name, so you could do, for example:
   
   ```
   inval.field
   > 'X'
   ```
   And with this, one could remove 'X' from the list `read_columns` and then retry reading the parquet file.
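
For reference, the message-parsing workaround mentioned above could look roughly like the sketch below. It is only an illustration: the helper names `missing_field_from_error` and `read_ignoring_missing` are hypothetical, and the regular expression assumes the exact message wording shown in the example output, which may differ between pyarrow versions.

```python
import re

# Assumption: the message looks like
# "Field named 'X' not found or not unique in the schema."
_MISSING_FIELD_RE = re.compile(r"Field named '(?P<name>.+?)' not found")


def missing_field_from_error(exc):
    """Return the field name quoted in the error message, or None."""
    message = str(exc.args[0]) if exc.args else str(exc)
    match = _MISSING_FIELD_RE.search(message)
    return match.group("name") if match else None


def read_ignoring_missing(read, columns, error_type=Exception):
    """Call read(columns), dropping columns reported as missing and retrying.

    `read` is any callable that takes a list of column names, e.g.
    lambda cols: pd.read_parquet(file_name, columns=cols)
    """
    remaining = list(columns)
    while remaining:
        try:
            return read(remaining)
        except error_type as exc:
            name = missing_field_from_error(exc)
            # Re-raise anything we cannot attribute to a requested column.
            if name is None or name not in remaining:
                raise
            remaining.remove(name)
    raise ValueError("none of the requested columns exist")
```

With the example above, this would be called as `read_ignoring_missing(lambda cols: pd.read_parquet(file_name, columns=cols), read_columns, ArrowInvalid)`. It works, but it depends on the message text staying stable, which is exactly why a dedicated attribute on the error (or an `error='ignore'` keyword) would be preferable.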


