pitrou commented on PR #49451:
URL: https://github.com/apache/arrow/pull/49451#issuecomment-3999163862

   > Shouldn't the program already reject the input in this case? What is the 
purpose of the duplicate definition otherwise?
   
   An IPC stream is meant to be read sequentially and therefore has the schema 
appearing at the start of the encoded stream.
   
   An IPC file is basically an IPC stream plus a file footer with dedicated 
metadata for random access (a bit like a ZIP file catalog). The IPC file footer 
contains a copy of the schema, which reduces the number of I/O operations 
needed to start reading the file.
   
   The IPC file reader reads directly from the end of the file, ignoring the 
schema stored at the start of the embedded IPC stream. Validating that the two 
schemas are identical would cost an extra I/O operation, and correct files 
have identical schemas anyway.
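   
   As a rough illustration of "reading from the end", here is a minimal 
stdlib-only sketch that locates the footer using the documented file layout 
(`ARROW1` magic, padding, stream, footer, int32 footer size, `ARROW1` magic). 
The `locate_footer` helper and the toy buffer are hypothetical: the placeholder 
byte strings stand in for real Flatbuffers-encoded messages, and this is not 
how the Arrow C++ reader is implemented.

```python
import struct

MAGIC = b"ARROW1"
TRAILER_LEN = 4 + len(MAGIC)  # int32 footer size + trailing magic


def locate_footer(data: bytes) -> tuple[int, int]:
    """Return (offset, size) of the footer, looking only at the file's tail."""
    if (
        len(data) < 8 + TRAILER_LEN
        or not data.startswith(MAGIC)
        or not data.endswith(MAGIC)
    ):
        raise ValueError("not an Arrow IPC file")
    # The footer size is a little-endian int32 just before the trailing magic.
    (footer_size,) = struct.unpack("<i", data[-TRAILER_LEN:-len(MAGIC)])
    footer_offset = len(data) - TRAILER_LEN - footer_size
    if footer_size <= 0 or footer_offset < 8:
        raise ValueError("invalid footer size")
    return footer_offset, footer_size


# Toy buffer: the payloads are placeholders, not valid Arrow messages.
stream_part = b"<stream: schema + record batches>"
footer_part = b"<footer: schema copy + block offsets>"
toy_file = (
    MAGIC + b"\x00\x00"                      # magic padded to 8 bytes
    + stream_part                            # embedded IPC stream (never read here)
    + footer_part                            # footer with its own schema copy
    + struct.pack("<i", len(footer_part))    # footer size
    + MAGIC
)

offset, size = locate_footer(toy_file)
assert toy_file[offset:offset + size] == footer_part
```

   Note that nothing in this path ever touches `stream_part`, which is why a 
mismatch between the two schema copies goes unnoticed by a file reader.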
   
   > Also, would it be possible to "fix" the footer to match the original 
schema?
   
   Ah, you mean use the same schema when comparing the contents? There's no way 
to tell the IPC file reader API to use a different schema for reading, because 
it doesn't make sense with valid IPC files.
   
   Moreover, in some cases the different schema will not matter because only a 
field name changed, but as soon as something more significant has changed (for 
example a field type or an additional field), passing the wrong schema to the 
reader will either fail or return gibberish.
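   
   The "fail or return gibberish" outcome can be shown with a small analogy 
using only the `struct` module. This is not the Arrow reader, just a sketch of 
what happens when a buffer written under one type description is decoded under 
another; the variable names are made up for the example.

```python
import struct

# 16 bytes written as four little-endian int32 values.
raw = struct.pack("<4i", 1, 2, 3, 4)

# Decoding with the matching layout round-trips.
as_int32 = struct.unpack("<4i", raw)

# Decoding the same bytes as two int64 values "works" but yields nonsense.
as_int64 = struct.unpack("<2q", raw)

# Decoding as three int64 values needs 24 bytes, so it fails outright.
try:
    struct.unpack("<3q", raw)
    error = None
except struct.error as exc:
    error = exc

print(as_int32)   # (1, 2, 3, 4)
print(as_int64)   # (8589934593, 17179869187)
print(error is not None)  # True
```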


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
