naveenp2708 commented on PR #54805: URL: https://github.com/apache/spark/pull/54805#issuecomment-4527731907
This PR addresses a specific overflow case in the vectorized reader path. Looking more broadly at `shouldIgnoreCorruptFileException`: the method name suggests corruption-related exceptions, but the current implementation matches any `RuntimeException`, `IOException`, or `InternalError` whenever `ignoreCorruptFiles=true`. Even with that flag enabled, exceptions such as NPEs from reader code paths, schema mismatches during decode, or transient network-related `IOException`s like `SocketTimeoutException` do not necessarily indicate file corruption, yet the current logic still silently skips the file. Would it make sense to move toward an allow-list approach that only ignores known corruption-related exceptions/signatures, instead of continuing to add narrow exclusions to a very broad match?
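To make the allow-list idea concrete, here is a rough sketch in plain Java using only JDK types. Everything in it is hypothetical: the class name `CorruptionAllowList`, the method `isLikelyCorruption`, and the particular exception types chosen are illustrative assumptions, not Spark's actual code or a proposed final list.

```java
import java.io.EOFException;
import java.io.StreamCorruptedException;
import java.util.List;
import java.util.zip.ZipException;

// Hypothetical helper for illustration only; not part of Spark.
final class CorruptionAllowList {

    // Assumed allow-list: JDK exception types that usually point at
    // truncated or malformed data. A real list would need review and
    // would likely include format-specific decode exceptions as well.
    private static final List<Class<? extends Throwable>> ALLOWED = List.of(
        EOFException.class,
        StreamCorruptedException.class,
        ZipException.class);

    // Upper bound on how far the cause chain is followed, which also
    // guards against cyclic cause chains.
    private static final int MAX_CAUSE_DEPTH = 16;

    private CorruptionAllowList() {}

    // Returns true only if the throwable, or one of its causes, is an
    // instance of an allow-listed type. Anything else is not treated
    // as corruption, so the caller would rethrow it.
    static boolean isLikelyCorruption(Throwable t) {
        int depth = 0;
        for (Throwable cur = t; cur != null && depth < MAX_CAUSE_DEPTH;
                cur = cur.getCause(), depth++) {
            for (Class<? extends Throwable> allowed : ALLOWED) {
                if (allowed.isInstance(cur)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With a check of this shape, a bare `NullPointerException`, a `SocketTimeoutException`, or a generic `IOException` would not match and would surface to the user, while a `RuntimeException` wrapping a `ZipException` would still be skipped when `ignoreCorruptFiles=true`. The trade-off is that any corruption signature missing from the list becomes a hard failure until it is added.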
