kevinwilfong commented on PR #10697: URL: https://github.com/apache/incubator-gluten/pull/10697#issuecomment-3325592859
> Thanks for the PR and detailed PR description. While I am a bit curious about the alignment with vanilla Spark - IIRC Spark hasn't yet supported index-based schema evolution? Are there any similar enhancements like this for your internal JVM Spark?

Thanks for taking a look @zhztheplayer. Admittedly, most of my past experience outside of Velox comes from Hive (many years ago) and Presto, both of which supported it (if Spark doesn't support it, I'm not sure what the current state of it is in Hive). It looks like you're right: Spark only supports a flavor of it for ORC https://issues.apache.org/jira/browse/SPARK-32864

Internally it looks like we're using our own Java readers/writers. Fortunately, with Gluten/Velox we don't need to do that as long as we can configure it correctly.

I was planning to reuse most of the code in this change for supporting Text files (Gluten already supports this with ClickHouse); we need to be able to propagate the data schema to the readers for that. The configs are really the only thing not necessarily aligned with vanilla Spark.

With the changes to propagate the table schema (for Text file), I could see extending VeloxIteratorApi internally and overriding setFileSchemaForLocalFiles to enable passing the schema for DwrfReadFormat (which admittedly is what I'm mostly interested in). But the connector configs in C++ don't seem very extensible at the moment. I think I would have to fork ConfigExtractor, VeloxBackend, or WholeStageResultIterator to get the value of the kOrcUseColumnNames config set to what I need.

If you don't want those configs explicitly plumbed, a more generic way I can think of to handle this might be to add a config to Gluten whose value is a comma-separated list of configs to extract and add to the connector config. That way we could plumb the configs through the C++ code without forking the files.

Let me know what you think.
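For illustration only, the generic option described in the comment (a Gluten config whose value is a comma-separated list of configs to extract and add to the connector config) could look roughly like the sketch below. Everything here is an assumption rather than existing Gluten or Velox code: the key name `kPassThroughConfigs`, the `ConfigMap` alias, and the helper names `splitConfigList` and `addPassThroughConfigs` are hypothetical, and plain `std::unordered_map` stands in for whatever config types the real code uses.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for the session and connector config containers (assumption).
using ConfigMap = std::unordered_map<std::string, std::string>;

// Hypothetical key name; not an existing Gluten config.
constexpr const char* kPassThroughConfigs =
    "spark.gluten.example.connector.passThroughConfigs";

// Strip leading and trailing spaces/tabs from a single list entry.
std::string trim(const std::string& s) {
  const auto first = s.find_first_not_of(" \t");
  if (first == std::string::npos) {
    return "";
  }
  const auto last = s.find_last_not_of(" \t");
  return s.substr(first, last - first + 1);
}

// Split a comma-separated value into trimmed, non-empty keys.
std::vector<std::string> splitConfigList(const std::string& value) {
  std::vector<std::string> keys;
  std::size_t start = 0;
  while (start <= value.size()) {
    std::size_t end = value.find(',', start);
    if (end == std::string::npos) {
      end = value.size();
    }
    std::string key = trim(value.substr(start, end - start));
    if (!key.empty()) {
      keys.push_back(key);
    }
    start = end + 1;
  }
  return keys;
}

// Copy every listed key that is present in the session conf into the
// connector conf. Listed keys that are not set in the session are skipped.
void addPassThroughConfigs(const ConfigMap& sessionConf, ConfigMap& connectorConf) {
  const auto listIt = sessionConf.find(kPassThroughConfigs);
  if (listIt == sessionConf.end()) {
    return;
  }
  for (const auto& key : splitConfigList(listIt->second)) {
    const auto it = sessionConf.find(key);
    if (it != sessionConf.end()) {
      connectorConf[key] = it->second;
    }
  }
}
```

With something along these lines, a deployment could list the reader configs it cares about in the one pass-through config and have them reach the connector config without forking the extraction code. Whether the copied keys should be validated or restricted to a prefix is left open here.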
