Re: [PR] [VL] Support mapping columns by index for ORC and Parquet files [incubator-gluten]

via GitHub Sat, 18 Oct 2025 17:10:18 -0700


kevinwilfong commented on PR #10697:
URL: 
https://github.com/apache/incubator-gluten/pull/10697#issuecomment-3325592859


   > Thanks for the PR and detailed PR description. While I am a bit curious 
about the alignment with vanilla Spark - IIRC Spark hasn't yet supported 
index-based schema evolution? Are there any similar enhancements like this for 
your internal JVM Spark?
   
   Thanks for taking a look @zhztheplayer. Admittedly, most of my past 
experience outside of Velox comes from Hive (many years ago) and Presto, both 
of which supported index-based mapping (if Spark doesn't support it, I'm not 
sure what its current state is in Hive). It looks like you're right: Spark only 
supports a flavor of it for ORC 
(https://issues.apache.org/jira/browse/SPARK-32864). Internally, it looks like 
we're using our own Java readers/writers. Fortunately, with Gluten/Velox we 
don't need to do that as long as we can configure it correctly.
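
   For context, here is a minimal sketch of how name-based and index-based 
column mapping differ when a table column has been renamed after a file was 
written. It is illustrative only: the function names and the sample schemas are 
assumptions made up for this example, not code from Gluten, Velox, or Spark.

```python
from typing import Dict, List, Optional


def map_by_name(table_cols: List[str], file_cols: List[str]) -> Dict[str, Optional[int]]:
    """Map each table column to the file column that has the same name."""
    positions = {name: i for i, name in enumerate(file_cols)}
    # Columns missing from the file map to None (they would read as NULL).
    return {col: positions.get(col) for col in table_cols}


def map_by_index(table_cols: List[str], file_cols: List[str]) -> Dict[str, Optional[int]]:
    """Map table column i to file column i, ignoring the names in the file."""
    return {
        col: (i if i < len(file_cols) else None)
        for i, col in enumerate(table_cols)
    }


# Hypothetical schemas: "b" was renamed to "b_renamed" and "c" was appended
# to the table after the file was written.
table_cols = ["a", "b_renamed", "c"]
file_cols = ["a", "b"]

by_name = map_by_name(table_cols, file_cols)
by_index = map_by_index(table_cols, file_cols)
```

   In this sketch, name-based mapping loses the renamed column (it finds no 
"b_renamed" in the file), while index-based mapping still reads it from 
position 1. Both treat the appended column as missing.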
   
   I was planning to reuse most of the code in this change to support Text 
files (Gluten already supports these with ClickHouse); for that, we need to be 
able to propagate the data schema to the readers. The configs are really the 
only thing not necessarily aligned with vanilla Spark.
   
   With the changes to propagate the table schema (for Text files), I could see 
extending VeloxIteratorApi internally and overriding 
setFileSchemaForLocalFiles to enable passing the schema for DwrfReadFormat 
(which admittedly is what I'm mostly interested in).
   
   But the connector configs in C++ don't seem very extensible at the moment. I 
think I would have to fork ConfigExtractor, VeloxBackend, or 
WholeStageResultIterator to get the value of the kOrcUseColumnNames config set 
to what I need.
   
   If you don't want those configs explicitly plumbed, a more generic way I can 
think of to handle this might be to add a config to Gluten whose value is a 
comma-separated list of configs to extract and add to the connector config. 
That way we could plumb the configs through the C++ code without forking the 
files.
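
   A rough sketch of that generic passthrough idea, written in Python only to 
keep it short (the real change would live in Gluten's C++ config extraction). 
Every name here is hypothetical: the passthrough key, the example config keys, 
and the function are assumptions for illustration, not existing Gluten or Velox 
identifiers.

```python
from typing import Dict

# Hypothetical Gluten config whose value lists the configs to forward.
PASSTHROUGH_KEY = "spark.gluten.example.connectorConfPassthrough"


def extract_connector_conf(session_conf: Dict[str, str]) -> Dict[str, str]:
    """Copy the configs named in PASSTHROUGH_KEY into a connector config map."""
    listed = session_conf.get(PASSTHROUGH_KEY, "")
    # Split on commas, tolerate surrounding whitespace, skip empty entries.
    names = [name.strip() for name in listed.split(",") if name.strip()]
    # Forward only the listed configs that are actually set in the session.
    return {name: session_conf[name] for name in names if name in session_conf}


# Hypothetical session config: two keys are listed, only one of them is set.
session_conf = {
    PASSTHROUGH_KEY: "example.orc.use-column-names, example.parquet.use-column-names",
    "example.orc.use-column-names": "false",
    "spark.sql.shuffle.partitions": "200",
}

connector_conf = extract_connector_conf(session_conf)
```

   In this sketch, unlisted configs are never forwarded and listed-but-unset 
configs are skipped, so the connector would fall back to its own defaults for 
them.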
   
   Let me know what you think.

