Re: [PR] feat: metadata columns [datafusion]

via GitHub Sun, 09 Feb 2025 19:22:04 -0800


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2646833158


   > when I save a table to csv, it will also save rowid into csv. no system will do like this.
   
   My problem is with this statement. I don't think there's a universal 
definition and use case for "system columns". Spark has one. Postgres has 
another. Our system has another.
   
   You use `_rowid` as an example. Is that the `_rowid` within a single file? 
Or is that the `_rowid` of the entire table (similar to Postgres' `ctid`)? I 
think it's reasonable for both to exist and for both to be considered system 
columns. The former does somewhat "lose" its meaning when copied through a 
query from one file to another and it only really makes sense to generate it 
dynamically when reading a file. The latter could be copied from one file to 
another without issues.
   
   In our case we use system columns to speed up access to JSON: we take a row 
with json data such as `json_col: text = [{"a": 1, "b": "lorem"}, {"a": 2}]` 
and split it into `_lf__json_col__a: int = [1, 2]` and `__lf__json_col__b: 
text = ["lorem", null]`. This is a well-known technique; it's basically [what 
ClickHouse 
does](https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse).
We write these to files (they are not dynamically generated) and want them to 
be treated as normal columns when reading/writing. We just don't want them to 
show up when a user does `select *`. Is this not a valid use case for system 
columns?
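   
   A minimal sketch of the splitting step described above, using only the Rust standard library. The `Value` enum, the `Row` alias, the `shred` function, and the single-underscore `_lf__` prefix are all assumptions made for illustration (the comment spells the prefix two ways); this is not the actual implementation.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a JSON scalar; real JSON has more variants.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Text(String),
}

/// One JSON object per row: key -> scalar value.
type Row = BTreeMap<String, Value>;

/// Split the objects of one JSON column into one column per key.
/// Rows that lack a key get `None` (null) in that key's column.
/// The `_lf__<column>__<key>` naming is an assumed convention.
fn shred(column: &str, rows: &[Row]) -> BTreeMap<String, Vec<Option<Value>>> {
    let mut out: BTreeMap<String, Vec<Option<Value>>> = BTreeMap::new();
    for (i, row) in rows.iter().enumerate() {
        for (key, value) in row {
            let name = format!("_lf__{}__{}", column, key);
            // A key first seen at row i is back-filled with i nulls.
            let col = out.entry(name).or_insert_with(|| vec![None; i]);
            col.push(Some(value.clone()));
        }
        // Pad every column whose key was absent from this row.
        for col in out.values_mut() {
            if col.len() == i {
                col.push(None);
            }
        }
    }
    out
}
```

   In this sketch the shredded columns are ordinary data that would be written to the file alongside `json_col`; nothing about them is generated at read time.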
   
   My thought is to establish a piece of metadata marking a column as a system 
column with the implementation doing nothing beyond excluding them from `select 
*` unless they are explicitly included. That seems to me like a universally 
agreed upon thing to do with system columns. Anything else that is not part of 
a universal definition of a system column is IMO something that should be 
implemented system by system by rewriting logical plans, customizing reading 
and writing, etc. Having it as field metadata means this information should be 
accessible from most hook points in DataFusion.
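   
   The proposal above can be sketched roughly as follows, again with the standard library only. The `Field` struct is a simplified stand-in for a schema field with string metadata, and the `system_column` key, `expand_wildcard`, and `project` are hypothetical names, not existing DataFusion or Arrow APIs.

```rust
use std::collections::HashMap;

/// Hypothetical metadata key; not an actual DataFusion or Arrow constant.
const SYSTEM_COLUMN_KEY: &str = "system_column";

/// Simplified stand-in for a schema field carrying key/value metadata.
#[derive(Debug, Clone)]
struct Field {
    name: String,
    metadata: HashMap<String, String>,
}

impl Field {
    /// An ordinary column with no metadata.
    fn new(name: &str) -> Self {
        Field { name: name.to_string(), metadata: HashMap::new() }
    }

    /// A column marked as a system column via field metadata.
    fn system(name: &str) -> Self {
        let mut field = Field::new(name);
        field
            .metadata
            .insert(SYSTEM_COLUMN_KEY.to_string(), "true".to_string());
        field
    }

    fn is_system(&self) -> bool {
        self.metadata
            .get(SYSTEM_COLUMN_KEY)
            .map(|v| v == "true")
            .unwrap_or(false)
    }
}

/// Expand `select *`: every field except those marked as system columns.
fn expand_wildcard(schema: &[Field]) -> Vec<&str> {
    schema
        .iter()
        .filter(|f| !f.is_system())
        .map(|f| f.name.as_str())
        .collect()
}

/// Resolve an explicit projection: system columns are returned when named.
fn project<'a>(schema: &'a [Field], names: &[&str]) -> Vec<&'a str> {
    names
        .iter()
        .filter_map(|n| schema.iter().find(|f| f.name == *n))
        .map(|f| f.name.as_str())
        .collect()
}
```

   The only special behavior here is in `expand_wildcard`; `project` treats a system column like any other, which matches the idea that everything beyond `select *` exclusion is left to each system.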


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

