adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2646833158
> when I save a table to csv, it will also save rowid into csv. no system
> will do like this.
My problem is with this statement. I don't think there's a universal
definition and use case for "system columns". Spark has one. Postgres has
another. Our system has another.
You use `_rowid` as an example. Is that the `_rowid` within a single file?
Or is that the `_rowid` of the entire table (similar to Postgres' `ctid`)? I
think it's reasonable for both to exist and for both to be considered system
columns. The former does somewhat lose its meaning when copied through a
query from one file to another and it only really makes sense to generate it
dynamically when reading a file. The latter could be copied from one file to
another without issues.
In our case we use system columns to speed up access to JSON: we take a row
with json data such as `json_col: text = [{"a": 1, "b": "lorem"}, {"a": 2}]`
and split it into `_lf__json_col__a: int = [1, 2]` and
`__lf__json_col__b: text = ["lorem", null]`. This is a well-known technique; it's
basically [what ClickHouse does](https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse).
We write these to files (they are not dynamically generated) and want them to
be treated as normal columns when reading/writing. We just don't want them to
show up when a user does `select *`. Is this not a valid use case for system
columns?
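
To make the splitting concrete, here is a minimal Python sketch of the idea, not our actual implementation. It assumes each row of `json_col` holds one JSON object serialized as text, and the helper name `shred_json_column` and the `_lf__` prefix argument are hypothetical, chosen only for illustration.

```python
import json


def shred_json_column(name, rows, prefix="_lf__"):
    """Split a column of JSON-object strings into one derived column per key.

    `name` is the source column name, `rows` holds one JSON object per row.
    Keys missing from a row become None (null) in the derived column.
    """
    docs = [json.loads(row) for row in rows]

    # Collect keys in first-seen order so the output is deterministic.
    keys = []
    for doc in docs:
        for key in doc:
            if key not in keys:
                keys.append(key)

    return {
        f"{prefix}{name}__{key}": [doc.get(key) for doc in docs]
        for key in keys
    }


rows = ['{"a": 1, "b": "lorem"}', '{"a": 2}']
columns = shred_json_column("json_col", rows)
```

The derived columns are plain data: once computed they can be written to a file next to `json_col` and read back like any other column.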
My thought is to establish a piece of metadata marking a column as a system
column with the implementation doing nothing beyond excluding them from
`select *` unless they are explicitly included. That seems to me like a
universally agreed-upon thing to do with system columns. Anything else that is not part of
a universal definition of a system column is IMO something that should be
implemented system by system by rewriting logical plans, customizing reading
and writing, etc. Having it as field metadata means this information should be
accessible from most hook points in DataFusion.
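
A rough sketch of the behavior I have in mind, written in plain Python rather than against DataFusion's API. The `Field` class, the `is_system_column` metadata key, and `expand_projection` are all hypothetical names used only to illustrate the rule: `*` skips fields flagged as system columns, while naming such a column explicitly still selects it.

```python
from dataclasses import dataclass, field

# Hypothetical metadata key; the real key name would be decided in the PR.
SYSTEM_COLUMN_KEY = "is_system_column"


@dataclass
class Field:
    name: str
    metadata: dict = field(default_factory=dict)


def is_system(f):
    """A field is a system column only if its metadata says so."""
    return f.metadata.get(SYSTEM_COLUMN_KEY) == "true"


def expand_projection(schema, projection):
    """Expand a projection list; '*' yields only the non-system columns."""
    names = []
    for item in projection:
        if item == "*":
            names.extend(f.name for f in schema if not is_system(f))
        else:
            # Explicitly named columns are always included, system or not.
            names.append(item)
    return names


schema = [
    Field("id"),
    Field("json_col"),
    Field("_lf__json_col__a", {SYSTEM_COLUMN_KEY: "true"}),
]

wildcard_only = expand_projection(schema, ["*"])
with_explicit = expand_projection(schema, ["*", "_lf__json_col__a"])
```

Nothing else about the column changes: it is stored, read, and written like any other field, and only the wildcard expansion consults the flag.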