Re: [PR] feat: metadata columns [datafusion]
alamb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2845795660 Thank you @chenkovsky and @adriangb and everyone else who worked on this PR. I think the idea of additional metadata columns for LIstingTable provider is super valuable but it seems like this particular PR adds more complexity than we are willing to handle in the core. I wonder if we could/should create a more full featured equivalent of LIstingTable in another repo / project -- I think there are many interesting ideas / potential things to add -- for example - https://github.com/apache/datafusion/issues/15892 - https://github.com/apache/datafusion/pull/15181 from @phillipleblanc I apologize for letting it sit for so long. This is non ideal to be sure. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2656813742 yep I agree. TLDR Andrew is: 1. There are two approaches, #14057 (this PR) and #14362. 2. Fundamentally it is not super clear how to define the feature set of a "system column". This PR approaches it from the view point of making it behave as similar as possible to Spark. #14362 specifies that if a column has a certain key set in it's `Field`'s metadata it will not be expanded in `SELECT *` (it currently does some other stuff with that metadata, e.g. disambiguating column name clashes, I'm ambivalent about that part). These approaches end up differing in several places. For example as @chenkovsky pointed out, in #14362 if you add the metadata to a field and then write it to a file that metadata is preserved and that column will act as a system column if you do `select * from 'file.parquet'`. That's not necessarily a bad thing but as @chenkovsky has pointed out that differs from how Spark handles it's system columns. 3. This PR requires somewhat invasive changes (my opinion) into `DFSchema`, including changing how field indexes work, which seemed a bit scary to me hence why I wanted to experiment an alternative approach in #14362. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2656443184 > I am sorry I have not been following along this PR closely -- is it ready for a final review @chenkovsky and @adriangb ? @alamb The discussion was very lively. from my opinion, the pros of this PR are: dataframe api friendly, spark compatible, no need to worry about metadata/system column leakage. for #14362, it makes an assumption on project plan which is not correct for dataframe api, and it should check schema of user input file, otherwise the magic metadata dict will be loaded from file. and it should pay more attention to prevent metadata/system column leakage. in this PR, there are no such problems. In part I agree with @adriangb, If I have another choice, I won't use offset. But I think there is no better option. we also want to hear your opinions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
alamb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2656327207 I am sorry I have not been following along this PR closely -- is it ready for a final review @chenkovsky and @adriangb ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2646922625 > > when I save a table to csv, it will also save rowid into csv. no system will do like this. > > My problem is with this statement. I don't think there's a universal definition and use case for "system columns". Spark has one. Postgres has another. Our system has another. sorry for my arbitrary conclusion, I have to add a qualifier to it, "from my dataframe api experience". I learned a lot of postgres from you. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2646850170 > > when I save a table to csv, it will also save rowid into csv. no system will do like this. > > My problem is with this statement. I don't think there's a universal definition and use case for "system columns". Spark has one. Postgres has another. Our system has another. first, the assumption of projection is not correct in dataframe, right? I agree with you. there's no universal definition. But If I'm a logical plan implementer. in this PR, if I want to support system column, I can create another logical plan. If I don't want to support system column. I even don't need to take care of what is system column. But in #14362 , If I want to support system column, it's easy, because it's propagated automatically, but If I don't want to support system column, I have to take care to erase it, e.g. CopyTo. at least this is counterintuitive for me. Why do I have to take care of something I don't support? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2646833158 > when I save a table to csv, it will also save rowid into csv. no system will do like this. My problem is with this statement. I don't think there's a universal definition and use case for "system columns". Spark has one. Postgres has another. Our system has another. You use `_rowid` as an example. Is that the `_rowid` within a single file? Or is that the `_rowid` of the entire table (similar to Postgres' `ctid`)? I think it's reasonable for both to exist and for both to be considered system columns. The former does somewhat "loose" it's meaning when copied through a query from one file to another and it only really makes sense to generate it dynamically when reading a file. The latter could be copied from one file to another without issues. In our case we use system columns to speed up access to JSON: we take a row with json data such as `json_col: text = [{"a": 1, "b": "lorem"}, {"a": 2}]` and split it into `_lf__json_col__a: int = [1, 2]` and `__lf__json_col__b: text = ["lorem", null]`. This is well known technique, it's basically [what ClickHouse does](https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse). We write these to files (they are not dynamically generated) and want them to be treated as normal columns when reading/writing. We just don't want them to show up when a user does `select *`. Is this not a valid use case for system columns? My thought is to establish a piece of metadata marking a column as a system column with the implementation doing nothing beyond excluding them from `select *` unless they are explicitly included. That seems to me like a universally agreed upon thing to do with system columns. Anything else that is not part of a universal definition of a system column is IMO something that should be implemented system by system by rewriting logical plans, customizing reading and writing, etc. Having it as field metadata means this information should be accessible from most hook points in DataFusion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2646781160 > > if a system/metadata column is not selected in project, after project, in spark system/metadata column's still a valid system/metadata column, but it's not that in #14362 > > Is this specific to Spark, or do most systems work this way? We should offer customizable options so users can choose their preference—unless this behavior is standard across most systems. I think both in this PR and #14362 are customizable, but it's harder in #14362, because many behaviors depends on the assumption "there is always a projection in the first pass". @jayzhan211 could you please also see the UT I put in #14362. you can see that this assumption is not correct in dataframe api. and you can also see the rowid save load problem. when I save a table to csv, it will also save rowid into csv. no system will do like this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
jayzhan211 commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2646732851 > if a system/metadata column is not selected in project, after project, in spark system/metadata column's still a valid system/metadata column, but it's not that in https://github.com/apache/datafusion/pull/14362 Is this specific to Spark, or do most systems work this way? We should offer customizable options so users can choose their preference—unless this behavior is standard across most systems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2644943280 for #14362 ,if we want to make it feasible, I think at least it should check metadata dict loaded from file for every format. (e.g. parquet), otherwise the whole system may be broken if it reads a file that contains magic meta. Is it right? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2644417122 > > > > for stopping system column propagation, have you tested other logical plans e.g. union intersect? > > > > > > > > > I have not tried constructing logical plans directly. My thought was that even if you do some sort of union you'll still have a projection. For example: > > > ```sql > > > select *, _rowid > > > from t1 > > > union all > > > select *, _rowid > > > from t2 > > > ``` > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > In this case `_rowid` would come out as a "regular" column. If you did: > > > ```python > > > select * > > > from t1 > > > union all > > > select * > > > from t2 > > > ``` > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > You get no `_rowid` column in the output at all. > > > I'm sure you can create funky stuff by constructing a plan without a projection, but I don't think that happens in the real world even with the optimizer: there is always a projection in the first pass and only after that is evaluated would it get pushed down to a tablescan. > > > ```python > > > > SET datafusion.optimizer.max_passes = 0; > > > 0 row(s) fetched. > > > Elapsed 0.001 seconds. > > > > > > > create table t (x int, y int); > > > 0 row(s) fetched. > > > Elapsed 0.002 seconds. > > > > > > > explain > > > select * > > > from t; > > > +---+---+ > > > | plan_type | plan | > > > +---+---+ > > > | logical_plan | Projection: t.x, t.y | > > > | | TableScan: t| > > > | physical_plan | MemoryExec: partitions=1, partition_sizes=[0] | > > > | | | > > > +---+---+ > > > 2 row(s) fetched. > > > Elapsed 0.006 seconds. > > > > > > > explain > > > select x > > > from t; > > > +---+---+ > > > | plan_type | plan | > > > +---+---+ > > > | logical_plan | Projection: t.x | > > > | | TableScan: t| > > > | physical_plan | MemoryExec: partitions=1, partition_sizes=[0] | > > > | | | > > > +---+---+ > > > ``` > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > So I don't think we have to worry about a plan without a projection as long as a plan with a projection correctly evaluates wildcards. > > > > > > as I previously asked, in your implementation "a system column stops being a system column once it's projected" ? > > If this is correct, then as you said there's no need to add more UTs. > > I have to call out it seems that this behavior is incompatible with Spark. I know whether follow Spark's standard is another problem. but community should be aware of this. > > Do you know the rationale for spark that why it doesn't consider projected columns as normal column? I dont know why it doesn't consider metadata/system column as normal column. from my opnion, normal column can be propagated automatically. but metadata/system column should have another propagate strategy, so distinguish them is a correct way. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2644409227 > Do you know the rationale for spark that why it doesn't consider projected columns as normal column? If my memory is correct, it's also designed for dataframe api user. for `withColumn` clause, it's a project. dataframe users often chain withColumn operators. ```python df.withColumn(...).withColumn(...) ``` If project cannot propagate metadata/system columns, it's very hard to use this chain. I cannot tell are there any other differences between spark standard and #14362 's standard because most sqls will have at least one project. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
jayzhan211 commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2644397522 > > > for stopping system column propagation, have you tested other logical plans e.g. union intersect? > > > > I have not tried constructing logical plans directly. My thought was that even if you do some sort of union you'll still have a projection. For example: > > > > ```sql > > select *, _rowid > > from t1 > > union all > > select *, _rowid > > from t2 > > ``` > > > > In this case `_rowid` would come out as a "regular" column. If you did: > > > > ```python > > select * > > from t1 > > union all > > select * > > from t2 > > ``` > > > > You get no `_rowid` column in the output at all. > > > > I'm sure you can create funky stuff by constructing a plan without a projection, but I don't think that happens in the real world even with the optimizer: there is always a projection in the first pass and only after that is evaluated would it get pushed down to a tablescan. > > > > ```python > > > SET datafusion.optimizer.max_passes = 0; > > 0 row(s) fetched. > > Elapsed 0.001 seconds. > > > > > create table t (x int, y int); > > 0 row(s) fetched. > > Elapsed 0.002 seconds. > > > > > explain > > select * > > from t; > > +---+---+ > > | plan_type | plan | > > +---+---+ > > | logical_plan | Projection: t.x, t.y | > > | | TableScan: t| > > | physical_plan | MemoryExec: partitions=1, partition_sizes=[0] | > > | | | > > +---+---+ > > 2 row(s) fetched. > > Elapsed 0.006 seconds. > > > > > explain > > select x > > from t; > > +---+---+ > > | plan_type | plan | > > +---+---+ > > | logical_plan | Projection: t.x | > > | | TableScan: t| > > | physical_plan | MemoryExec: partitions=1, partition_sizes=[0] | > > | | | > > +---+---+ > > ``` > > > > So I don't think we have to worry about a plan without a projection as long as a plan with a projection correctly evaluates wildcards. > > as I previously asked, in your implementation "a system column stops being a system column once it's projected" ? > If this is correct, then as you said there's no need to add more UTs. > > I have to call out it seems that this behavior is incompatible with Spark. I know whether follow Spark's standard is another problem. but community should be aware of this. > > > Do you know the rationale for spark that why it doesn't consider projected columns as normal column? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2644389349 > > as I previously asked, in your implementation "a system column stops being a system column once it's projected" ? If this is correct, then as you said there's no need to add more UTs. > > I have to call out it seems that this behavior is incompatible with Spark. I know whether follow Spark's standard is another problem. but community should be aware of this. I have to revoke my judgements for #14362 from metadata/system propagation side, because previously judgements are based on the assumption that difference between two approaches is just how to transmit the information, the goal is same. but it seems that it's not true. #14362 has own propagation rules. It's really hard for me to talk about a totally different thing. let's look pros and cons directly. pros of this approach: 1. dataframe api friendly. There's no chance to hurt themself for dataframe api user. 2. Spark compatible. Spark has already been battle tested in many areas, it's design to be compatible with many different data sources and data sinks. So there's fewer unknown problems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2643492241 > > for stopping system column propagation, have you tested other logical plans e.g. union intersect? > > I have not tried constructing logical plans directly. My thought was that even if you do some sort of union you'll still have a projection. For example: > > ```sql > select *, _rowid > from t1 > union all > select *, _rowid > from t2 > ``` > > In this case `_rowid` would come out as a "regular" column. If you did: > > ```python > select * > from t1 > union all > select * > from t2 > ``` > > You get no `_rowid` column in the output at all. > > I'm sure you can create funky stuff by constructing a plan without a projection, but I don't think that happens in the real world even with the optimizer: there is always a projection in the first pass and only after that is evaluated would it get pushed down to a tablescan. > > ```python > > SET datafusion.optimizer.max_passes = 0; > 0 row(s) fetched. > Elapsed 0.001 seconds. > > > create table t (x int, y int); > 0 row(s) fetched. > Elapsed 0.002 seconds. > > > explain > select * > from t; > +---+---+ > | plan_type | plan | > +---+---+ > | logical_plan | Projection: t.x, t.y | > | | TableScan: t| > | physical_plan | MemoryExec: partitions=1, partition_sizes=[0] | > | | | > +---+---+ > 2 row(s) fetched. > Elapsed 0.006 seconds. > > > explain > select x > from t; > +---+---+ > | plan_type | plan | > +---+---+ > | logical_plan | Projection: t.x | > | | TableScan: t| > | physical_plan | MemoryExec: partitions=1, partition_sizes=[0] | > | | | > +---+---+ > ``` > > So I don't think we have to worry about a plan without a projection as long as a plan with a projection correctly evaluates wildcards. as I previously asked, in your implementation "a system column stops being a system column once it's projected" ? If this is correct, then as you said there's no need to add more UTs. I have to call out it seems that this behavior is incompatible with Spark. I know whether follow Spark's standard is another problem. but community should be aware of this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2643455108 > > for _rowid save load problem. currently, data engineers have to write a with clause in #14362 > > why do they have to write a `with` clause? It's not clear to me what operations we are talking about but if it's copying data from one table to another: > > ```sql > INSERT INTO t1 > SELECT * FROM t2; > ``` > > Will not include system columns from t2. If they do: > > ```sql > INSERT INTO t1 > SELECT *, _rowid FROM t2; > ``` > > Then yes this will attempt to insert `_rowid` into `t1` which is maybe meaningless but if the user explicitly tried to do that... well we shouldn't stop them. I must say the premise again. just the scenario when data engineers use datafusion as compute engine and want to read and write parquet file. I know there's no problem when we do insert. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2643295786 > for _rowid save load problem. currently, data engineers have to write a with clause in https://github.com/apache/datafusion/pull/14362 why do they have to write a `with` clause? It's not clear to me what operations we are talking about but if it's copying data from one table to another: ```sql INSERT INTO t1 SELECT * FROM t2; ``` Will not include system columns from t2. If they do: ```sql INSERT INTO t1 SELECT *, _rowid FROM t2; ``` Then yes this will attempt to insert `_rowid` into `t1` which is maybe meaningless but if the user explicitly tried to do that... well we shouldn't stop them. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2643284573 > for stopping system column propagation, have you tested other logical plans e.g. union intersect? I have not tried constructing logical plans directly. My thought was that even if you do some sort of union you'll still have a projection. For example: ```sql select *, _rowid from t1 union all select *, _rowid from t2 ``` In this case `_rowid` would come out as a "regular" column. If you did: ```python select * from t1 union all select * from t2 ``` You get no `_rowid` column in the output at all. I'm sure you can create funky stuff by constructing a plan without a projection, but I don't think that happens in the real world even with the optimizer: there is always a projection in the first pass and only after that is evaluated would it get pushed down to a tablescan. ```python > SET datafusion.optimizer.max_passes = 0; 0 row(s) fetched. Elapsed 0.001 seconds. > create table t (x int, y int); 0 row(s) fetched. Elapsed 0.002 seconds. > explain select * from t; +---+---+ | plan_type | plan | +---+---+ | logical_plan | Projection: t.x, t.y | | | TableScan: t| | physical_plan | MemoryExec: partitions=1, partition_sizes=[0] | | | | +---+---+ 2 row(s) fetched. Elapsed 0.006 seconds. > explain select x from t; +---+---+ | plan_type | plan | +---+---+ | logical_plan | Projection: t.x | | | TableScan: t| | physical_plan | MemoryExec: partitions=1, partition_sizes=[0] | | | | +---+---+ ``` So I don't think we have to worry about a plan without a projection as long as a plan with a projection correctly evaluates wildcards. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2642557135 > @chenkovsky You mentioned that the issues of #14362 are 1) duplicated field issues 2) HashMap > > I think overall datafusion only care about the system columns that are generated by datafusion, other system columns from other engine should be considered normal columns, but since this is just based on my guess not from any practical experience, is there any concern of this assumption? > > For HashMap, I don't think it has performance issue since we only check boolean from it and we don't need to access it frequently given the field should be fixed once created. @jayzhan211 datafusion supports loading files, without such feature, of course, we don't need to take care about this. but with this feature we have to take care about this. and could you please also see the discussion in #14362. we have more different opinions. for example, system/metadata column propagation problem. for _rowid save load problem. currently, data engineer have to write a with clause in #14362 . when using dataframe api, data engineer also have to take more care about metadata dict in #14362. I haven't seen such behavior in other systems. It adds a lot of burden to data engineer. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
jayzhan211 commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2642508907 @chenkovsky You mentioned that the issues of #14362 are 1) duplicated field issues 2) HashMap I think overall datafusion only care about the system columns that are generated by datafusion, other system columns from other engine should be considered normal columns, but since this is just based on my guess not from any practical experience, is there any concern of this assumption? For HashMap, I don't think it has performance issue since we only check boolean from it and we don't need to access it frequently given the field should be fixed once created. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2641808013 > > would you mind to add some tests about stopping system column propagation? I haven't seen them on your branch? > > have you seen these? > > https://github.com/apache/datafusion/blob/af6e9725965938e68a6f401dd4ef7fe9eed0c6f1/datafusion/core/tests/sql/system_columns.rs#L318-L376 so for _row_id save load problem. so in your implementation "a system column stops being a system column once it's projected" ? for stopping system column propagation, do you tested other logical plans e.g. union intersect? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2641766867 > would you mind to add some tests about stopping system column propagation? I haven't seen them on your branch? have you seen these? https://github.com/apache/datafusion/blob/af6e9725965938e68a6f401dd4ef7fe9eed0c6f1/datafusion/core/tests/sql/system_columns.rs#L318-L376 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2641751804 > @chenkovsky the main difference between the two approaches is how to transmit the information on which columns are system columns and which aren't. The approach in this PR does it explicitly by modifying `DFSchema`, `TableProvider` and a couple other spots and also manipulating the meaning of field indexes in `DFSchema`. The approach in #14362 does it by adding metadata to `Field`. They both work but each have pros and cons. > > How about we check with @alamb, @Omega359 and @jayzhan211 what they think sounds best? > > Either way I still think we should name these `system columns` not `metadata columns` to avoid confusion with `DFScheama::metadata_schema` and `DFSchema::metadata` meaning two very different things, etc. @adriangb of course, I want to listen other's opinions. and I also think name is a small thing. changing to system column are also ok. besides _rowid save and load problem. before compare pros and cons, would you mind to add some tests about stopping system column propagation? I haven't seen them on your branch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2641172445 @chenkovsky the main difference between the two approaches is how to transmit the information on which columns are system columns and which aren't. The approach in this PR does it explicitly by modifying `DFSchema`, `TableProvider` and a couple other spots and also manipulating the meaning of field indexes in `DFSchema`. The approach in #14362 does it by adding metadata to `Field`. They both work but each have pros and cons. How about we check with @alamb, @Omega359 and @jayzhan211 what they think sounds best? Either way I still think we should name these `system columns` not `metadata columns` to avoid confusion with `DFScheama::metadata_schema` and `DFSchema::metadata` meaning two very different things, etc. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2636206099 > ( I'd like to see some tests against `SchemaAdapter`) @adriangb feel free to correct me, I know maybe I'm wrong. it seems that schema adapater has no relationship with metadata column. metadata column only takes effect on logical plan and physical plan. after selected as recordbatch, no need to distinguish metadata column from normal column. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1938650952 ## datafusion/expr/src/logical_plan/plan.rs: ## @@ -370,26 +370,6 @@ impl LogicalPlan { } } -/// Gather the schema representating the metadata columns that this plan outputs. -/// This is done by recursively traversing the plan and collecting the metadata columns that are output -/// from inner nodes. -/// See [TableProvider](../catalog/trait.TableProvider.html#method.metadata_columns) for more information on metadata columns in general. Review Comment: > I don't seem to be able to find that reference in this PR ... was this added elsewhere in a PR that has yet to be merged? @Omega359 also please review latest codes, it's removed already. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1938650383 ## datafusion/core/src/execution/session_state.rs: ## @@ -1330,7 +1330,7 @@ impl SessionStateBuilder { /// let url = Url::try_from("file://").unwrap(); /// let object_store = object_store::local::LocalFileSystem::new(); /// let state = SessionStateBuilder::new() -/// .with_config(SessionConfig::new()) Review Comment: Hi, @Omega359 could you please review latested commits. I think these are already reverted before. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1938514464 ## datafusion/expr/src/logical_plan/plan.rs: ## @@ -370,26 +370,6 @@ impl LogicalPlan { } } -/// Gather the schema representating the metadata columns that this plan outputs. -/// This is done by recursively traversing the plan and collecting the metadata columns that are output -/// from inner nodes. -/// See [TableProvider](../catalog/trait.TableProvider.html#method.metadata_columns) for more information on metadata columns in general. Review Comment: sorry my fault . i will delete these lines. please ignore them -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
Omega359 commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1938512422 ## datafusion/expr/src/logical_plan/plan.rs: ## @@ -370,26 +370,6 @@ impl LogicalPlan { } } -/// Gather the schema representating the metadata columns that this plan outputs. -/// This is done by recursively traversing the plan and collecting the metadata columns that are output -/// from inner nodes. -/// See [TableProvider](../catalog/trait.TableProvider.html#method.metadata_columns) for more information on metadata columns in general. Review Comment: I don't seem to be able to find that reference in this PR ... was this added elsewhere in a PR that has yet to be merged? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
Omega359 commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1938511635 ## datafusion/core/src/execution/session_state.rs: ## @@ -1330,7 +1330,7 @@ impl SessionStateBuilder { /// let url = Url::try_from("file://").unwrap(); /// let object_store = object_store::local::LocalFileSystem::new(); /// let state = SessionStateBuilder::new() -/// .with_config(SessionConfig::new()) Review Comment: The changes in this file add no value and should be reverted. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2629393602 @adriangb as I have said, it seems that you are thinking about this from database side, I'm talking about compute engine problem. the users i mean is big data engineer. changing metadata dict is very easy through dataframe api. compute engine should not make any assumption on input data. BTW, for _rowid save and load problem, you have a solution now? > field indexes being contiguous If you have read the discussion history of this PR. in my initial implement, field index is contiguous. METADATA_OFFSET is not the key of this design. In my initial design, metadata columns are appended at the end of normal column array virtually. If everyone think METADATA_OFFSET is evil, It's easy to revert it back. that's also why I didn't implement metadata column support for other logical plan. I want to hear more ideas. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2629230026 > I don't know whether you agree, when make a design, every component should do one thing, and do it well. reuse metadata map violates this, it takes two roles. I have to disagree on that. Field metadata is a hook point to do these sorts of things without having to pipe major code changes throughout the entire codebase. I think this *is* the use case for field metadata. > what makes things worse is that this map is mutable by user. Who do you consider the "user" in this scenario? I am a system implementer and a user of DataFusion. By design and necessity I edit metadata on fields (e.g. to indicate a UTF8 columns is JSON data). The users of the system I implement do not edit field metadata in my system. Maybe you're coming at it from a different perspective of "user" that I'm not understanding? > but for metadata column or system column, we wish it's const for every data source. Maybe but I don't see how it's any different for a `TableProvider` to declare which columns are system columns via a new method on `TableProvider::system_columns_schema` vs adding metadata to fields returned from `TableProvider::schema`. Ultimately I think using field metadata will result in a smaller change in terms of LOC, less new methods and other API changes in DataFusion, will be less likely to break DataFusion implementers code (e.g. because they make assumptions about field indexes being contiguous; I'd like to see some tests against `SchemaAdapter`) and will be easier to retrofit into existing systems with system/metadata columns. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2629206595 > Great let's fix those unit tests on your branch then we can look at the pros/cons of the approaches we've come up with. I don't know whether you agree, when make a design, every component should do one thing, and do it well. reuse metadata map violates this, it takes two roles. what makes things worse is that this map is mutable by user. but for metadata column or system column, we wish it's const for every data source. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2629205242 @alamb @adriangb please review it again. I copied some tests from @adriangb and some tests from spark, supported join, project, subqueryalias and dataframe api. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2629001039 Great let's fix those unit tests on your branch then we can look at the pros/cons of the approaches we've come up with. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2628928295 > Update here is we are very close to cutting the 45 release branch. See > > * More details on [Release DataFusion `45.0.0` #14008 (comment)](https://github.com/apache/datafusion/issues/14008#issuecomment-2628923232) > > Once we do that let's plan to have this PR ready to merge. > > Thanks again @chenkovsky and @adriangb Yes, thanks @adriangb, you also prepared nice UTs for scenario that I have not covered. I'm fixing these UTs now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
alamb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2628924834 Update here is we are very close to cutting the 45 release branch. See - More details on https://github.com/apache/datafusion/issues/14008#issuecomment-2628923232 Once we do that let's plan to have this PR ready to merge. Thanks again @chenkovsky and @adriangb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1937555727 ## datafusion/core/src/execution/session_state.rs: ## @@ -105,11 +105,11 @@ use uuid::Uuid; /// # #[tokio::main] /// # async fn main() -> Result<()> { /// let state = SessionStateBuilder::new() -/// .with_config(SessionConfig::new()) Review Comment: Revert? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1937553927 ## datafusion-testing: ## Review Comment: This needs to be reverted -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2622481316 Here's what I think is a much simpler and more flexible change: https://github.com/apache/datafusion/pull/14362 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2621372943 I also have an another approach. ```rust struct ColumnIndex { pub index: usize, pub is_metadata_column: bool } impl Into for ColumnIndex { } impl From for ColumnIndex { } ``` then we can hide METADATA_OFFSET logic without performance concern. what do you think about this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2621231579 > > Then we cannot make sure metadata columns always have same column index in different tables > > Not quite understand the reason, with modified metadata, meta column could be considered as the same column like the normal one inside `Schema`. **Field index** for accessing field in `Schema` in contiguous like before, indices are changed if the schema is changed. the question whether treat metadata column like normal column, I want to listen more ideas. by the way, at the beginning, I didn't use offset to separate metadata column and normal column. you said that it's error prone. I think when we do schema migration, non-stable column index is really error prone. did I misunderstand something? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2620931040 Yep from an initial impression I agree with you @jayzhan211 that does seem like it should work. Then you just need to hook into wherever select / projections are evaluated and make sure if a column isn't explicitly mentioned by name and it has that metadata it is not included in the output -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
jayzhan211 commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2620757724 > Then we cannot make sure metadata columns always have same column index in different tables Not quite understand the reason, with modified metadata, meta column could be considered as the same column like the normal one inside `Schema`. **Field index** for accessing field in `Schema` in contiguous like before, indices are changed if the schema is changed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2620712360 > > sorry @jayzhan211 I edited my comment above after you reacted to it I think, I felt bad about requesting an explainer when there was already a thread with details > > Maybe we can think about whether changing the `metadata` in `Field` makes sense > > ```rust > pub struct Field { > name: String, > data_type: DataType, > nullable: bool, > dict_id: i64, > dict_is_ordered: bool, > /// A map of key-value pairs containing additional custom meta data. > metadata: HashMap, > } > > Field.metadata['is_system_generated'] = 'true' > ``` If we use hashmap to store this flag. Then we cannot make sure metadata columns always have same column index in different tables. I don't know whether it's important. but I think it makes something easy. for example, when we want to use metadata column somewhere other than datafusion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
jayzhan211 commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2620677936 > sorry @jayzhan211 I edited my comment above after you reacted to it I think, I felt bad about requesting an explainer when there was already a thread with details Maybe we can think about whether changing the `metadata` in `Field` makes sense ```rust pub struct Field { name: String, data_type: DataType, nullable: bool, dict_id: i64, dict_is_ordered: bool, /// A map of key-value pairs containing additional custom meta data. metadata: HashMap, } Field.metadata['is_system_generated'] = 'true' ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2620654418 A high level question: in what ways do metadata columns have to behave differently than other columns? Is it just that they don't show up in `SELECT *` queries? What other sorts of places do they need to behave differently? Is there a way we could do this with metadata e.g. if you set `datafusion.system_column`: `true` on a field then it's considered a system/metadata column? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
jayzhan211 commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1933220974 ## datafusion-examples/examples/metadata_columns.rs: ## @@ -0,0 +1,330 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::any::Any; +use std::fmt::{self, Debug, Formatter}; +use std::sync::{Arc, Mutex}; +use std::time::Duration; + +use arrow::array::{ArrayRef, StringArray, UInt64Array}; +use arrow_schema::SchemaBuilder; +use async_trait::async_trait; +use datafusion::arrow::array::{UInt64Builder, UInt8Builder}; +use datafusion::arrow::datatypes::{DataType, Field, Schema, SchemaRef}; +use datafusion::arrow::record_batch::RecordBatch; +use datafusion::datasource::{TableProvider, TableType}; +use datafusion::error::Result; +use datafusion::execution::context::TaskContext; +use datafusion::physical_expr::EquivalenceProperties; +use datafusion::physical_plan::execution_plan::{Boundedness, EmissionType}; +use datafusion::physical_plan::memory::MemoryStream; +use datafusion::physical_plan::{ +project_schema, DisplayAs, DisplayFormatType, ExecutionPlan, Partitioning, +PlanProperties, SendableRecordBatchStream, +}; + +use datafusion::prelude::*; + +use datafusion::catalog::Session; +use datafusion_common::METADATA_OFFSET; +use itertools::Itertools; +use tokio::time::timeout; + +/// This example demonstrates executing a simple query against a custom datasource +#[tokio::main] +async fn main() -> Result<()> { +// create our custom datasource and adding some users +let db = CustomDataSource::default(); +db.populate_users(); + +search_accounts(db.clone(), "select * from accounts", 3).await?; +search_accounts( +db.clone(), +"select _rowid, _file, * from accounts where _rowid > 1", +1, +) +.await?; +search_accounts( +db.clone(), +"select _rowid, _file, * from accounts where _file = 'file-0'", +1, +) +.await?; + +Ok(()) +} + +async fn search_accounts( +db: CustomDataSource, +sql: &str, +expected_result_length: usize, +) -> Result<()> { +// create local execution context +let ctx = SessionContext::new(); +ctx.register_table("accounts", Arc::new(db)).unwrap(); +let options = SQLOptions::new().with_allow_ddl(false); + +timeout(Duration::from_secs(10), async move { +let dataframe = ctx.sql_with_options(sql, options).await.unwrap(); +let result = dataframe.collect().await.unwrap(); +let record_batch = result.first().unwrap(); + +assert_eq!(expected_result_length, record_batch.column(1).len()); +dbg!(record_batch.columns()); +}) +.await +.unwrap(); + +Ok(()) +} + +/// A User, with an id and a bank account +#[derive(Clone, Debug)] +struct User { +id: u8, +bank_account: u64, +} + +/// A custom datasource, used to represent a datastore with a single index +#[derive(Clone)] +pub struct CustomDataSource { +inner: Arc>, +metadata_columns: SchemaRef, +} + +struct CustomDataSourceInner { +data: Vec, +} + +impl Debug for CustomDataSource { +fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result { +f.write_str("custom_db") +} +} + +impl CustomDataSource { +pub(crate) async fn create_physical_plan( +&self, +projections: Option<&Vec>, +schema: SchemaRef, +) -> Result> { +Ok(Arc::new(CustomExec::new(projections, schema, self.clone( +} + +pub(crate) fn populate_users(&self) { +self.add_user(User { +id: 1, +bank_account: 9_000, +}); +self.add_user(User { +id: 2, +bank_account: 100, +}); +self.add_user(User { +id: 3, +bank_account: 1_000, +}); +} + +fn add_user(&self, user: User) { +let mut inner = self.inner.lock().unwrap(); +inner.data.push(user); +} +} + +impl Default for CustomDataSource { +fn default() -> Self { +CustomDataSource { +inner: Arc::new(Mutex::new(CustomDataSourceInner { +data: Default::default(), +})), +metadata_columns: Arc::new(Schema::new(vec![
Re: [PR] feat: metadata columns [datafusion]
jayzhan211 commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1933220974 ## datafusion-examples/examples/metadata_columns.rs: ## @@ -0,0 +1,330 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::any::Any; +use std::fmt::{self, Debug, Formatter}; +use std::sync::{Arc, Mutex}; +use std::time::Duration; + +use arrow::array::{ArrayRef, StringArray, UInt64Array}; +use arrow_schema::SchemaBuilder; +use async_trait::async_trait; +use datafusion::arrow::array::{UInt64Builder, UInt8Builder}; +use datafusion::arrow::datatypes::{DataType, Field, Schema, SchemaRef}; +use datafusion::arrow::record_batch::RecordBatch; +use datafusion::datasource::{TableProvider, TableType}; +use datafusion::error::Result; +use datafusion::execution::context::TaskContext; +use datafusion::physical_expr::EquivalenceProperties; +use datafusion::physical_plan::execution_plan::{Boundedness, EmissionType}; +use datafusion::physical_plan::memory::MemoryStream; +use datafusion::physical_plan::{ +project_schema, DisplayAs, DisplayFormatType, ExecutionPlan, Partitioning, +PlanProperties, SendableRecordBatchStream, +}; + +use datafusion::prelude::*; + +use datafusion::catalog::Session; +use datafusion_common::METADATA_OFFSET; +use itertools::Itertools; +use tokio::time::timeout; + +/// This example demonstrates executing a simple query against a custom datasource +#[tokio::main] +async fn main() -> Result<()> { +// create our custom datasource and adding some users +let db = CustomDataSource::default(); +db.populate_users(); + +search_accounts(db.clone(), "select * from accounts", 3).await?; +search_accounts( +db.clone(), +"select _rowid, _file, * from accounts where _rowid > 1", +1, +) +.await?; +search_accounts( +db.clone(), +"select _rowid, _file, * from accounts where _file = 'file-0'", +1, +) +.await?; + +Ok(()) +} + +async fn search_accounts( +db: CustomDataSource, +sql: &str, +expected_result_length: usize, +) -> Result<()> { +// create local execution context +let ctx = SessionContext::new(); +ctx.register_table("accounts", Arc::new(db)).unwrap(); +let options = SQLOptions::new().with_allow_ddl(false); + +timeout(Duration::from_secs(10), async move { +let dataframe = ctx.sql_with_options(sql, options).await.unwrap(); +let result = dataframe.collect().await.unwrap(); +let record_batch = result.first().unwrap(); + +assert_eq!(expected_result_length, record_batch.column(1).len()); +dbg!(record_batch.columns()); +}) +.await +.unwrap(); + +Ok(()) +} + +/// A User, with an id and a bank account +#[derive(Clone, Debug)] +struct User { +id: u8, +bank_account: u64, +} + +/// A custom datasource, used to represent a datastore with a single index +#[derive(Clone)] +pub struct CustomDataSource { +inner: Arc>, +metadata_columns: SchemaRef, +} + +struct CustomDataSourceInner { +data: Vec, +} + +impl Debug for CustomDataSource { +fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result { +f.write_str("custom_db") +} +} + +impl CustomDataSource { +pub(crate) async fn create_physical_plan( +&self, +projections: Option<&Vec>, +schema: SchemaRef, +) -> Result> { +Ok(Arc::new(CustomExec::new(projections, schema, self.clone( +} + +pub(crate) fn populate_users(&self) { +self.add_user(User { +id: 1, +bank_account: 9_000, +}); +self.add_user(User { +id: 2, +bank_account: 100, +}); +self.add_user(User { +id: 3, +bank_account: 1_000, +}); +} + +fn add_user(&self, user: User) { +let mut inner = self.inner.lock().unwrap(); +inner.data.push(user); +} +} + +impl Default for CustomDataSource { +fn default() -> Self { +CustomDataSource { +inner: Arc::new(Mutex::new(CustomDataSourceInner { +data: Default::default(), +})), +metadata_columns: Arc::new(Schema::new(vec![
Re: [PR] feat: metadata columns [datafusion]
jayzhan211 commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1933218367 ## datafusion-examples/examples/metadata_columns.rs: ## @@ -0,0 +1,330 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::any::Any; +use std::fmt::{self, Debug, Formatter}; +use std::sync::{Arc, Mutex}; +use std::time::Duration; + +use arrow::array::{ArrayRef, StringArray, UInt64Array}; +use arrow_schema::SchemaBuilder; +use async_trait::async_trait; +use datafusion::arrow::array::{UInt64Builder, UInt8Builder}; +use datafusion::arrow::datatypes::{DataType, Field, Schema, SchemaRef}; +use datafusion::arrow::record_batch::RecordBatch; +use datafusion::datasource::{TableProvider, TableType}; +use datafusion::error::Result; +use datafusion::execution::context::TaskContext; +use datafusion::physical_expr::EquivalenceProperties; +use datafusion::physical_plan::execution_plan::{Boundedness, EmissionType}; +use datafusion::physical_plan::memory::MemoryStream; +use datafusion::physical_plan::{ +project_schema, DisplayAs, DisplayFormatType, ExecutionPlan, Partitioning, +PlanProperties, SendableRecordBatchStream, +}; + +use datafusion::prelude::*; + +use datafusion::catalog::Session; +use datafusion_common::METADATA_OFFSET; +use itertools::Itertools; +use tokio::time::timeout; + +/// This example demonstrates executing a simple query against a custom datasource +#[tokio::main] +async fn main() -> Result<()> { +// create our custom datasource and adding some users +let db = CustomDataSource::default(); +db.populate_users(); + +search_accounts(db.clone(), "select * from accounts", 3).await?; +search_accounts( +db.clone(), +"select _rowid, _file, * from accounts where _rowid > 1", +1, +) +.await?; +search_accounts( +db.clone(), +"select _rowid, _file, * from accounts where _file = 'file-0'", +1, +) +.await?; + +Ok(()) +} + +async fn search_accounts( +db: CustomDataSource, +sql: &str, +expected_result_length: usize, +) -> Result<()> { +// create local execution context +let ctx = SessionContext::new(); +ctx.register_table("accounts", Arc::new(db)).unwrap(); +let options = SQLOptions::new().with_allow_ddl(false); + +timeout(Duration::from_secs(10), async move { +let dataframe = ctx.sql_with_options(sql, options).await.unwrap(); +let result = dataframe.collect().await.unwrap(); +let record_batch = result.first().unwrap(); + +assert_eq!(expected_result_length, record_batch.column(1).len()); +dbg!(record_batch.columns()); +}) +.await +.unwrap(); + +Ok(()) +} + +/// A User, with an id and a bank account +#[derive(Clone, Debug)] +struct User { +id: u8, +bank_account: u64, +} + +/// A custom datasource, used to represent a datastore with a single index +#[derive(Clone)] +pub struct CustomDataSource { +inner: Arc>, +metadata_columns: SchemaRef, +} + +struct CustomDataSourceInner { +data: Vec, +} + +impl Debug for CustomDataSource { +fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result { +f.write_str("custom_db") +} +} + +impl CustomDataSource { +pub(crate) async fn create_physical_plan( +&self, +projections: Option<&Vec>, +schema: SchemaRef, +) -> Result> { +Ok(Arc::new(CustomExec::new(projections, schema, self.clone( +} + +pub(crate) fn populate_users(&self) { +self.add_user(User { +id: 1, +bank_account: 9_000, +}); +self.add_user(User { +id: 2, +bank_account: 100, +}); +self.add_user(User { +id: 3, +bank_account: 1_000, +}); +} + +fn add_user(&self, user: User) { +let mut inner = self.inner.lock().unwrap(); +inner.data.push(user); +} +} + +impl Default for CustomDataSource { +fn default() -> Self { +CustomDataSource { +inner: Arc::new(Mutex::new(CustomDataSourceInner { +data: Default::default(), +})), +metadata_columns: Arc::new(Schema::new(vec![
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2620516737 sorry @jayzhan211 I edited my comment above after you reacted to it I think -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2620512623 (I found previous discussion on point 2 now, see https://github.com/apache/datafusion/pull/14057#discussion_r1909641034) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2620509986 I opened a PR with a lot of docstrings and some minor code edits: https://github.com/chenkovsky/datafusion/pull/1/files @chenkovsky could you take a look and see if it's mergeable? I'm hoping since it's mostly docstrings there shouldn't be much issue. I do have two higher level pieces of feedback on this change: 1. I suggest we rename these to `system columns` as they are called in Postgres, especially because of confusion with metadata of the schema (`DFSchema.inner.schema.metadata`) vs schema of the metadata (`DFSchema.metadata`). 2. Is it really not possible to have a better API than the global `METADATA_OFFSET`? It seems really... terrible honestly. I did not try to implement something else so maybe there's a good reason for it that I am not seeing, but could @chenkovsky or @jayzhan211 give me a quick pitch on why this is necessary before I embarked on a futile path. In my mind we could always append metadata columns to the end right? If you have `a,b` and `meta1,meta2` the indices are `1,2,3,4` respectively. And if you append `c` the metadata columns get bumped. Do we promise that indices are preserved in some way if you modify the schema? Would this break anything? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2620005591 I will try to help here Andrew. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
alamb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2619979342 I am somewhat overwhelmed with - https://github.com/apache/datafusion/issues/14008 (and also - https://github.com/apache/arrow-rs/issues/6929 / https://github.com/apache/arrow-rs/issues/7034 It is on my radar to merge. Maybe someone else can make some of the doc / comment changes I suggested above (either on this PR or as a follow on) which would make it easier for me to merge -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
alamb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2596012697 I plan to merge this PR in tomorrow and file follow on tickets unless anyone would like additional time to review -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2594790967 > > > Can these metadata columns utilize normal column properties, like ordering equivalences, constantness, distinctness etc.? For example, AFAIU rowid is an ordered column, and if I sort the table by rowid, the SortExec would be removed? (it seems to me not yet at this point) Can we iterate over the design to support those capabilities, too? > > > > > > I with this PR a custom table provider that was ordered by row_id could communicate that information to avoid a SortExec > > From what I can tell, the metadata columns is only a notion in the `LogicalPlan` > > Specifically, the `ExecutionPlan` returned by the provider is no different than any other `ExecutionPlan` so it can communicate sortedness via `ExecutionPlan::properties` as normal > > What I mean is: > > https://github.com/chenkovsky/datafusion/blob/5c4b5c4c7aee47b6287e5fcf32d87485ee1c9e37/datafusion/core/tests/sql/metadata_columns.rs#L389 > > When I print this query, there exists a SortExec for _rowid. But what I understand is _rowid should be a one-by-one increasing column? Maybe not, I use vec to store values in test, but if the inner datastructure is btree, the scan order is not always increasing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
berkaysynnada commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2594770494 > > Can these metadata columns utilize normal column properties, like ordering equivalences, constantness, distinctness etc.? For example, AFAIU rowid is an ordered column, and if I sort the table by rowid, the SortExec would be removed? (it seems to me not yet at this point) Can we iterate over the design to support those capabilities, too? > > I with this PR a custom table provider that was ordered by row_id could communicate that information to avoid a SortExec > > From what I can tell, the metadata columns is only a notion in the `LogicalPlan` > > Specifically, the `ExecutionPlan` returned by the provider is no different than any other `ExecutionPlan` so it can communicate sortedness via `ExecutionPlan::properties` as normal What I mean is: https://github.com/chenkovsky/datafusion/blob/5c4b5c4c7aee47b6287e5fcf32d87485ee1c9e37/datafusion/core/tests/sql/metadata_columns.rs#L389 When I print this query, there exists a SortExec for _rowid. But what I understand is _rowid should be a one-by-one increasing column? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
alamb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2593033656 > Can these metadata columns utilize normal column properties, like ordering equivalences, constantness, distinctness etc.? For example, AFAIU rowid is an ordered column, and if I sort the table by rowid, the SortExec would be removed? (it seems to me not yet at this point) Can we iterate over the design to support those capabilities, too? I with this PR a custom table provider that was ordered by row_id could communicate that information to avoid a SortExec From what I can tell, the metadata columns is only a notion in the `LogicalPlan` Specifically, the `ExecutionPlan` returned by the provider is no different than any other `ExecutionPlan` so it can communicate sortedness via `ExecutionPlan::properties` as normal -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
berkaysynnada commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2587507757 Can these metadata columns utilize normal column properties, like ordering equivalences, constantness, distinctness etc.? For example, AFAIU rowid is an ordered column, and if I sort the table by rowid, the SortExec would be removed? (it seems to me not yet at this point) Can we iterate over the design to support those capabilities, too? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
alamb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2586885945 FWIW I ran the planning benchmarks on this branch and see no measurable difference. ✅ ``` ++ critcmp main feature_metadata_columns groupfeature_metadata_columns main - logical_select_all_from_1000 1.00 5.2±0.03ms? ?/sec1.01 5.3±0.04ms? ?/sec physical_plan_clickbench_all 1.00226.7±1.82ms? ?/sec1.00 227.1±1.40ms? ?/sec physical_plan_tpcds_all 1.00 1380.0±3.29ms? ?/sec1.00 1378.8±5.32ms? ?/sec physical_plan_tpch_all 1.00 90.2±0.70ms? ?/sec1.01 91.1±1.32ms? ?/sec physical_select_all_from_10001.02 42.4±0.30ms? ?/sec1.00 41.5±0.16ms? ?/sec ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2585948359 I guess the naming doesn't really hurt our use case so okay let's go with that if it means something in the domain in general 👍🏻 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
Omega359 commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2585929480 Metadata column is the name I'm familiar with in other systems. For example, [spark](https://spark.apache.org/docs/3.5.4/api/java/org/apache/spark/sql/connector/catalog/MetadataColumn.html)/[databricks](https://docs.databricks.com/en/ingestion/file-metadata-column.html) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2585764005 My only question is if "metadata" is the right name for these columns? Could it be "system" columns or something like that? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
adriangb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2585758539 We want this as well to hide "special" internal columns we create to speed up JSON columns. +1 for the feature! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
alamb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2585711106 Something other people have asked for in the past (whihc I can't find now) is the ability to know what file a particular row came from in a listing table that combines multiple files -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
alamb commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1912435293 ## datafusion/common/src/dfschema.rs: ## @@ -106,37 +106,175 @@ pub type DFSchemaRef = Arc; /// ``` #[derive(Debug, Clone, PartialEq, Eq)] pub struct DFSchema { +inner: QualifiedSchema, +/// Stores functional dependencies in the schema. +functional_dependencies: FunctionalDependencies, +/// metadata columns +metadata: Option, +} + +pub const METADATA_OFFSET: usize = usize::MAX >> 1; Review Comment: Can you please document what this is and how it relates to `DFSchema::inner` ## datafusion/core/tests/sql/metadata_columns.rs: ## @@ -0,0 +1,361 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::any::Any; +use std::fmt::{self, Debug, Formatter}; +use std::sync::{Arc, Mutex}; + +use arrow::compute::concat_batches; +use arrow_array::{ArrayRef, UInt64Array}; +use arrow_schema::SchemaBuilder; +use async_trait::async_trait; +use datafusion::arrow::array::{UInt64Builder, UInt8Builder}; +use datafusion::arrow::datatypes::{DataType, Field, Schema, SchemaRef}; +use datafusion::arrow::record_batch::RecordBatch; +use datafusion::datasource::file_format::csv::CsvSerializer; +use datafusion::datasource::file_format::write::BatchSerializer; +use datafusion::datasource::{TableProvider, TableType}; +use datafusion::error::Result; +use datafusion::execution::context::TaskContext; +use datafusion::physical_expr::EquivalenceProperties; +use datafusion::physical_plan::execution_plan::{Boundedness, EmissionType}; +use datafusion::physical_plan::memory::MemoryStream; +use datafusion::physical_plan::{ +project_schema, DisplayAs, DisplayFormatType, ExecutionPlan, Partitioning, +PlanProperties, SendableRecordBatchStream, +}; +use datafusion::prelude::*; + +use datafusion::catalog::Session; +use datafusion_common::METADATA_OFFSET; +use itertools::Itertools; + +/// A User, with an id and a bank account +#[derive(Clone, Debug)] +struct User { +id: u8, +bank_account: u64, +} + +/// A custom datasource, used to represent a datastore with a single index +#[derive(Clone)] +pub struct CustomDataSource { +inner: Arc>, +metadata_columns: SchemaRef, +} + +struct CustomDataSourceInner { +data: Vec, +} + +impl Debug for CustomDataSource { +fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result { +f.write_str("custom_db") +} +} + +impl CustomDataSource { +pub(crate) async fn create_physical_plan( +&self, +projections: Option<&Vec>, +schema: SchemaRef, +) -> Result> { +Ok(Arc::new(CustomExec::new(projections, schema, self.clone( +} + +pub(crate) fn populate_users(&self) { +self.add_user(User { +id: 1, +bank_account: 9_000, +}); +self.add_user(User { +id: 2, +bank_account: 100, +}); +self.add_user(User { +id: 3, +bank_account: 1_000, +}); +} + +fn add_user(&self, user: User) { +let mut inner = self.inner.lock().unwrap(); +inner.data.push(user); +} +} + +impl Default for CustomDataSource { +fn default() -> Self { +CustomDataSource { +inner: Arc::new(Mutex::new(CustomDataSourceInner { +data: Default::default(), +})), +metadata_columns: Arc::new(Schema::new(vec![Field::new( +"_rowid", +DataType::UInt64, +false, +)])), +} +} +} + +#[async_trait] +impl TableProvider for CustomDataSource { +fn as_any(&self) -> &dyn Any { +self +} + +fn schema(&self) -> SchemaRef { +SchemaRef::new(Schema::new(vec![ +Field::new("id", DataType::UInt8, false), +Field::new("bank_account", DataType::UInt64, true), +])) +} + +fn metadata_columns(&self) -> Option { +Some(self.metadata_columns.clone()) +} + +fn table_type(&self) -> TableType { +TableType::Base +} + +async fn scan( +&self, +_state: &dyn Session, +projection: Option<&Vec>, +//
Re: [PR] feat: metadata columns [datafusion]
jayzhan211 commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911886029 ## datafusion/common/src/dfschema.rs: ## @@ -442,22 +603,24 @@ impl DFSchema { /// Find all fields that match the given name pub fn fields_with_unqualified_name(&self, name: &str) -> Vec<&Field> { -self.fields() -.iter() -.filter(|field| field.name() == name) -.map(|f| f.as_ref()) -.collect() +let mut fields: Vec<&Field> = self.inner.fields_with_unqualified_name(name); +if let Some(schema) = self.metadata_schema() { +fields.append(&mut schema.fields_with_unqualified_name(name)); Review Comment: ```suggestion fields.append(schema.fields_with_unqualified_name(name)); ``` ## datafusion/common/src/dfschema.rs: ## @@ -442,22 +603,24 @@ impl DFSchema { /// Find all fields that match the given name pub fn fields_with_unqualified_name(&self, name: &str) -> Vec<&Field> { -self.fields() -.iter() -.filter(|field| field.name() == name) -.map(|f| f.as_ref()) -.collect() +let mut fields: Vec<&Field> = self.inner.fields_with_unqualified_name(name); +if let Some(schema) = self.metadata_schema() { +fields.append(&mut schema.fields_with_unqualified_name(name)); +} +fields } /// Find all fields that match the given name and return them with their qualifier pub fn qualified_fields_with_unqualified_name( &self, name: &str, ) -> Vec<(Option<&TableReference>, &Field)> { -self.iter() -.filter(|(_, field)| field.name() == name) -.map(|(qualifier, field)| (qualifier, field.as_ref())) -.collect() +let mut fields: Vec<(Option<&TableReference>, &Field)> = +self.inner.qualified_fields_with_unqualified_name(name); +if let Some(schema) = self.metadata_schema() { +fields.append(&mut schema.qualified_fields_with_unqualified_name(name)); Review Comment: ```suggestion fields.append(schema.qualified_fields_with_unqualified_name(name)); ``` ## datafusion/expr/src/logical_plan/plan.rs: ## @@ -2600,6 +2619,16 @@ impl TableScan { let df_schema = DFSchema::new_with_metadata( p.iter() .map(|i| { +if *i >= schema.fields.len() { +if let Some(metadata) = &metadata { +return ( +Some(table_name.clone()), +Arc::new( +metadata.field(*i - METADATA_OFFSET).clone(), Review Comment: handle where i < METADATA_OFFSET -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
jayzhan211 commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911887850 ## datafusion/common/src/dfschema.rs: ## @@ -319,52 +465,56 @@ impl DFSchema { qualifiers.push(qualifier.cloned()); } } -let mut metadata = self.inner.metadata.clone(); -metadata.extend(other_schema.inner.metadata.clone()); +let mut metadata = self.inner.schema.metadata.clone(); +metadata.extend(other_schema.inner.schema.metadata.clone()); let finished = schema_builder.finish(); let finished_with_metadata = finished.with_metadata(metadata); -self.inner = finished_with_metadata.into(); -self.field_qualifiers.extend(qualifiers); +self.inner.schema = finished_with_metadata.into(); +self.inner.field_qualifiers.extend(qualifiers); } /// Get a list of fields pub fn fields(&self) -> &Fields { -&self.inner.fields +&self.inner.schema.fields } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector pub fn field(&self, i: usize) -> &Field { -&self.inner.fields[i] +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.field(i - self.inner.len()); +} +} +self.inner.field(i) } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector and its qualifier pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, &Field) { -(self.field_qualifiers[i].as_ref(), self.field(i)) +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.qualified_field(i - self.inner.len()); +} +} +self.inner.qualified_field(i) Review Comment: Okay this approach looks good to me. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911868813 ## datafusion/common/src/dfschema.rs: ## @@ -319,52 +465,56 @@ impl DFSchema { qualifiers.push(qualifier.cloned()); } } -let mut metadata = self.inner.metadata.clone(); -metadata.extend(other_schema.inner.metadata.clone()); +let mut metadata = self.inner.schema.metadata.clone(); +metadata.extend(other_schema.inner.schema.metadata.clone()); let finished = schema_builder.finish(); let finished_with_metadata = finished.with_metadata(metadata); -self.inner = finished_with_metadata.into(); -self.field_qualifiers.extend(qualifiers); +self.inner.schema = finished_with_metadata.into(); +self.inner.field_qualifiers.extend(qualifiers); } /// Get a list of fields pub fn fields(&self) -> &Fields { -&self.inner.fields +&self.inner.schema.fields } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector pub fn field(&self, i: usize) -> &Field { -&self.inner.fields[i] +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.field(i - self.inner.len()); +} +} +self.inner.field(i) } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector and its qualifier pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, &Field) { -(self.field_qualifiers[i].as_ref(), self.field(i)) +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.qualified_field(i - self.inner.len()); +} +} +self.inner.qualified_field(i) Review Comment: Hi @jayzhan211 I pushed a commit, could you please review it again? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
jayzhan211 commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911863259 ## datafusion/common/src/dfschema.rs: ## @@ -319,52 +465,56 @@ impl DFSchema { qualifiers.push(qualifier.cloned()); } } -let mut metadata = self.inner.metadata.clone(); -metadata.extend(other_schema.inner.metadata.clone()); +let mut metadata = self.inner.schema.metadata.clone(); +metadata.extend(other_schema.inner.schema.metadata.clone()); let finished = schema_builder.finish(); let finished_with_metadata = finished.with_metadata(metadata); -self.inner = finished_with_metadata.into(); -self.field_qualifiers.extend(qualifiers); +self.inner.schema = finished_with_metadata.into(); +self.inner.field_qualifiers.extend(qualifiers); } /// Get a list of fields pub fn fields(&self) -> &Fields { -&self.inner.fields +&self.inner.schema.fields } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector pub fn field(&self, i: usize) -> &Field { -&self.inner.fields[i] +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.field(i - self.inner.len()); +} +} +self.inner.field(i) } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector and its qualifier pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, &Field) { -(self.field_qualifiers[i].as_ref(), self.field(i)) +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.qualified_field(i - self.inner.len()); +} +} +self.inner.qualified_field(i) Review Comment: `(-1 as usize)` how does this large offset work? We have vector instead of map -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
jayzhan211 commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911863137 ## datafusion/common/src/dfschema.rs: ## @@ -319,52 +465,56 @@ impl DFSchema { qualifiers.push(qualifier.cloned()); } } -let mut metadata = self.inner.metadata.clone(); -metadata.extend(other_schema.inner.metadata.clone()); +let mut metadata = self.inner.schema.metadata.clone(); +metadata.extend(other_schema.inner.schema.metadata.clone()); let finished = schema_builder.finish(); let finished_with_metadata = finished.with_metadata(metadata); -self.inner = finished_with_metadata.into(); -self.field_qualifiers.extend(qualifiers); +self.inner.schema = finished_with_metadata.into(); +self.inner.field_qualifiers.extend(qualifiers); } /// Get a list of fields pub fn fields(&self) -> &Fields { -&self.inner.fields +&self.inner.schema.fields } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector pub fn field(&self, i: usize) -> &Field { -&self.inner.fields[i] +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.field(i - self.inner.len()); +} +} +self.inner.field(i) } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector and its qualifier pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, &Field) { -(self.field_qualifiers[i].as_ref(), self.field(i)) +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.qualified_field(i - self.inner.len()); +} +} +self.inner.qualified_field(i) Review Comment: `(-1 as usize)` what is this trick? This is a large offset, we have vector instead of map, how does this work? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
jayzhan211 commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911863137 ## datafusion/common/src/dfschema.rs: ## @@ -319,52 +465,56 @@ impl DFSchema { qualifiers.push(qualifier.cloned()); } } -let mut metadata = self.inner.metadata.clone(); -metadata.extend(other_schema.inner.metadata.clone()); +let mut metadata = self.inner.schema.metadata.clone(); +metadata.extend(other_schema.inner.schema.metadata.clone()); let finished = schema_builder.finish(); let finished_with_metadata = finished.with_metadata(metadata); -self.inner = finished_with_metadata.into(); -self.field_qualifiers.extend(qualifiers); +self.inner.schema = finished_with_metadata.into(); +self.inner.field_qualifiers.extend(qualifiers); } /// Get a list of fields pub fn fields(&self) -> &Fields { -&self.inner.fields +&self.inner.schema.fields } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector pub fn field(&self, i: usize) -> &Field { -&self.inner.fields[i] +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.field(i - self.inner.len()); +} +} +self.inner.field(i) } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector and its qualifier pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, &Field) { -(self.field_qualifiers[i].as_ref(), self.field(i)) +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.qualified_field(i - self.inner.len()); +} +} +self.inner.qualified_field(i) Review Comment: `(-1 as usize)` what is this trick? This is a large offset, we have vector instead of map, how does this work? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911814285 ## datafusion/common/src/dfschema.rs: ## @@ -319,52 +465,56 @@ impl DFSchema { qualifiers.push(qualifier.cloned()); } } -let mut metadata = self.inner.metadata.clone(); -metadata.extend(other_schema.inner.metadata.clone()); +let mut metadata = self.inner.schema.metadata.clone(); +metadata.extend(other_schema.inner.schema.metadata.clone()); let finished = schema_builder.finish(); let finished_with_metadata = finished.with_metadata(metadata); -self.inner = finished_with_metadata.into(); -self.field_qualifiers.extend(qualifiers); +self.inner.schema = finished_with_metadata.into(); +self.inner.field_qualifiers.extend(qualifiers); } /// Get a list of fields pub fn fields(&self) -> &Fields { -&self.inner.fields +&self.inner.schema.fields } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector pub fn field(&self, i: usize) -> &Field { -&self.inner.fields[i] +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.field(i - self.inner.len()); +} +} +self.inner.field(i) } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector and its qualifier pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, &Field) { -(self.field_qualifiers[i].as_ref(), self.field(i)) +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.qualified_field(i - self.inner.len()); +} +} +self.inner.qualified_field(i) Review Comment: I see. it's error prone. Can we change the offsets of metadata columns, e.g. (-1 as usize) (-2 as usize) then there's no such problem. I see some databases use this trick. > Isn't only where you need meta columns you need to change the code with meta_field? Others code that call with field remain the same. yes, we can. but many apis use Vec to represent columns. I have to change many structs and method defnitions to pass extra parameters. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911814285 ## datafusion/common/src/dfschema.rs: ## @@ -319,52 +465,56 @@ impl DFSchema { qualifiers.push(qualifier.cloned()); } } -let mut metadata = self.inner.metadata.clone(); -metadata.extend(other_schema.inner.metadata.clone()); +let mut metadata = self.inner.schema.metadata.clone(); +metadata.extend(other_schema.inner.schema.metadata.clone()); let finished = schema_builder.finish(); let finished_with_metadata = finished.with_metadata(metadata); -self.inner = finished_with_metadata.into(); -self.field_qualifiers.extend(qualifiers); +self.inner.schema = finished_with_metadata.into(); +self.inner.field_qualifiers.extend(qualifiers); } /// Get a list of fields pub fn fields(&self) -> &Fields { -&self.inner.fields +&self.inner.schema.fields } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector pub fn field(&self, i: usize) -> &Field { -&self.inner.fields[i] +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.field(i - self.inner.len()); +} +} +self.inner.field(i) } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector and its qualifier pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, &Field) { -(self.field_qualifiers[i].as_ref(), self.field(i)) +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.qualified_field(i - self.inner.len()); +} +} +self.inner.qualified_field(i) Review Comment: I see. it's error prone. Can we change the offsets of metadata columns, e.g. (-1 as usize) (-2 as usize) then there's no such problem. I see some databases use this trick. > Isn't only where you need meta columns you need to change the code with meta_field? Others code that call with field remain the same. yes, we can. but many api use Vec to represent columns. I have to change many structs and method defnitions to pass extra parameters. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
jayzhan211 commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911800158 ## datafusion/common/src/dfschema.rs: ## @@ -319,52 +465,56 @@ impl DFSchema { qualifiers.push(qualifier.cloned()); } } -let mut metadata = self.inner.metadata.clone(); -metadata.extend(other_schema.inner.metadata.clone()); +let mut metadata = self.inner.schema.metadata.clone(); +metadata.extend(other_schema.inner.schema.metadata.clone()); let finished = schema_builder.finish(); let finished_with_metadata = finished.with_metadata(metadata); -self.inner = finished_with_metadata.into(); -self.field_qualifiers.extend(qualifiers); +self.inner.schema = finished_with_metadata.into(); +self.inner.field_qualifiers.extend(qualifiers); } /// Get a list of fields pub fn fields(&self) -> &Fields { -&self.inner.fields +&self.inner.schema.fields } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector pub fn field(&self, i: usize) -> &Field { -&self.inner.fields[i] +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.field(i - self.inner.len()); +} +} +self.inner.field(i) } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector and its qualifier pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, &Field) { -(self.field_qualifiers[i].as_ref(), self.field(i)) +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.qualified_field(i - self.inner.len()); +} +} +self.inner.qualified_field(i) Review Comment: Isn't only where you need meta columns you need to change the code with `meta_field`? Others code that call with `field` remain the same. The downside of the current approach is that whenever the schema is changed, the index of meta columns need to adjust too. I think this is error prone. Minimize the dependency of meta schema and schema is better -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
jayzhan211 commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911800158 ## datafusion/common/src/dfschema.rs: ## @@ -319,52 +465,56 @@ impl DFSchema { qualifiers.push(qualifier.cloned()); } } -let mut metadata = self.inner.metadata.clone(); -metadata.extend(other_schema.inner.metadata.clone()); +let mut metadata = self.inner.schema.metadata.clone(); +metadata.extend(other_schema.inner.schema.metadata.clone()); let finished = schema_builder.finish(); let finished_with_metadata = finished.with_metadata(metadata); -self.inner = finished_with_metadata.into(); -self.field_qualifiers.extend(qualifiers); +self.inner.schema = finished_with_metadata.into(); +self.inner.field_qualifiers.extend(qualifiers); } /// Get a list of fields pub fn fields(&self) -> &Fields { -&self.inner.fields +&self.inner.schema.fields } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector pub fn field(&self, i: usize) -> &Field { -&self.inner.fields[i] +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.field(i - self.inner.len()); +} +} +self.inner.field(i) } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector and its qualifier pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, &Field) { -(self.field_qualifiers[i].as_ref(), self.field(i)) +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.qualified_field(i - self.inner.len()); +} +} +self.inner.qualified_field(i) Review Comment: Isn't only where you need meta columns you need to change the code with `meta_field`? Others code that call with `field` remain the same. The downside of the current approach is that whenever the schema is changed, the index of meta columns need to adjust too. I think this is error prone. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
chenkovsky commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1909953573 ## datafusion/common/src/dfschema.rs: ## @@ -319,52 +465,56 @@ impl DFSchema { qualifiers.push(qualifier.cloned()); } } -let mut metadata = self.inner.metadata.clone(); -metadata.extend(other_schema.inner.metadata.clone()); +let mut metadata = self.inner.schema.metadata.clone(); +metadata.extend(other_schema.inner.schema.metadata.clone()); let finished = schema_builder.finish(); let finished_with_metadata = finished.with_metadata(metadata); -self.inner = finished_with_metadata.into(); -self.field_qualifiers.extend(qualifiers); +self.inner.schema = finished_with_metadata.into(); +self.inner.field_qualifiers.extend(qualifiers); } /// Get a list of fields pub fn fields(&self) -> &Fields { -&self.inner.fields +&self.inner.schema.fields } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector pub fn field(&self, i: usize) -> &Field { -&self.inner.fields[i] +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.field(i - self.inner.len()); +} +} +self.inner.field(i) } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector and its qualifier pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, &Field) { -(self.field_qualifiers[i].as_ref(), self.field(i)) +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.qualified_field(i - self.inner.len()); +} +} +self.inner.qualified_field(i) Review Comment: actually implementing another method was my first attempt. but I found that I need to change a lot of code, because column index is used everywhere. that's why in currently implementation metadata column has index + len(fields). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
jayzhan211 commented on code in PR #14057: URL: https://github.com/apache/datafusion/pull/14057#discussion_r1909641034 ## datafusion/common/src/dfschema.rs: ## @@ -319,52 +465,56 @@ impl DFSchema { qualifiers.push(qualifier.cloned()); } } -let mut metadata = self.inner.metadata.clone(); -metadata.extend(other_schema.inner.metadata.clone()); +let mut metadata = self.inner.schema.metadata.clone(); +metadata.extend(other_schema.inner.schema.metadata.clone()); let finished = schema_builder.finish(); let finished_with_metadata = finished.with_metadata(metadata); -self.inner = finished_with_metadata.into(); -self.field_qualifiers.extend(qualifiers); +self.inner.schema = finished_with_metadata.into(); +self.inner.field_qualifiers.extend(qualifiers); } /// Get a list of fields pub fn fields(&self) -> &Fields { -&self.inner.fields +&self.inner.schema.fields } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector pub fn field(&self, i: usize) -> &Field { -&self.inner.fields[i] +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.field(i - self.inner.len()); +} +} +self.inner.field(i) } /// Returns an immutable reference of a specific `Field` instance selected using an /// offset within the internal `fields` vector and its qualifier pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, &Field) { -(self.field_qualifiers[i].as_ref(), self.field(i)) +if i >= self.inner.len() { +if let Some(metadata) = &self.metadata { +return metadata.qualified_field(i - self.inner.len()); +} +} +self.inner.qualified_field(i) Review Comment: Is it better not to mix inner field and meta field? maybe we need another method `meta_field(&self, i: usize)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org