Re: [PR] feat: metadata columns [datafusion]

2025-05-01 Thread via GitHub


alamb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2845795660

   Thank you @chenkovsky and @adriangb  and everyone else who worked on this PR.
   
   I think the idea of additional metadata columns for LIstingTable provider is 
super valuable but it seems like this particular PR adds more complexity than 
we are willing to handle in the core. 
   
   I wonder if we could/should create a more full featured equivalent of 
LIstingTable in another repo / project -- I think there are many interesting 
ideas / potential things to add -- for example 
   - https://github.com/apache/datafusion/issues/15892
   - https://github.com/apache/datafusion/pull/15181 from @phillipleblanc 
   
   I apologize for letting it sit for so long. This is non ideal to be sure. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-13 Thread via GitHub


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2656813742

   yep I agree. TLDR Andrew is:
   1. There are two approaches, #14057 (this PR) and  #14362.
   2. Fundamentally it is not super clear how to define the feature set of a 
"system column". This PR approaches it from the view point of making it behave 
as similar as possible to Spark. #14362 specifies that if a column has a 
certain key set in it's `Field`'s metadata it will not be expanded in `SELECT 
*` (it currently does some other stuff with that metadata, e.g. disambiguating 
column name clashes, I'm ambivalent about that part). These approaches end up 
differing in several places. For example as @chenkovsky pointed out, in #14362 
if you add the metadata to a field and then write it to a file that metadata is 
preserved and that column will act as a system column if you do `select * from 
'file.parquet'`. That's not necessarily a bad thing but as @chenkovsky has 
pointed out that differs from how Spark handles it's system columns.
   3. This PR requires somewhat invasive changes (my opinion) into `DFSchema`, 
including changing how field indexes work, which seemed a bit scary to me hence 
why I wanted to experiment an alternative approach in #14362.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-13 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2656443184

   > I am sorry I have not been following along this PR closely -- is it ready 
for a final review @chenkovsky and @adriangb ?
   
   @alamb The discussion was very lively. 
   
   from my opinion, the pros of this PR are: dataframe api friendly, spark 
compatible, no need to worry about metadata/system column leakage. for #14362, 
it makes an assumption on project plan which is not correct for dataframe api, 
and it should check schema of user input file, otherwise the magic metadata 
dict will be loaded from file. and it should pay more attention to prevent 
metadata/system column leakage. in this PR, there are no such problems. In part 
I agree with @adriangb, If I have another choice, I won't use offset. But I 
think there is no better option.
   
   we also want to hear your opinions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-13 Thread via GitHub


alamb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2656327207

   I am sorry I have not been following along this PR closely -- is it ready 
for a final review @chenkovsky and @adriangb ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-09 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2646922625

   > > when I save a table to csv, it will also save rowid into csv. no system 
will do like this.
   > 
   > My problem is with this statement. I don't think there's a universal 
definition and use case for "system columns". Spark has one. Postgres has 
another. Our system has another.
   
   sorry for my arbitrary conclusion, 
   I have to add a qualifier to it, "from my dataframe api experience". I 
learned a lot of postgres from you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-09 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2646850170

   > > when I save a table to csv, it will also save rowid into csv. no system 
will do like this.
   > 
   > My problem is with this statement. I don't think there's a universal 
definition and use case for "system columns". Spark has one. Postgres has 
another. Our system has another.
   
   first, the assumption of projection is not correct in dataframe, right?
   
   I agree with you. there's no universal definition. 
   
   But If I'm a logical plan implementer.
   
   in this PR, if I want to support system column, I can create another logical 
plan. If I don't want to support system column. I even don't need to take care 
of what is system column.
   
   But in #14362 , If I want to support system column, it's easy, because it's 
propagated automatically, but If I don't want to support system column, I have 
to take care to erase it, e.g. CopyTo. at least this is counterintuitive for 
me. Why do I have to take care of something I don't support? 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-09 Thread via GitHub


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2646833158

   > when I save a table to csv, it will also save rowid into csv. no system 
will do like this.
   
   My problem is with this statement. I don't think there's a universal 
definition and use case for "system columns". Spark has one. Postgres has 
another. Our system has another.
   
   You use `_rowid` as an example. Is that the `_rowid` within a single file? 
Or is that the `_rowid` of the entire table (similar to Postgres' `ctid`)? I 
think it's reasonable for both to exist and for both to be considered system 
columns. The former does somewhat "loose" it's meaning when copied through a 
query from one file to another and it only really makes sense to generate it 
dynamically when reading a file. The latter could be copied from one file to 
another without issues.
   
   In our case we use system columns to speed up access to JSON: we take a row 
with json data such as `json_col: text = [{"a": 1, "b": "lorem"}, {"a": 2}]` 
and split it into `_lf__json_col__a: int = [1, 2]` and `__lf__json_col__b:  
text = ["lorem", null]`.  This is well known technique, it's basically [what 
ClickHouse 
does](https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse).
 We write these to files (they are not dynamically generated) and want them to 
be treated as normal columns when reading/writing. We just don't want them to 
show up when a user does `select *`. Is this not a valid use case for system 
columns?
   
   My thought is to establish a piece of metadata marking a column as a system 
column with the implementation doing nothing beyond excluding them from `select 
*` unless they are explicitly included. That seems to me like a universally 
agreed upon thing to do with system columns. Anything else that is not part of 
a universal definition of a system column is IMO something that should be 
implemented system by system by rewriting logical plans, customizing reading 
and writing, etc. Having it as field metadata means this information should be 
accessible from most hook points in DataFusion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-09 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2646781160

   > > if a system/metadata column is not selected in project, after project, 
in spark system/metadata column's still a valid system/metadata column, but 
it's not that in #14362
   > 
   > Is this specific to Spark, or do most systems work this way? We should 
offer customizable options so users can choose their preference—unless this 
behavior is standard across most systems.
   
   I think both in this PR and #14362 are customizable, but it's harder in 
#14362, because many behaviors depends on the assumption "there is always a 
projection in the first pass". @jayzhan211 could you please also see the UT I 
put in #14362. you can see that this assumption is not correct in dataframe 
api. and you can also see the rowid save load problem. when I save a table to 
csv, it will also save rowid into csv. no system will do like this.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-09 Thread via GitHub


jayzhan211 commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2646732851

   > if a system/metadata column is not selected in project, after project, in 
spark system/metadata column's still a valid system/metadata column, but it's 
not that in https://github.com/apache/datafusion/pull/14362
   
   Is this specific to Spark, or do most systems work this way? We should offer 
customizable options so users can choose their preference—unless this behavior 
is standard across most systems.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-08 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2644943280

   for #14362 ,if we want to make it feasible, I think at least it should check 
metadata dict loaded from file for every format. (e.g. parquet), otherwise the 
whole system may be broken if it reads a file that contains magic meta. Is it 
right?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-07 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2644417122

   > > > > for stopping system column propagation, have you tested other 
logical plans e.g. union intersect?
   > > > 
   > > > 
   > > > I have not tried constructing logical plans directly. My thought was 
that even if you do some sort of union you'll still have a projection. For 
example:
   > > > ```sql
   > > > select *, _rowid
   > > > from t1
   > > > union all
   > > > select *, _rowid
   > > > from t2
   > > > ```
   > > > 
   > > > 
   > > > 
   > > >   
   > > > 
   > > > 
   > > >   
   > > > 
   > > > 
   > > > 
   > > >   
   > > > In this case `_rowid` would come out as a "regular" column. If you did:
   > > > ```python
   > > > select *
   > > > from t1
   > > > union all
   > > > select *
   > > > from t2
   > > > ```
   > > > 
   > > > 
   > > > 
   > > >   
   > > > 
   > > > 
   > > >   
   > > > 
   > > > 
   > > > 
   > > >   
   > > > You get no `_rowid` column in the output at all.
   > > > I'm sure you can create funky stuff by constructing a plan without a 
projection, but I don't think that happens in the real world even with the 
optimizer: there is always a projection in the first pass and only after that 
is evaluated would it get pushed down to a tablescan.
   > > > ```python
   > > > > SET datafusion.optimizer.max_passes = 0;
   > > > 0 row(s) fetched.
   > > > Elapsed 0.001 seconds.
   > > > 
   > > > > create table t (x int, y int);
   > > > 0 row(s) fetched.
   > > > Elapsed 0.002 seconds.
   > > > 
   > > > > explain
   > > > select *
   > > > from t;
   > > > +---+---+
   > > > | plan_type | plan  |
   > > > +---+---+
   > > > | logical_plan  | Projection: t.x, t.y  |
   > > > |   |   TableScan: t|
   > > > | physical_plan | MemoryExec: partitions=1, partition_sizes=[0] |
   > > > |   |   |
   > > > +---+---+
   > > > 2 row(s) fetched.
   > > > Elapsed 0.006 seconds.
   > > > 
   > > > > explain
   > > > select x
   > > > from t;
   > > > +---+---+
   > > > | plan_type | plan  |
   > > > +---+---+
   > > > | logical_plan  | Projection: t.x   |
   > > > |   |   TableScan: t|
   > > > | physical_plan | MemoryExec: partitions=1, partition_sizes=[0] |
   > > > |   |   |
   > > > +---+---+
   > > > ```
   > > > 
   > > > 
   > > > 
   > > >   
   > > > 
   > > > 
   > > >   
   > > > 
   > > > 
   > > > 
   > > >   
   > > > So I don't think we have to worry about a plan without a projection as 
long as a plan with a projection correctly evaluates wildcards.
   > > 
   > > 
   > > as I previously asked, in your implementation "a system column stops 
being a system column once it's projected" ?
   > > If this is correct, then as you said there's no need to add more UTs.
   > > I have to call out it seems that this behavior is incompatible with 
Spark. I know whether follow Spark's standard is another problem. but community 
should be aware of this.
   > 
   > Do you know the rationale for spark that why it doesn't consider projected 
columns as normal column?
   
   I dont know why  it doesn't consider metadata/system column as normal 
column. 
   
   from my opnion, normal column can be propagated automatically. but 
metadata/system column should have another propagate strategy, so distinguish 
them is a correct way.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-07 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2644409227

   > Do you know the rationale for spark that why it doesn't consider projected 
columns as normal column?
   
   If my memory is correct, it's also designed for dataframe api user. for 
`withColumn` clause, it's a project.
   dataframe users often chain withColumn operators.
   
   ```python
   df.withColumn(...).withColumn(...)
   ```
   
   If project cannot propagate metadata/system columns, it's very hard to use 
this chain.
   
   I cannot tell are there any other differences between spark standard and 
#14362 's standard because most sqls will have at least one project.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-07 Thread via GitHub


jayzhan211 commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2644397522

   > > > for stopping system column propagation, have you tested other logical 
plans e.g. union intersect?
   > > 
   > > I have not tried constructing logical plans directly. My thought was 
that even if you do some sort of union you'll still have a projection. For 
example:
   > > 
   > > ```sql
   > > select *, _rowid
   > > from t1
   > > union all
   > > select *, _rowid
   > > from t2
   > > ```
   > > 
   > > In this case `_rowid` would come out as a "regular" column. If you did:
   > > 
   > > ```python
   > > select *
   > > from t1
   > > union all
   > > select *
   > > from t2
   > > ```
   > > 
   > > You get no `_rowid` column in the output at all.
   > > 
   > > I'm sure you can create funky stuff by constructing a plan without a 
projection, but I don't think that happens in the real world even with the 
optimizer: there is always a projection in the first pass and only after that 
is evaluated would it get pushed down to a tablescan.
   > > 
   > > ```python
   > > > SET datafusion.optimizer.max_passes = 0;
   > > 0 row(s) fetched.
   > > Elapsed 0.001 seconds.
   > > 
   > > > create table t (x int, y int);
   > > 0 row(s) fetched.
   > > Elapsed 0.002 seconds.
   > > 
   > > > explain
   > > select *
   > > from t;
   > > +---+---+
   > > | plan_type | plan  |
   > > +---+---+
   > > | logical_plan  | Projection: t.x, t.y  |
   > > |   |   TableScan: t|
   > > | physical_plan | MemoryExec: partitions=1, partition_sizes=[0] |
   > > |   |   |
   > > +---+---+
   > > 2 row(s) fetched.
   > > Elapsed 0.006 seconds.
   > > 
   > > > explain
   > > select x
   > > from t;
   > > +---+---+
   > > | plan_type | plan  |
   > > +---+---+
   > > | logical_plan  | Projection: t.x   |
   > > |   |   TableScan: t|
   > > | physical_plan | MemoryExec: partitions=1, partition_sizes=[0] |
   > > |   |   |
   > > +---+---+
   > > ```
   > > 
   > > So I don't think we have to worry about a plan without a projection as 
long as a plan with a projection correctly evaluates wildcards.
   > 
   > as I previously asked, in your implementation "a system column stops being 
a system column once it's projected" ? 
   > If this is correct, then as you said there's no need to add more UTs.
   > 
   > I have to call out it seems that this behavior is incompatible with Spark. 
I know whether follow Spark's standard is another problem. but community should 
be aware of this.
   > 
   > 
   > 
   
   Do you know the rationale for spark that why it doesn't consider projected 
columns as normal column?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-07 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2644389349

   > 
   > as I previously asked, in your implementation "a system column stops being 
a system column once it's projected" ? If this is correct, then as you said 
there's no need to add more UTs.
   > 
   > I have to call out it seems that this behavior is incompatible with Spark. 
I know whether follow Spark's standard is another problem. but community should 
be aware of this.
   
   I have to revoke my judgements for #14362  from metadata/system propagation 
side, because previously judgements are based on the assumption that difference 
between two approaches is just how to transmit the information, the goal is 
same. but it seems that it's not true. #14362 has own propagation rules. It's 
really hard for me to talk about a totally different thing. let's look pros and 
cons directly.
   
   pros of this approach:
   1. dataframe api friendly. There's no chance to hurt themself for dataframe 
api user.
   2. Spark compatible. Spark has already been battle tested in many areas, 
it's design to be compatible with many different data sources and data sinks. 
So there's fewer unknown problems.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-07 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2643492241

   > > for stopping system column propagation, have you tested other logical 
plans e.g. union intersect?
   > 
   > I have not tried constructing logical plans directly. My thought was that 
even if you do some sort of union you'll still have a projection. For example:
   > 
   > ```sql
   > select *, _rowid
   > from t1
   > union all
   > select *, _rowid
   > from t2
   > ```
   > 
   > In this case `_rowid` would come out as a "regular" column. If you did:
   > 
   > ```python
   > select *
   > from t1
   > union all
   > select *
   > from t2
   > ```
   > 
   > You get no `_rowid` column in the output at all.
   > 
   > I'm sure you can create funky stuff by constructing a plan without a 
projection, but I don't think that happens in the real world even with the 
optimizer: there is always a projection in the first pass and only after that 
is evaluated would it get pushed down to a tablescan.
   > 
   > ```python
   > > SET datafusion.optimizer.max_passes = 0;
   > 0 row(s) fetched.
   > Elapsed 0.001 seconds.
   > 
   > > create table t (x int, y int);
   > 0 row(s) fetched.
   > Elapsed 0.002 seconds.
   > 
   > > explain
   > select *
   > from t;
   > +---+---+
   > | plan_type | plan  |
   > +---+---+
   > | logical_plan  | Projection: t.x, t.y  |
   > |   |   TableScan: t|
   > | physical_plan | MemoryExec: partitions=1, partition_sizes=[0] |
   > |   |   |
   > +---+---+
   > 2 row(s) fetched.
   > Elapsed 0.006 seconds.
   > 
   > > explain
   > select x
   > from t;
   > +---+---+
   > | plan_type | plan  |
   > +---+---+
   > | logical_plan  | Projection: t.x   |
   > |   |   TableScan: t|
   > | physical_plan | MemoryExec: partitions=1, partition_sizes=[0] |
   > |   |   |
   > +---+---+
   > ```
   > 
   > So I don't think we have to worry about a plan without a projection as 
long as a plan with a projection correctly evaluates wildcards.
   
   as I previously asked, in your implementation "a system column stops being a 
system column once it's projected" ? 
   If this is correct, then as you said there's no need to add more UTs.
   
   I have to call out it seems that this behavior is incompatible with Spark. I 
know whether follow Spark's standard is another problem. but community should 
be aware of this.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-07 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2643455108

   > > for _rowid save load problem. currently, data engineers have to write a 
with clause in #14362
   > 
   > why do they have to write a `with` clause? It's not clear to me what 
operations we are talking about but if it's copying data from one table to 
another:
   > 
   > ```sql
   > INSERT INTO t1
   > SELECT * FROM t2;
   > ```
   > 
   > Will not include system columns from t2. If they do:
   > 
   > ```sql
   > INSERT INTO t1
   > SELECT *, _rowid FROM t2;
   > ```
   > 
   > Then yes this will attempt to insert `_rowid` into `t1` which is maybe 
meaningless but if the user explicitly tried to do that... well we shouldn't 
stop them.
   
   I must say the premise again. just the scenario when data engineers use 
datafusion as compute engine and want to read and write parquet file. I know 
there's no problem when we do insert.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-07 Thread via GitHub


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2643295786

   > for _rowid save load problem. currently, data engineers have to write a 
with clause in https://github.com/apache/datafusion/pull/14362
   
   why do they have to write a `with` clause? It's not clear to me what 
operations we are talking about but if it's copying data from one table to 
another:
   
   ```sql
   INSERT INTO t1
   SELECT * FROM t2;
   ```
   
   Will not include system columns from t2. If they do:
   
   ```sql
   INSERT INTO t1
   SELECT *, _rowid FROM t2;
   ```
   
   Then yes this will attempt to insert `_rowid` into `t1` which is maybe 
meaningless but if the user explicitly tried to do that... well we shouldn't 
stop them.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-07 Thread via GitHub


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2643284573

   > for stopping system column propagation, have you tested other logical 
plans e.g. union intersect?
   
   I have not tried constructing logical plans directly. My thought was that 
even if you do some sort of union you'll still have a projection. For example:
   
   ```sql
   select *, _rowid
   from t1
   union all
   select *, _rowid
   from t2
   ```
   
   In this case `_rowid` would come out as a "regular" column. If you did:
   
   ```python
   select *
   from t1
   union all
   select *
   from t2
   ```
   
   You get no `_rowid` column in the output at all.
   
   I'm sure you can create funky stuff by constructing a plan without a 
projection, but I don't think that happens in the real world even with the 
optimizer: there is always a projection in the first pass and only after that 
is evaluated would it get pushed down to a tablescan.
   
   ```python
   > SET datafusion.optimizer.max_passes = 0;
   0 row(s) fetched.
   Elapsed 0.001 seconds.
   
   > create table t (x int, y int);
   0 row(s) fetched.
   Elapsed 0.002 seconds.
   
   > explain
   select *
   from t;
   +---+---+
   | plan_type | plan  |
   +---+---+
   | logical_plan  | Projection: t.x, t.y  |
   |   |   TableScan: t|
   | physical_plan | MemoryExec: partitions=1, partition_sizes=[0] |
   |   |   |
   +---+---+
   2 row(s) fetched.
   Elapsed 0.006 seconds.
   
   > explain
   select x
   from t;
   +---+---+
   | plan_type | plan  |
   +---+---+
   | logical_plan  | Projection: t.x   |
   |   |   TableScan: t|
   | physical_plan | MemoryExec: partitions=1, partition_sizes=[0] |
   |   |   |
   +---+---+
   ```
   
   So I don't think we have to worry about a plan without a projection as long 
as a plan with a projection correctly evaluates wildcards.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-07 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2642557135

   > @chenkovsky You mentioned that the issues of #14362 are 1) duplicated 
field issues 2) HashMap
   > 
   > I think overall datafusion only care about the system columns that are 
generated by datafusion, other system columns from other engine should be 
considered normal columns, but since this is just based on my guess not from 
any practical experience, is there any concern of this assumption?
   > 
   > For HashMap, I don't think it has performance issue since we only check 
boolean from it and we don't need to access it frequently given the field 
should be fixed once created.
   
   @jayzhan211 
   
   datafusion supports loading files, without such feature, of course, we don't 
need to take care about this. but with this feature we have to take care about 
this. 
   
   and could you please also see the discussion in #14362. we have more 
different opinions.
   
   for example, system/metadata column propagation problem. 
   
   for _rowid save load problem. currently, data engineer have to write a with 
clause in #14362 . when using dataframe api, data engineer also have to take 
more care about metadata dict in #14362. I haven't seen such behavior in other 
systems. It adds a lot of burden to data engineer. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-07 Thread via GitHub


jayzhan211 commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2642508907

   @chenkovsky You mentioned that the issues of #14362 are 1) duplicated field 
issues 2) HashMap
   
   I think overall datafusion only care about the system columns that are 
generated by datafusion, other system columns from other engine should be 
considered normal columns, but since this is just based on my guess not from 
any practical experience, is there any concern of this assumption?
   
   For HashMap, I don't think it has performance issue since we only check 
boolean from it and we don't need to access it frequently given the field 
should be fixed once created.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-06 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2641808013

   > > would you mind to add some tests about stopping system column 
propagation? I haven't seen them on your branch?
   > 
   > have you seen these?
   > 
   > 
https://github.com/apache/datafusion/blob/af6e9725965938e68a6f401dd4ef7fe9eed0c6f1/datafusion/core/tests/sql/system_columns.rs#L318-L376
   
   so for _row_id save load problem. so in your implementation "a system column 
stops being a system column once it's projected" ?
   
   for stopping system column propagation, do you tested other logical plans 
e.g. union intersect? 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-06 Thread via GitHub


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2641766867

   > would you mind to add some tests about stopping system column propagation? 
I haven't seen them on your branch?
   
   have you seen these? 
https://github.com/apache/datafusion/blob/af6e9725965938e68a6f401dd4ef7fe9eed0c6f1/datafusion/core/tests/sql/system_columns.rs#L318-L376


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-06 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2641751804

   > @chenkovsky the main difference between the two approaches is how to 
transmit the information on which columns are system columns and which aren't. 
The approach in this PR does it explicitly by modifying `DFSchema`, 
`TableProvider` and a couple other spots and also manipulating the meaning of 
field indexes in `DFSchema`. The approach in #14362 does it by adding metadata 
to `Field`. They both work but each have pros and cons.
   > 
   > How about we check with @alamb, @Omega359 and @jayzhan211 what they think 
sounds best?
   > 
   > Either way I still think we should name these `system columns` not 
`metadata columns` to avoid confusion with `DFScheama::metadata_schema` and 
`DFSchema::metadata` meaning two very different things, etc.
   
   @adriangb 
   
   of course, I want to listen other's opinions. and I also think name is a 
small thing. changing to system column are also ok.
   
   besides _rowid save and load problem. before compare pros and cons, would 
you mind to add some tests about stopping system column propagation? I haven't 
seen them on your branch? 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-06 Thread via GitHub


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2641172445

   @chenkovsky the main difference between the two approaches is how to 
transmit the information on which columns are system columns and which aren't. 
The approach in this PR does it explicitly by modifying `DFSchema`, 
`TableProvider` and a couple other spots and also manipulating the meaning of 
field indexes in `DFSchema`. The approach in #14362 does it by adding metadata 
to `Field`. They both work but each have pros and cons.
   
   How about we check with @alamb, @Omega359 and @jayzhan211 what they think 
sounds best?
   
   Either way I still think we should name these `system columns` not `metadata 
columns` to avoid confusion with `DFScheama::metadata_schema` and 
`DFSchema::metadata` meaning two very different things, etc.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-05 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2636206099

   > ( I'd like to see some tests against `SchemaAdapter`)
   
   @adriangb feel free to correct me, I know maybe I'm wrong. it seems that 
schema adapater has no relationship with metadata column. metadata column only 
takes effect on logical plan and physical plan. after selected as recordbatch, 
no need to distinguish metadata column from normal column.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-02 Thread via GitHub


chenkovsky commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1938650952


##
datafusion/expr/src/logical_plan/plan.rs:
##
@@ -370,26 +370,6 @@ impl LogicalPlan {
 }
 }
 
-/// Gather the schema representating the metadata columns that this plan 
outputs.
-/// This is done by recursively traversing the plan and collecting the 
metadata columns that are output
-/// from inner nodes.
-/// See 
[TableProvider](../catalog/trait.TableProvider.html#method.metadata_columns) 
for more information on metadata columns in general.

Review Comment:
   > I don't seem to be able to find that reference in this PR ... was this 
added elsewhere in a PR that has yet to be merged?
   
   @Omega359 also please review latest codes, it's removed already.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-02 Thread via GitHub


chenkovsky commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1938650383


##
datafusion/core/src/execution/session_state.rs:
##
@@ -1330,7 +1330,7 @@ impl SessionStateBuilder {
 /// let url = Url::try_from("file://").unwrap();
 /// let object_store = object_store::local::LocalFileSystem::new();
 /// let state = SessionStateBuilder::new()
-/// .with_config(SessionConfig::new())

Review Comment:
   Hi, @Omega359  could you please review latested commits. I think these are 
already reverted before.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-02 Thread via GitHub


chenkovsky commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1938514464


##
datafusion/expr/src/logical_plan/plan.rs:
##
@@ -370,26 +370,6 @@ impl LogicalPlan {
 }
 }
 
-/// Gather the schema representating the metadata columns that this plan 
outputs.
-/// This is done by recursively traversing the plan and collecting the 
metadata columns that are output
-/// from inner nodes.
-/// See 
[TableProvider](../catalog/trait.TableProvider.html#method.metadata_columns) 
for more information on metadata columns in general.

Review Comment:
   sorry my fault . i will delete these lines. please ignore them



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-02 Thread via GitHub


Omega359 commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1938512422


##
datafusion/expr/src/logical_plan/plan.rs:
##
@@ -370,26 +370,6 @@ impl LogicalPlan {
 }
 }
 
-/// Gather the schema representating the metadata columns that this plan 
outputs.
-/// This is done by recursively traversing the plan and collecting the 
metadata columns that are output
-/// from inner nodes.
-/// See 
[TableProvider](../catalog/trait.TableProvider.html#method.metadata_columns) 
for more information on metadata columns in general.

Review Comment:
   I don't seem to be able to find that reference in this PR ... was this added 
elsewhere in a PR that has yet to be merged?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-02 Thread via GitHub


Omega359 commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1938511635


##
datafusion/core/src/execution/session_state.rs:
##
@@ -1330,7 +1330,7 @@ impl SessionStateBuilder {
 /// let url = Url::try_from("file://").unwrap();
 /// let object_store = object_store::local::LocalFileSystem::new();
 /// let state = SessionStateBuilder::new()
-/// .with_config(SessionConfig::new())

Review Comment:
   The changes in this file add no value and should be reverted.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-02 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2629393602

   @adriangb  as I have said, it seems that you are thinking about this from 
database side, I'm talking about compute engine problem. the users i mean is 
big data engineer. changing metadata dict is very easy through dataframe api. 
compute engine should not make any assumption on input data. BTW,  for _rowid 
save and load problem, you have a solution now?
   
   > field indexes being contiguous
   
   If you have read the discussion history of this PR. in my initial implement, 
field index is contiguous. METADATA_OFFSET is not the key of this design. In my 
initial design, metadata columns are appended at the end of normal column array 
virtually.  If everyone think METADATA_OFFSET is evil, It's easy to revert it 
back. that's also why I didn't implement metadata column support for other 
logical plan. I want to hear more ideas.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-01 Thread via GitHub


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2629230026

   > I don't know whether you agree, when make a design, every component should 
do one thing, and do it well. reuse metadata map violates this, it takes two 
roles.
   
   I have to disagree on that. Field metadata is a hook point to do these sorts 
of things without having to pipe major code changes throughout the entire 
codebase. I think this *is* the use case for field metadata.
   
   > what makes things worse is that this map is mutable by user.
   
   Who do you consider the "user" in this scenario? I am a system implementer 
and a user of DataFusion. By design and necessity I edit metadata on fields 
(e.g. to indicate a UTF8 columns is JSON data). The users of the system I 
implement do not edit field metadata in my system. Maybe you're coming at it 
from a different perspective of "user" that I'm not understanding?
   
   > but for metadata column or system column, we wish it's const for every 
data source.
   
   Maybe but I don't see how it's any different for a `TableProvider` to 
declare which columns are system columns via a new method on 
`TableProvider::system_columns_schema` vs adding metadata to fields returned 
from `TableProvider::schema`.
   
   Ultimately I think using field metadata will result in a smaller change in 
terms of LOC, less new methods and other API changes in DataFusion, will be 
less likely to break DataFusion implementers code (e.g. because they make 
assumptions about field indexes being contiguous; I'd like to see some tests 
against `SchemaAdapter`) and will be easier to retrofit into existing systems 
with system/metadata columns.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-01 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2629206595

   > Great let's fix those unit tests on your branch then we can look at the 
pros/cons of the approaches we've come up with.
   
   I don't know whether you agree, when make a design, every component should 
do one thing, and do it well. reuse metadata map violates this, it takes two 
roles. what makes things worse is that this map is mutable by user. but for 
metadata column or system column, we wish it's const for every data source.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-01 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2629205242

   @alamb @adriangb please review it again. I copied some tests from @adriangb 
and some tests from spark,  supported join, project, subqueryalias and 
dataframe api.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-01 Thread via GitHub


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2629001039

   Great let's fix those unit tests on your branch then we can look at the 
pros/cons of the approaches we've come up with.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-01 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2628928295

   > Update here is we are very close to cutting the 45 release branch. See
   > 
   > * More details on [Release DataFusion `45.0.0` #14008 
(comment)](https://github.com/apache/datafusion/issues/14008#issuecomment-2628923232)
   > 
   > Once we do that let's plan to have this PR ready to merge.
   > 
   > Thanks again @chenkovsky and @adriangb
   
   Yes, thanks @adriangb, you also prepared nice UTs for scenario that I have 
not covered. I'm fixing these UTs now.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-02-01 Thread via GitHub


alamb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2628924834

   Update here is we are very close to cutting the 45 release branch. See
   - More details on 
https://github.com/apache/datafusion/issues/14008#issuecomment-2628923232
   
   Once we do that let's plan to have this PR ready to merge. 
   
   Thanks again @chenkovsky and @adriangb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-31 Thread via GitHub


adriangb commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1937555727


##
datafusion/core/src/execution/session_state.rs:
##
@@ -105,11 +105,11 @@ use uuid::Uuid;
 /// # #[tokio::main]
 /// # async fn main() -> Result<()> {
 /// let state = SessionStateBuilder::new()
-/// .with_config(SessionConfig::new())  

Review Comment:
   Revert?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-31 Thread via GitHub


adriangb commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1937553927


##
datafusion-testing:
##


Review Comment:
   This needs to be reverted



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-29 Thread via GitHub


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2622481316

   Here's what I think is a much simpler and more flexible change: 
https://github.com/apache/datafusion/pull/14362


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-29 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2621372943

   I also have an another approach. 
   
   ```rust
   struct ColumnIndex {
  pub index: usize,
  pub is_metadata_column: bool
   }
   
   impl Into for ColumnIndex {
   }
   
   impl From for ColumnIndex {
   }
   ```
   
   then we can hide METADATA_OFFSET logic without performance concern. what do 
you think about this?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-29 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2621231579

   > > Then we cannot make sure metadata columns always have same column index 
in different tables
   > 
   > Not quite understand the reason, with modified metadata, meta column could 
be considered as the same column like the normal one inside `Schema`. **Field 
index** for accessing field in `Schema` in contiguous like before, indices are 
changed if the schema is changed.
   
   the question whether treat metadata column like normal column, I want to 
listen more ideas.
   
   by the way, at the beginning, I didn't use offset to separate metadata 
column and normal column. you said that it's error prone. I think when we do 
schema migration, non-stable column index is really error prone. did I 
misunderstand something?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-29 Thread via GitHub


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2620931040

   Yep from an initial impression I agree with you @jayzhan211 that does seem 
like it should work. Then you just need to hook into wherever select / 
projections are evaluated and make sure if a column isn't explicitly mentioned 
by name and it has that metadata it is not included in the output 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-28 Thread via GitHub


jayzhan211 commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2620757724

   > Then we cannot make sure metadata columns always have same column index in 
different tables
   
   Not quite understand the reason, with modified metadata, meta column could 
be considered as the same column like the normal one inside `Schema`. **Field 
index** for accessing field in `Schema` in contiguous like before, indices are 
changed if the schema is changed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-28 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2620712360

   > > sorry @jayzhan211 I edited my comment above after you reacted to it I 
think, I felt bad about requesting an explainer when there was already a thread 
with details
   > 
   > Maybe we can think about whether changing the `metadata` in `Field` makes 
sense
   > 
   > ```rust
   > pub struct Field {
   > name: String,
   > data_type: DataType,
   > nullable: bool,
   > dict_id: i64,
   > dict_is_ordered: bool,
   > /// A map of key-value pairs containing additional custom meta data.
   > metadata: HashMap,
   > }
   > 
   > Field.metadata['is_system_generated'] = 'true'
   > ```
   
   If we use hashmap to store this flag. Then we cannot make sure metadata 
columns always have same column index in different tables. I don't know whether 
it's important. but I think it makes something easy.
   for example, when we want to use metadata column somewhere other than 
datafusion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-28 Thread via GitHub


jayzhan211 commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2620677936

   > sorry @jayzhan211 I edited my comment above after you reacted to it I 
think, I felt bad about requesting an explainer when there was already a thread 
with details
   
   Maybe we can think about whether changing the `metadata` in `Field` makes 
sense
   
   
   ```rust
   pub struct Field {
   name: String,
   data_type: DataType,
   nullable: bool,
   dict_id: i64,
   dict_is_ordered: bool,
   /// A map of key-value pairs containing additional custom meta data.
   metadata: HashMap,
   }
   
   Field.metadata['is_system_generated'] = 'true'
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-28 Thread via GitHub


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2620654418

   A high level question: in what ways do metadata columns have to behave 
differently than other columns? Is it just that they don't show up in `SELECT 
*` queries? What other sorts of places do they need to behave differently?
   
   Is there a way we could do this with metadata e.g. if you set 
`datafusion.system_column`: `true` on a field then it's considered a 
system/metadata column?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-28 Thread via GitHub


jayzhan211 commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1933220974


##
datafusion-examples/examples/metadata_columns.rs:
##
@@ -0,0 +1,330 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::any::Any;
+use std::fmt::{self, Debug, Formatter};
+use std::sync::{Arc, Mutex};
+use std::time::Duration;
+
+use arrow::array::{ArrayRef, StringArray, UInt64Array};
+use arrow_schema::SchemaBuilder;
+use async_trait::async_trait;
+use datafusion::arrow::array::{UInt64Builder, UInt8Builder};
+use datafusion::arrow::datatypes::{DataType, Field, Schema, SchemaRef};
+use datafusion::arrow::record_batch::RecordBatch;
+use datafusion::datasource::{TableProvider, TableType};
+use datafusion::error::Result;
+use datafusion::execution::context::TaskContext;
+use datafusion::physical_expr::EquivalenceProperties;
+use datafusion::physical_plan::execution_plan::{Boundedness, EmissionType};
+use datafusion::physical_plan::memory::MemoryStream;
+use datafusion::physical_plan::{
+project_schema, DisplayAs, DisplayFormatType, ExecutionPlan, Partitioning,
+PlanProperties, SendableRecordBatchStream,
+};
+
+use datafusion::prelude::*;
+
+use datafusion::catalog::Session;
+use datafusion_common::METADATA_OFFSET;
+use itertools::Itertools;
+use tokio::time::timeout;
+
+/// This example demonstrates executing a simple query against a custom 
datasource
+#[tokio::main]
+async fn main() -> Result<()> {
+// create our custom datasource and adding some users
+let db = CustomDataSource::default();
+db.populate_users();
+
+search_accounts(db.clone(), "select * from accounts", 3).await?;
+search_accounts(
+db.clone(),
+"select _rowid, _file, * from accounts where _rowid > 1",
+1,
+)
+.await?;
+search_accounts(
+db.clone(),
+"select _rowid, _file, * from accounts where _file = 'file-0'",
+1,
+)
+.await?;
+
+Ok(())
+}
+
+async fn search_accounts(
+db: CustomDataSource,
+sql: &str,
+expected_result_length: usize,
+) -> Result<()> {
+// create local execution context
+let ctx = SessionContext::new();
+ctx.register_table("accounts", Arc::new(db)).unwrap();
+let options = SQLOptions::new().with_allow_ddl(false);
+
+timeout(Duration::from_secs(10), async move {
+let dataframe = ctx.sql_with_options(sql, options).await.unwrap();
+let result = dataframe.collect().await.unwrap();
+let record_batch = result.first().unwrap();
+
+assert_eq!(expected_result_length, record_batch.column(1).len());
+dbg!(record_batch.columns());
+})
+.await
+.unwrap();
+
+Ok(())
+}
+
+/// A User, with an id and a bank account
+#[derive(Clone, Debug)]
+struct User {
+id: u8,
+bank_account: u64,
+}
+
+/// A custom datasource, used to represent a datastore with a single index
+#[derive(Clone)]
+pub struct CustomDataSource {
+inner: Arc>,
+metadata_columns: SchemaRef,
+}
+
+struct CustomDataSourceInner {
+data: Vec,
+}
+
+impl Debug for CustomDataSource {
+fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
+f.write_str("custom_db")
+}
+}
+
+impl CustomDataSource {
+pub(crate) async fn create_physical_plan(
+&self,
+projections: Option<&Vec>,
+schema: SchemaRef,
+) -> Result> {
+Ok(Arc::new(CustomExec::new(projections, schema, self.clone(
+}
+
+pub(crate) fn populate_users(&self) {
+self.add_user(User {
+id: 1,
+bank_account: 9_000,
+});
+self.add_user(User {
+id: 2,
+bank_account: 100,
+});
+self.add_user(User {
+id: 3,
+bank_account: 1_000,
+});
+}
+
+fn add_user(&self, user: User) {
+let mut inner = self.inner.lock().unwrap();
+inner.data.push(user);
+}
+}
+
+impl Default for CustomDataSource {
+fn default() -> Self {
+CustomDataSource {
+inner: Arc::new(Mutex::new(CustomDataSourceInner {
+data: Default::default(),
+})),
+metadata_columns: Arc::new(Schema::new(vec![

Re: [PR] feat: metadata columns [datafusion]

2025-01-28 Thread via GitHub


jayzhan211 commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1933220974


##
datafusion-examples/examples/metadata_columns.rs:
##
@@ -0,0 +1,330 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::any::Any;
+use std::fmt::{self, Debug, Formatter};
+use std::sync::{Arc, Mutex};
+use std::time::Duration;
+
+use arrow::array::{ArrayRef, StringArray, UInt64Array};
+use arrow_schema::SchemaBuilder;
+use async_trait::async_trait;
+use datafusion::arrow::array::{UInt64Builder, UInt8Builder};
+use datafusion::arrow::datatypes::{DataType, Field, Schema, SchemaRef};
+use datafusion::arrow::record_batch::RecordBatch;
+use datafusion::datasource::{TableProvider, TableType};
+use datafusion::error::Result;
+use datafusion::execution::context::TaskContext;
+use datafusion::physical_expr::EquivalenceProperties;
+use datafusion::physical_plan::execution_plan::{Boundedness, EmissionType};
+use datafusion::physical_plan::memory::MemoryStream;
+use datafusion::physical_plan::{
+project_schema, DisplayAs, DisplayFormatType, ExecutionPlan, Partitioning,
+PlanProperties, SendableRecordBatchStream,
+};
+
+use datafusion::prelude::*;
+
+use datafusion::catalog::Session;
+use datafusion_common::METADATA_OFFSET;
+use itertools::Itertools;
+use tokio::time::timeout;
+
+/// This example demonstrates executing a simple query against a custom 
datasource
+#[tokio::main]
+async fn main() -> Result<()> {
+// create our custom datasource and adding some users
+let db = CustomDataSource::default();
+db.populate_users();
+
+search_accounts(db.clone(), "select * from accounts", 3).await?;
+search_accounts(
+db.clone(),
+"select _rowid, _file, * from accounts where _rowid > 1",
+1,
+)
+.await?;
+search_accounts(
+db.clone(),
+"select _rowid, _file, * from accounts where _file = 'file-0'",
+1,
+)
+.await?;
+
+Ok(())
+}
+
+async fn search_accounts(
+db: CustomDataSource,
+sql: &str,
+expected_result_length: usize,
+) -> Result<()> {
+// create local execution context
+let ctx = SessionContext::new();
+ctx.register_table("accounts", Arc::new(db)).unwrap();
+let options = SQLOptions::new().with_allow_ddl(false);
+
+timeout(Duration::from_secs(10), async move {
+let dataframe = ctx.sql_with_options(sql, options).await.unwrap();
+let result = dataframe.collect().await.unwrap();
+let record_batch = result.first().unwrap();
+
+assert_eq!(expected_result_length, record_batch.column(1).len());
+dbg!(record_batch.columns());
+})
+.await
+.unwrap();
+
+Ok(())
+}
+
+/// A User, with an id and a bank account
+#[derive(Clone, Debug)]
+struct User {
+id: u8,
+bank_account: u64,
+}
+
+/// A custom datasource, used to represent a datastore with a single index
+#[derive(Clone)]
+pub struct CustomDataSource {
+inner: Arc>,
+metadata_columns: SchemaRef,
+}
+
+struct CustomDataSourceInner {
+data: Vec,
+}
+
+impl Debug for CustomDataSource {
+fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
+f.write_str("custom_db")
+}
+}
+
+impl CustomDataSource {
+pub(crate) async fn create_physical_plan(
+&self,
+projections: Option<&Vec>,
+schema: SchemaRef,
+) -> Result> {
+Ok(Arc::new(CustomExec::new(projections, schema, self.clone(
+}
+
+pub(crate) fn populate_users(&self) {
+self.add_user(User {
+id: 1,
+bank_account: 9_000,
+});
+self.add_user(User {
+id: 2,
+bank_account: 100,
+});
+self.add_user(User {
+id: 3,
+bank_account: 1_000,
+});
+}
+
+fn add_user(&self, user: User) {
+let mut inner = self.inner.lock().unwrap();
+inner.data.push(user);
+}
+}
+
+impl Default for CustomDataSource {
+fn default() -> Self {
+CustomDataSource {
+inner: Arc::new(Mutex::new(CustomDataSourceInner {
+data: Default::default(),
+})),
+metadata_columns: Arc::new(Schema::new(vec![

Re: [PR] feat: metadata columns [datafusion]

2025-01-28 Thread via GitHub


jayzhan211 commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1933218367


##
datafusion-examples/examples/metadata_columns.rs:
##
@@ -0,0 +1,330 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::any::Any;
+use std::fmt::{self, Debug, Formatter};
+use std::sync::{Arc, Mutex};
+use std::time::Duration;
+
+use arrow::array::{ArrayRef, StringArray, UInt64Array};
+use arrow_schema::SchemaBuilder;
+use async_trait::async_trait;
+use datafusion::arrow::array::{UInt64Builder, UInt8Builder};
+use datafusion::arrow::datatypes::{DataType, Field, Schema, SchemaRef};
+use datafusion::arrow::record_batch::RecordBatch;
+use datafusion::datasource::{TableProvider, TableType};
+use datafusion::error::Result;
+use datafusion::execution::context::TaskContext;
+use datafusion::physical_expr::EquivalenceProperties;
+use datafusion::physical_plan::execution_plan::{Boundedness, EmissionType};
+use datafusion::physical_plan::memory::MemoryStream;
+use datafusion::physical_plan::{
+project_schema, DisplayAs, DisplayFormatType, ExecutionPlan, Partitioning,
+PlanProperties, SendableRecordBatchStream,
+};
+
+use datafusion::prelude::*;
+
+use datafusion::catalog::Session;
+use datafusion_common::METADATA_OFFSET;
+use itertools::Itertools;
+use tokio::time::timeout;
+
+/// This example demonstrates executing a simple query against a custom 
datasource
+#[tokio::main]
+async fn main() -> Result<()> {
+// create our custom datasource and adding some users
+let db = CustomDataSource::default();
+db.populate_users();
+
+search_accounts(db.clone(), "select * from accounts", 3).await?;
+search_accounts(
+db.clone(),
+"select _rowid, _file, * from accounts where _rowid > 1",
+1,
+)
+.await?;
+search_accounts(
+db.clone(),
+"select _rowid, _file, * from accounts where _file = 'file-0'",
+1,
+)
+.await?;
+
+Ok(())
+}
+
+async fn search_accounts(
+db: CustomDataSource,
+sql: &str,
+expected_result_length: usize,
+) -> Result<()> {
+// create local execution context
+let ctx = SessionContext::new();
+ctx.register_table("accounts", Arc::new(db)).unwrap();
+let options = SQLOptions::new().with_allow_ddl(false);
+
+timeout(Duration::from_secs(10), async move {
+let dataframe = ctx.sql_with_options(sql, options).await.unwrap();
+let result = dataframe.collect().await.unwrap();
+let record_batch = result.first().unwrap();
+
+assert_eq!(expected_result_length, record_batch.column(1).len());
+dbg!(record_batch.columns());
+})
+.await
+.unwrap();
+
+Ok(())
+}
+
+/// A User, with an id and a bank account
+#[derive(Clone, Debug)]
+struct User {
+id: u8,
+bank_account: u64,
+}
+
+/// A custom datasource, used to represent a datastore with a single index
+#[derive(Clone)]
+pub struct CustomDataSource {
+inner: Arc>,
+metadata_columns: SchemaRef,
+}
+
+struct CustomDataSourceInner {
+data: Vec,
+}
+
+impl Debug for CustomDataSource {
+fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
+f.write_str("custom_db")
+}
+}
+
+impl CustomDataSource {
+pub(crate) async fn create_physical_plan(
+&self,
+projections: Option<&Vec>,
+schema: SchemaRef,
+) -> Result> {
+Ok(Arc::new(CustomExec::new(projections, schema, self.clone(
+}
+
+pub(crate) fn populate_users(&self) {
+self.add_user(User {
+id: 1,
+bank_account: 9_000,
+});
+self.add_user(User {
+id: 2,
+bank_account: 100,
+});
+self.add_user(User {
+id: 3,
+bank_account: 1_000,
+});
+}
+
+fn add_user(&self, user: User) {
+let mut inner = self.inner.lock().unwrap();
+inner.data.push(user);
+}
+}
+
+impl Default for CustomDataSource {
+fn default() -> Self {
+CustomDataSource {
+inner: Arc::new(Mutex::new(CustomDataSourceInner {
+data: Default::default(),
+})),
+metadata_columns: Arc::new(Schema::new(vec![

Re: [PR] feat: metadata columns [datafusion]

2025-01-28 Thread via GitHub


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2620516737

   sorry @jayzhan211 I edited my comment above after you reacted to it I think


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-28 Thread via GitHub


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2620512623

   (I found previous discussion on point 2 now, see 
https://github.com/apache/datafusion/pull/14057#discussion_r1909641034)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-28 Thread via GitHub


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2620509986

   I opened a PR with a lot of docstrings and some minor code edits: 
https://github.com/chenkovsky/datafusion/pull/1/files
   
   @chenkovsky could you take a look and see if it's mergeable? I'm hoping 
since it's mostly docstrings there shouldn't be much issue.
   
   I do have two higher level pieces of feedback on this change:
   1. I suggest we rename these to `system columns` as they are called in 
Postgres, especially because of confusion with metadata of the schema 
(`DFSchema.inner.schema.metadata`) vs schema of the metadata 
(`DFSchema.metadata`).
   2. Is it really not possible to have a better API than the global 
`METADATA_OFFSET`? It seems really... terrible honestly. I did not try to 
implement something else so maybe there's a good reason for it that I am not 
seeing, but could @chenkovsky or @jayzhan211 give me a quick pitch on why this 
is necessary before I embarked on a futile path. In my mind we could always 
append metadata columns to the end right? If you have `a,b` and `meta1,meta2` 
the indices are `1,2,3,4` respectively. And if you append `c` the metadata 
columns get bumped. Do we promise that indices are preserved in some way if you 
modify the schema? Would this break anything?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-28 Thread via GitHub


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2620005591

   I will try to help here Andrew.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-28 Thread via GitHub


alamb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2619979342

   I am somewhat overwhelmed with 
   - https://github.com/apache/datafusion/issues/14008
   (and also 
   - https://github.com/apache/arrow-rs/issues/6929 / 
https://github.com/apache/arrow-rs/issues/7034
   
   It is on my radar to merge. Maybe someone else can make some of the doc / 
comment changes I suggested above (either on this PR or as a follow on) which 
would make it easier for me to merge
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-16 Thread via GitHub


alamb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2596012697

   I plan to merge this PR in tomorrow and file follow on tickets unless anyone 
would like additional time to review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-16 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2594790967

   > > > Can these metadata columns utilize normal column properties, like 
ordering equivalences, constantness, distinctness etc.? For example, AFAIU 
rowid is an ordered column, and if I sort the table by rowid, the SortExec 
would be removed? (it seems to me not yet at this point) Can we iterate over 
the design to support those capabilities, too?
   > > 
   > > 
   > > I with this PR a custom table provider that was ordered by row_id could 
communicate that information to avoid a SortExec
   > > From what I can tell, the metadata columns is only a notion in the 
`LogicalPlan`
   > > Specifically, the `ExecutionPlan` returned by the provider is no 
different than any other `ExecutionPlan` so it can communicate sortedness via 
`ExecutionPlan::properties` as normal
   > 
   > What I mean is:
   > 
   > 
https://github.com/chenkovsky/datafusion/blob/5c4b5c4c7aee47b6287e5fcf32d87485ee1c9e37/datafusion/core/tests/sql/metadata_columns.rs#L389
   > 
   > When I print this query, there exists a SortExec for _rowid. But what I 
understand is _rowid should be a one-by-one increasing column?
   
   Maybe not, I use vec to store values in test, but if the inner datastructure 
is btree, the scan order is not always increasing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-16 Thread via GitHub


berkaysynnada commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2594770494

   > > Can these metadata columns utilize normal column properties, like 
ordering equivalences, constantness, distinctness etc.? For example, AFAIU 
rowid is an ordered column, and if I sort the table by rowid, the SortExec 
would be removed? (it seems to me not yet at this point) Can we iterate over 
the design to support those capabilities, too?
   > 
   > I with this PR a custom table provider that was ordered by row_id could 
communicate that information to avoid a SortExec
   > 
   > From what I can tell, the metadata columns is only a notion in the 
`LogicalPlan`
   > 
   > Specifically, the `ExecutionPlan` returned by the provider is no different 
than any other `ExecutionPlan` so it can communicate sortedness via 
`ExecutionPlan::properties` as normal
   
   What I mean is:
   
   
https://github.com/chenkovsky/datafusion/blob/5c4b5c4c7aee47b6287e5fcf32d87485ee1c9e37/datafusion/core/tests/sql/metadata_columns.rs#L389
   
   When I print this query, there exists a SortExec for _rowid. But what I 
understand is _rowid should be a one-by-one increasing column?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-15 Thread via GitHub


alamb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2593033656

   > Can these metadata columns utilize normal column properties, like ordering 
equivalences, constantness, distinctness etc.? For example, AFAIU rowid is an 
ordered column, and if I sort the table by rowid, the SortExec would be 
removed? (it seems to me not yet at this point) Can we iterate over the design 
to support those capabilities, too?
   
   I with this PR a custom table provider that was ordered by row_id could 
communicate that information to avoid a SortExec
   
   From what I can tell, the metadata columns is only a notion in the 
`LogicalPlan`
   
   Specifically, the `ExecutionPlan` returned by the provider is no different 
than any other `ExecutionPlan` so it can communicate sortedness via 
`ExecutionPlan::properties` as normal
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-13 Thread via GitHub


berkaysynnada commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2587507757

   Can these metadata columns utilize normal column properties, like ordering 
equivalences, constantness, distinctness etc.? For example, AFAIU rowid is an 
ordered column, and if I sort the table by rowid, the SortExec would be 
removed? (it seems to me not yet at this point) Can we iterate over the design 
to support those capabilities, too?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-13 Thread via GitHub


alamb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2586885945

   FWIW I ran the planning benchmarks on this branch and see no measurable 
difference. ✅ 
   
   ```
   ++ critcmp main feature_metadata_columns
   groupfeature_metadata_columns   main
   -   
   logical_select_all_from_1000 1.00  5.2±0.03ms? ?/sec1.01 
 5.3±0.04ms? ?/sec
   physical_plan_clickbench_all 1.00226.7±1.82ms? ?/sec1.00 
   227.1±1.40ms? ?/sec
   physical_plan_tpcds_all  1.00   1380.0±3.29ms? ?/sec1.00 
  1378.8±5.32ms? ?/sec
   physical_plan_tpch_all   1.00 90.2±0.70ms? ?/sec1.01 
91.1±1.32ms? ?/sec
   physical_select_all_from_10001.02 42.4±0.30ms? ?/sec1.00 
41.5±0.16ms? ?/sec
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-12 Thread via GitHub


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2585948359

   I guess the naming doesn't really hurt our use case so okay let's go with 
that if it means something in the domain in general 👍🏻 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-12 Thread via GitHub


Omega359 commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2585929480

   Metadata column is the name I'm familiar with in other systems. For example, 
[spark](https://spark.apache.org/docs/3.5.4/api/java/org/apache/spark/sql/connector/catalog/MetadataColumn.html)/[databricks](https://docs.databricks.com/en/ingestion/file-metadata-column.html)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-12 Thread via GitHub


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2585764005

   My only question is if "metadata" is the right name for these columns? Could 
it be "system" columns or something like that?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-12 Thread via GitHub


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2585758539

   We want this as well to hide "special" internal columns we create to speed 
up JSON columns. +1 for the feature!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-12 Thread via GitHub


alamb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2585711106

   Something other people have asked for in the past (whihc I can't find now) 
is the ability to know what file a particular row came from in a listing table 
that combines multiple files


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-12 Thread via GitHub


alamb commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1912435293


##
datafusion/common/src/dfschema.rs:
##
@@ -106,37 +106,175 @@ pub type DFSchemaRef = Arc;
 /// ```
 #[derive(Debug, Clone, PartialEq, Eq)]
 pub struct DFSchema {
+inner: QualifiedSchema,
+/// Stores functional dependencies in the schema.
+functional_dependencies: FunctionalDependencies,
+/// metadata columns
+metadata: Option,
+}
+
+pub const METADATA_OFFSET: usize = usize::MAX >> 1;

Review Comment:
   Can you please document what this is and how it relates to `DFSchema::inner`



##
datafusion/core/tests/sql/metadata_columns.rs:
##
@@ -0,0 +1,361 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::any::Any;
+use std::fmt::{self, Debug, Formatter};
+use std::sync::{Arc, Mutex};
+
+use arrow::compute::concat_batches;
+use arrow_array::{ArrayRef, UInt64Array};
+use arrow_schema::SchemaBuilder;
+use async_trait::async_trait;
+use datafusion::arrow::array::{UInt64Builder, UInt8Builder};
+use datafusion::arrow::datatypes::{DataType, Field, Schema, SchemaRef};
+use datafusion::arrow::record_batch::RecordBatch;
+use datafusion::datasource::file_format::csv::CsvSerializer;
+use datafusion::datasource::file_format::write::BatchSerializer;
+use datafusion::datasource::{TableProvider, TableType};
+use datafusion::error::Result;
+use datafusion::execution::context::TaskContext;
+use datafusion::physical_expr::EquivalenceProperties;
+use datafusion::physical_plan::execution_plan::{Boundedness, EmissionType};
+use datafusion::physical_plan::memory::MemoryStream;
+use datafusion::physical_plan::{
+project_schema, DisplayAs, DisplayFormatType, ExecutionPlan, Partitioning,
+PlanProperties, SendableRecordBatchStream,
+};
+use datafusion::prelude::*;
+
+use datafusion::catalog::Session;
+use datafusion_common::METADATA_OFFSET;
+use itertools::Itertools;
+
+/// A User, with an id and a bank account
+#[derive(Clone, Debug)]
+struct User {
+id: u8,
+bank_account: u64,
+}
+
+/// A custom datasource, used to represent a datastore with a single index
+#[derive(Clone)]
+pub struct CustomDataSource {
+inner: Arc>,
+metadata_columns: SchemaRef,
+}
+
+struct CustomDataSourceInner {
+data: Vec,
+}
+
+impl Debug for CustomDataSource {
+fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
+f.write_str("custom_db")
+}
+}
+
+impl CustomDataSource {
+pub(crate) async fn create_physical_plan(
+&self,
+projections: Option<&Vec>,
+schema: SchemaRef,
+) -> Result> {
+Ok(Arc::new(CustomExec::new(projections, schema, self.clone(
+}
+
+pub(crate) fn populate_users(&self) {
+self.add_user(User {
+id: 1,
+bank_account: 9_000,
+});
+self.add_user(User {
+id: 2,
+bank_account: 100,
+});
+self.add_user(User {
+id: 3,
+bank_account: 1_000,
+});
+}
+
+fn add_user(&self, user: User) {
+let mut inner = self.inner.lock().unwrap();
+inner.data.push(user);
+}
+}
+
+impl Default for CustomDataSource {
+fn default() -> Self {
+CustomDataSource {
+inner: Arc::new(Mutex::new(CustomDataSourceInner {
+data: Default::default(),
+})),
+metadata_columns: Arc::new(Schema::new(vec![Field::new(
+"_rowid",
+DataType::UInt64,
+false,
+)])),
+}
+}
+}
+
+#[async_trait]
+impl TableProvider for CustomDataSource {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn schema(&self) -> SchemaRef {
+SchemaRef::new(Schema::new(vec![
+Field::new("id", DataType::UInt8, false),
+Field::new("bank_account", DataType::UInt64, true),
+]))
+}
+
+fn metadata_columns(&self) -> Option {
+Some(self.metadata_columns.clone())
+}
+
+fn table_type(&self) -> TableType {
+TableType::Base
+}
+
+async fn scan(
+&self,
+_state: &dyn Session,
+projection: Option<&Vec>,
+//

Re: [PR] feat: metadata columns [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911886029


##
datafusion/common/src/dfschema.rs:
##
@@ -442,22 +603,24 @@ impl DFSchema {
 
 /// Find all fields that match the given name
 pub fn fields_with_unqualified_name(&self, name: &str) -> Vec<&Field> {
-self.fields()
-.iter()
-.filter(|field| field.name() == name)
-.map(|f| f.as_ref())
-.collect()
+let mut fields: Vec<&Field> = 
self.inner.fields_with_unqualified_name(name);
+if let Some(schema) = self.metadata_schema() {
+fields.append(&mut schema.fields_with_unqualified_name(name));

Review Comment:
   ```suggestion
   fields.append(schema.fields_with_unqualified_name(name));
   ```



##
datafusion/common/src/dfschema.rs:
##
@@ -442,22 +603,24 @@ impl DFSchema {
 
 /// Find all fields that match the given name
 pub fn fields_with_unqualified_name(&self, name: &str) -> Vec<&Field> {
-self.fields()
-.iter()
-.filter(|field| field.name() == name)
-.map(|f| f.as_ref())
-.collect()
+let mut fields: Vec<&Field> = 
self.inner.fields_with_unqualified_name(name);
+if let Some(schema) = self.metadata_schema() {
+fields.append(&mut schema.fields_with_unqualified_name(name));
+}
+fields
 }
 
 /// Find all fields that match the given name and return them with their 
qualifier
 pub fn qualified_fields_with_unqualified_name(
 &self,
 name: &str,
 ) -> Vec<(Option<&TableReference>, &Field)> {
-self.iter()
-.filter(|(_, field)| field.name() == name)
-.map(|(qualifier, field)| (qualifier, field.as_ref()))
-.collect()
+let mut fields: Vec<(Option<&TableReference>, &Field)> =
+self.inner.qualified_fields_with_unqualified_name(name);
+if let Some(schema) = self.metadata_schema() {
+fields.append(&mut 
schema.qualified_fields_with_unqualified_name(name));

Review Comment:
   ```suggestion
   
fields.append(schema.qualified_fields_with_unqualified_name(name));
   ```



##
datafusion/expr/src/logical_plan/plan.rs:
##
@@ -2600,6 +2619,16 @@ impl TableScan {
 let df_schema = DFSchema::new_with_metadata(
 p.iter()
 .map(|i| {
+if *i >= schema.fields.len() {
+if let Some(metadata) = &metadata {
+return (
+Some(table_name.clone()),
+Arc::new(
+metadata.field(*i - 
METADATA_OFFSET).clone(),

Review Comment:
   handle where i < METADATA_OFFSET



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911887850


##
datafusion/common/src/dfschema.rs:
##
@@ -319,52 +465,56 @@ impl DFSchema {
 qualifiers.push(qualifier.cloned());
 }
 }
-let mut metadata = self.inner.metadata.clone();
-metadata.extend(other_schema.inner.metadata.clone());
+let mut metadata = self.inner.schema.metadata.clone();
+metadata.extend(other_schema.inner.schema.metadata.clone());
 
 let finished = schema_builder.finish();
 let finished_with_metadata = finished.with_metadata(metadata);
-self.inner = finished_with_metadata.into();
-self.field_qualifiers.extend(qualifiers);
+self.inner.schema = finished_with_metadata.into();
+self.inner.field_qualifiers.extend(qualifiers);
 }
 
 /// Get a list of fields
 pub fn fields(&self) -> &Fields {
-&self.inner.fields
+&self.inner.schema.fields
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector
 pub fn field(&self, i: usize) -> &Field {
-&self.inner.fields[i]
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.field(i - self.inner.len());
+}
+}
+self.inner.field(i)
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector and its qualifier
 pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, 
&Field) {
-(self.field_qualifiers[i].as_ref(), self.field(i))
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.qualified_field(i - self.inner.len());
+}
+}
+self.inner.qualified_field(i)

Review Comment:
   Okay this approach looks good to me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-10 Thread via GitHub


chenkovsky commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911868813


##
datafusion/common/src/dfschema.rs:
##
@@ -319,52 +465,56 @@ impl DFSchema {
 qualifiers.push(qualifier.cloned());
 }
 }
-let mut metadata = self.inner.metadata.clone();
-metadata.extend(other_schema.inner.metadata.clone());
+let mut metadata = self.inner.schema.metadata.clone();
+metadata.extend(other_schema.inner.schema.metadata.clone());
 
 let finished = schema_builder.finish();
 let finished_with_metadata = finished.with_metadata(metadata);
-self.inner = finished_with_metadata.into();
-self.field_qualifiers.extend(qualifiers);
+self.inner.schema = finished_with_metadata.into();
+self.inner.field_qualifiers.extend(qualifiers);
 }
 
 /// Get a list of fields
 pub fn fields(&self) -> &Fields {
-&self.inner.fields
+&self.inner.schema.fields
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector
 pub fn field(&self, i: usize) -> &Field {
-&self.inner.fields[i]
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.field(i - self.inner.len());
+}
+}
+self.inner.field(i)
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector and its qualifier
 pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, 
&Field) {
-(self.field_qualifiers[i].as_ref(), self.field(i))
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.qualified_field(i - self.inner.len());
+}
+}
+self.inner.qualified_field(i)

Review Comment:
   Hi @jayzhan211 I pushed a commit, could you please review it again?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911863259


##
datafusion/common/src/dfschema.rs:
##
@@ -319,52 +465,56 @@ impl DFSchema {
 qualifiers.push(qualifier.cloned());
 }
 }
-let mut metadata = self.inner.metadata.clone();
-metadata.extend(other_schema.inner.metadata.clone());
+let mut metadata = self.inner.schema.metadata.clone();
+metadata.extend(other_schema.inner.schema.metadata.clone());
 
 let finished = schema_builder.finish();
 let finished_with_metadata = finished.with_metadata(metadata);
-self.inner = finished_with_metadata.into();
-self.field_qualifiers.extend(qualifiers);
+self.inner.schema = finished_with_metadata.into();
+self.inner.field_qualifiers.extend(qualifiers);
 }
 
 /// Get a list of fields
 pub fn fields(&self) -> &Fields {
-&self.inner.fields
+&self.inner.schema.fields
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector
 pub fn field(&self, i: usize) -> &Field {
-&self.inner.fields[i]
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.field(i - self.inner.len());
+}
+}
+self.inner.field(i)
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector and its qualifier
 pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, 
&Field) {
-(self.field_qualifiers[i].as_ref(), self.field(i))
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.qualified_field(i - self.inner.len());
+}
+}
+self.inner.qualified_field(i)

Review Comment:
   `(-1 as usize)` how does this large offset work? We have vector instead of 
map



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911863137


##
datafusion/common/src/dfschema.rs:
##
@@ -319,52 +465,56 @@ impl DFSchema {
 qualifiers.push(qualifier.cloned());
 }
 }
-let mut metadata = self.inner.metadata.clone();
-metadata.extend(other_schema.inner.metadata.clone());
+let mut metadata = self.inner.schema.metadata.clone();
+metadata.extend(other_schema.inner.schema.metadata.clone());
 
 let finished = schema_builder.finish();
 let finished_with_metadata = finished.with_metadata(metadata);
-self.inner = finished_with_metadata.into();
-self.field_qualifiers.extend(qualifiers);
+self.inner.schema = finished_with_metadata.into();
+self.inner.field_qualifiers.extend(qualifiers);
 }
 
 /// Get a list of fields
 pub fn fields(&self) -> &Fields {
-&self.inner.fields
+&self.inner.schema.fields
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector
 pub fn field(&self, i: usize) -> &Field {
-&self.inner.fields[i]
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.field(i - self.inner.len());
+}
+}
+self.inner.field(i)
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector and its qualifier
 pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, 
&Field) {
-(self.field_qualifiers[i].as_ref(), self.field(i))
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.qualified_field(i - self.inner.len());
+}
+}
+self.inner.qualified_field(i)

Review Comment:
   `(-1 as usize)` what is this trick? This is a large offset, we have vector 
instead of map, how does this work?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911863137


##
datafusion/common/src/dfschema.rs:
##
@@ -319,52 +465,56 @@ impl DFSchema {
 qualifiers.push(qualifier.cloned());
 }
 }
-let mut metadata = self.inner.metadata.clone();
-metadata.extend(other_schema.inner.metadata.clone());
+let mut metadata = self.inner.schema.metadata.clone();
+metadata.extend(other_schema.inner.schema.metadata.clone());
 
 let finished = schema_builder.finish();
 let finished_with_metadata = finished.with_metadata(metadata);
-self.inner = finished_with_metadata.into();
-self.field_qualifiers.extend(qualifiers);
+self.inner.schema = finished_with_metadata.into();
+self.inner.field_qualifiers.extend(qualifiers);
 }
 
 /// Get a list of fields
 pub fn fields(&self) -> &Fields {
-&self.inner.fields
+&self.inner.schema.fields
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector
 pub fn field(&self, i: usize) -> &Field {
-&self.inner.fields[i]
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.field(i - self.inner.len());
+}
+}
+self.inner.field(i)
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector and its qualifier
 pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, 
&Field) {
-(self.field_qualifiers[i].as_ref(), self.field(i))
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.qualified_field(i - self.inner.len());
+}
+}
+self.inner.qualified_field(i)

Review Comment:
   `(-1 as usize)` what is this trick? This is a large offset, we have vector 
instead of map, how does this work?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-10 Thread via GitHub


chenkovsky commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911814285


##
datafusion/common/src/dfschema.rs:
##
@@ -319,52 +465,56 @@ impl DFSchema {
 qualifiers.push(qualifier.cloned());
 }
 }
-let mut metadata = self.inner.metadata.clone();
-metadata.extend(other_schema.inner.metadata.clone());
+let mut metadata = self.inner.schema.metadata.clone();
+metadata.extend(other_schema.inner.schema.metadata.clone());
 
 let finished = schema_builder.finish();
 let finished_with_metadata = finished.with_metadata(metadata);
-self.inner = finished_with_metadata.into();
-self.field_qualifiers.extend(qualifiers);
+self.inner.schema = finished_with_metadata.into();
+self.inner.field_qualifiers.extend(qualifiers);
 }
 
 /// Get a list of fields
 pub fn fields(&self) -> &Fields {
-&self.inner.fields
+&self.inner.schema.fields
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector
 pub fn field(&self, i: usize) -> &Field {
-&self.inner.fields[i]
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.field(i - self.inner.len());
+}
+}
+self.inner.field(i)
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector and its qualifier
 pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, 
&Field) {
-(self.field_qualifiers[i].as_ref(), self.field(i))
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.qualified_field(i - self.inner.len());
+}
+}
+self.inner.qualified_field(i)

Review Comment:
   I see. it's error prone. Can we change the offsets of metadata columns, e.g. 
(-1 as usize) (-2 as usize) then there's no such problem. I see some databases 
use this trick.
   
   > Isn't only where you need meta columns you need to change the code with 
meta_field? Others code that call with field remain the same.
   
   yes, we can. but many apis use Vec to represent columns. I have to 
change many structs and method defnitions to pass extra parameters.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-10 Thread via GitHub


chenkovsky commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911814285


##
datafusion/common/src/dfschema.rs:
##
@@ -319,52 +465,56 @@ impl DFSchema {
 qualifiers.push(qualifier.cloned());
 }
 }
-let mut metadata = self.inner.metadata.clone();
-metadata.extend(other_schema.inner.metadata.clone());
+let mut metadata = self.inner.schema.metadata.clone();
+metadata.extend(other_schema.inner.schema.metadata.clone());
 
 let finished = schema_builder.finish();
 let finished_with_metadata = finished.with_metadata(metadata);
-self.inner = finished_with_metadata.into();
-self.field_qualifiers.extend(qualifiers);
+self.inner.schema = finished_with_metadata.into();
+self.inner.field_qualifiers.extend(qualifiers);
 }
 
 /// Get a list of fields
 pub fn fields(&self) -> &Fields {
-&self.inner.fields
+&self.inner.schema.fields
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector
 pub fn field(&self, i: usize) -> &Field {
-&self.inner.fields[i]
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.field(i - self.inner.len());
+}
+}
+self.inner.field(i)
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector and its qualifier
 pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, 
&Field) {
-(self.field_qualifiers[i].as_ref(), self.field(i))
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.qualified_field(i - self.inner.len());
+}
+}
+self.inner.qualified_field(i)

Review Comment:
   I see. it's error prone. Can we change the offsets of metadata columns, e.g. 
(-1 as usize) (-2 as usize) then there's no such problem. I see some databases 
use this trick.
   
   > Isn't only where you need meta columns you need to change the code with 
meta_field? Others code that call with field remain the same.
   
   yes, we can. but many api use Vec to represent columns. I have to 
change many structs and method defnitions to pass extra parameters.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911800158


##
datafusion/common/src/dfschema.rs:
##
@@ -319,52 +465,56 @@ impl DFSchema {
 qualifiers.push(qualifier.cloned());
 }
 }
-let mut metadata = self.inner.metadata.clone();
-metadata.extend(other_schema.inner.metadata.clone());
+let mut metadata = self.inner.schema.metadata.clone();
+metadata.extend(other_schema.inner.schema.metadata.clone());
 
 let finished = schema_builder.finish();
 let finished_with_metadata = finished.with_metadata(metadata);
-self.inner = finished_with_metadata.into();
-self.field_qualifiers.extend(qualifiers);
+self.inner.schema = finished_with_metadata.into();
+self.inner.field_qualifiers.extend(qualifiers);
 }
 
 /// Get a list of fields
 pub fn fields(&self) -> &Fields {
-&self.inner.fields
+&self.inner.schema.fields
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector
 pub fn field(&self, i: usize) -> &Field {
-&self.inner.fields[i]
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.field(i - self.inner.len());
+}
+}
+self.inner.field(i)
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector and its qualifier
 pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, 
&Field) {
-(self.field_qualifiers[i].as_ref(), self.field(i))
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.qualified_field(i - self.inner.len());
+}
+}
+self.inner.qualified_field(i)

Review Comment:
   Isn't only where you need meta columns you need to change the code with 
`meta_field`? Others code that call with `field` remain the same.
   
   The downside of the current approach is that whenever the schema is changed, 
the index of meta columns need to adjust too. I think this is error prone. 
Minimize the dependency of meta schema and schema is better



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911800158


##
datafusion/common/src/dfschema.rs:
##
@@ -319,52 +465,56 @@ impl DFSchema {
 qualifiers.push(qualifier.cloned());
 }
 }
-let mut metadata = self.inner.metadata.clone();
-metadata.extend(other_schema.inner.metadata.clone());
+let mut metadata = self.inner.schema.metadata.clone();
+metadata.extend(other_schema.inner.schema.metadata.clone());
 
 let finished = schema_builder.finish();
 let finished_with_metadata = finished.with_metadata(metadata);
-self.inner = finished_with_metadata.into();
-self.field_qualifiers.extend(qualifiers);
+self.inner.schema = finished_with_metadata.into();
+self.inner.field_qualifiers.extend(qualifiers);
 }
 
 /// Get a list of fields
 pub fn fields(&self) -> &Fields {
-&self.inner.fields
+&self.inner.schema.fields
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector
 pub fn field(&self, i: usize) -> &Field {
-&self.inner.fields[i]
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.field(i - self.inner.len());
+}
+}
+self.inner.field(i)
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector and its qualifier
 pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, 
&Field) {
-(self.field_qualifiers[i].as_ref(), self.field(i))
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.qualified_field(i - self.inner.len());
+}
+}
+self.inner.qualified_field(i)

Review Comment:
   Isn't only where you need meta columns you need to change the code with 
`meta_field`? Others code that call with `field` remain the same.
   
   The downside of the current approach is that whenever the schema is changed, 
the index of meta columns need to adjust too. I think this is error prone.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-09 Thread via GitHub


chenkovsky commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1909953573


##
datafusion/common/src/dfschema.rs:
##
@@ -319,52 +465,56 @@ impl DFSchema {
 qualifiers.push(qualifier.cloned());
 }
 }
-let mut metadata = self.inner.metadata.clone();
-metadata.extend(other_schema.inner.metadata.clone());
+let mut metadata = self.inner.schema.metadata.clone();
+metadata.extend(other_schema.inner.schema.metadata.clone());
 
 let finished = schema_builder.finish();
 let finished_with_metadata = finished.with_metadata(metadata);
-self.inner = finished_with_metadata.into();
-self.field_qualifiers.extend(qualifiers);
+self.inner.schema = finished_with_metadata.into();
+self.inner.field_qualifiers.extend(qualifiers);
 }
 
 /// Get a list of fields
 pub fn fields(&self) -> &Fields {
-&self.inner.fields
+&self.inner.schema.fields
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector
 pub fn field(&self, i: usize) -> &Field {
-&self.inner.fields[i]
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.field(i - self.inner.len());
+}
+}
+self.inner.field(i)
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector and its qualifier
 pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, 
&Field) {
-(self.field_qualifiers[i].as_ref(), self.field(i))
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.qualified_field(i - self.inner.len());
+}
+}
+self.inner.qualified_field(i)

Review Comment:
   actually implementing another method was my first attempt. but I found that 
I need to change a lot of code, because column index is used everywhere. that's 
why in currently implementation metadata column has index + len(fields).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-09 Thread via GitHub


jayzhan211 commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1909641034


##
datafusion/common/src/dfschema.rs:
##
@@ -319,52 +465,56 @@ impl DFSchema {
 qualifiers.push(qualifier.cloned());
 }
 }
-let mut metadata = self.inner.metadata.clone();
-metadata.extend(other_schema.inner.metadata.clone());
+let mut metadata = self.inner.schema.metadata.clone();
+metadata.extend(other_schema.inner.schema.metadata.clone());
 
 let finished = schema_builder.finish();
 let finished_with_metadata = finished.with_metadata(metadata);
-self.inner = finished_with_metadata.into();
-self.field_qualifiers.extend(qualifiers);
+self.inner.schema = finished_with_metadata.into();
+self.inner.field_qualifiers.extend(qualifiers);
 }
 
 /// Get a list of fields
 pub fn fields(&self) -> &Fields {
-&self.inner.fields
+&self.inner.schema.fields
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector
 pub fn field(&self, i: usize) -> &Field {
-&self.inner.fields[i]
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.field(i - self.inner.len());
+}
+}
+self.inner.field(i)
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector and its qualifier
 pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, 
&Field) {
-(self.field_qualifiers[i].as_ref(), self.field(i))
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.qualified_field(i - self.inner.len());
+}
+}
+self.inner.qualified_field(i)

Review Comment:
   Is it better not to mix inner field and meta field?
   
   maybe we need another method `meta_field(&self, i: usize)`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org