Hi, Venkat, Thanks for the FLIP, it sounds good to support nested fields filter pushdown. Based on the design of flip and the above options, I would like to make a few suggestions:
1. At present, introducing NestedFieldReferenceExpression looks like a better solution, which can fully meet our requirements while reducing modifications to base class FieldReferenceExpression. In the long run, I tend to abstract a basic class for NestedFieldReferenceExpression and FieldReferenceExpression as u suggested. 2. Personally, I don't recommend introducing *supportsNestedFilters() in supportsFilterPushdown. We just need to better declare the return value of the method *applyFilters. 3. Finally, I think we need to look at the costs and benefits of unifying the SupportsFilterPushDown and SupportsProjectionPushDown (or others) from the perspective of interface implementers. A stable API can reduce user development and change costs, if the current API can fully meet the functional requirements at the framework level, I personal suggest reducing the impact on connector developers. Regards, Yunhong Zheng (Swuferhong) Venkatakrishnan Sowrirajan <[email protected]> 于2023年8月25日周五 01:25写道: > To keep it backwards compatible, introduce another API *applyAggregates > *with > *List<ReferenceExpression> *when nested field support is added and > deprecate the current API. This will by default throw an exception. In > flink planner, *applyAggregates *with nested fields and if it throws > exception then *applyAggregates* without nested fields. > > Regards > Venkata krishnan > > > On Thu, Aug 24, 2023 at 10:13 AM Venkatakrishnan Sowrirajan < > [email protected]> wrote: > > > Jark, > > > > How about having a separate NestedFieldReferenceExpression, and > >> abstracting a common base class "ReferenceExpression" for > >> NestedFieldReferenceExpression and FieldReferenceExpression? This makes > >> unifying expressions in > >> "SupportsProjectionPushdown#applyProjections(List<ReferenceExpression> > >> ...)" > >> possible. > > > > This should be fine for *SupportsProjectionPushDown* and > > *SupportsFilterPushDown*. One concern in the case of > > *SupportsAggregatePushDown* with nested fields support (to be added in > > the future), with this proposal, the API will become backwards > incompatible > > as the *args *for the aggregate function is > *List<FieldReferenceExpression> > > *that needs to change to *List<ReferenceExpression>*. > > > > Regards > > Venkata krishnan > > > > > > On Thu, Aug 24, 2023 at 1:18 AM Jark Wu <[email protected]> wrote: > > > >> Hi Becket, > >> > >> I think it is the second case, that a FieldReferenceExpression is > >> constructed > >> by the framework and passed to the connector (interfaces listed by > >> Venkata[1] > >> and Catalog#listPartitionsByFilter). Besides, understanding the nested > >> field > >> is optional for users/connectors (just treat it as an unknown expression > >> if > >> the > >> connector doesn't want to support it). > >> > >> If we extend FieldReferenceExpression, in the case of "where col.nested > > > >> 10", > >> for the connectors already supported filter/delete pushdown, they may > >> wrongly > >> pushdown "col > 10" instead of "nested > 10" because they still treat > >> FieldReferenceExpression as a top-level column. This problem can be > >> resolved > >> by introducing an additional "supportedNestedPushdown" for each > interface, > >> but that method is not elegant and is hard to remove in the future, and > >> this could > >> be avoided if we have a separate NestedFieldReferenceExpression. > >> > >> If we want to extend FieldReferenceExpression, we have to add > protections > >> for every related API in one shot. Besides, FieldReferenceExpression is > a > >> fundamental class in the planner, we have to go through all the code > that > >> is using it to make sure it properly handling it if it is a nested field > >> which > >> is a big effort for the community. > >> > >> If we were designing this API on day 1, I fully support merging them in > a > >> FieldReferenceExpression. But in this case, I'm thinking about how to > >> provide > >> users with a smooth migration path, and allow the community to gradually > >> put efforts into evolving the API, and not block the "Nested Fields > Filter > >> Pushdown" > >> requirement. > >> > >> How about having a separate NestedFieldReferenceExpression, and > >> abstracting a common base class "ReferenceExpression" for > >> NestedFieldReferenceExpression and FieldReferenceExpression? This makes > >> unifying expressions in > >> "SupportsProjectionPushdown#applyProjections(List<ReferenceExpression> > >> ...)" > >> possible. > >> > >> Best, > >> Jark > >> > >> On Thu, 24 Aug 2023 at 07:00, Venkatakrishnan Sowrirajan < > >> [email protected]> > >> wrote: > >> > >> > Becket and Jark, > >> > > >> > Deprecate all the other > >> > > methods except tryApplyFilters() and tryApplyProjections(). > >> > > >> > For *SupportsProjectionPushDown*, we still need a > >> > *supportsNestedProjections* API on the table source as some of the > table > >> > sources might not be able to handle nested fields and therefore the > >> Flink > >> > planner should not push down the nested projections or else the > >> > *applyProjection > >> > *API has to be appropriately changed to return > >> > *unconvertibleProjections *similar > >> > to *SupportsFilterPushDown*. > >> > > >> > Or we have to introduce two different applyProjections() > >> > > methods for FieldReferenceExpression / > NestedFieldReferenceExpression > >> > > respectively. > >> > > >> > Agree this is not preferred. Given that *supportNestedProjections > >> *cannot > >> > be deprecated/removed based on the current API form, extending > >> > *FieldReferenceExpression* to support nested fields should be okay. > >> > > >> > Another alternative could be to change *applyProjections *to take > >> > List<ResolvedExpression> and on the connector side they choose to > handle > >> > *FieldReferenceExpression* and *NestedFieldReferenceExpression *as > >> > applicable and return the remainingProjections. In the case of nested > >> field > >> > projections not supported, it should return them back but only > >> projecting > >> > the top level fields. IMO, this is also *not preferred*. > >> > > >> > *SupportsAggregatePushDown* > >> > > >> > *AggregateExpression *currently takes in a list of > >> > *FieldReferenceExpression* as args for the aggregate function, if in > >> future > >> > *SupportsAggregatePushDown* adds support for aggregate pushdown on > >> nested > >> > fields then the AggregateExpression API also has to change if a new > >> > NestedFieldReferenceExpression is introduced for nested fields. > >> > > >> > If we add a > >> > > flag for each new filter, > >> > > the interface will be filled with lots of flags (e.g., > >> supportsBetween, > >> > > supportsIN) > >> > > >> > In an ideal situation, I completely agree with you. But in the current > >> > state, *supportsNestedFilters* can act as a bridge to reach the > eventual > >> > desired state which is to have a clean and consistent set of APIs > >> > throughout all Supports*PushDown. > >> > > >> > Also shared some thoughts on the end state API > >> > < > >> > > >> > https://urldefense.com/v3/__https://docs.google.com/document/d/1stLRPKOcxlEv8eHblkrOh0Zf5PLM-h76WMhEINHOyPY/edit?usp=sharing__;!!IKRxdwAv5BmarQ!ZZ2nS1PYlXLnEGFcikS3NsYG7tMaV3wU_z7FmvihNwQBmoLZk2WmcpuRWszK0FFmsInh9A6cndkJrQ$ > >> > > > >> > with extension to the *FieldReferenceExpression* to support nested > >> fields. > >> > Please take a look. > >> > > >> > Regards > >> > Venkata krishnan > >> > > >> > On Tue, Aug 22, 2023 at 5:02 PM Becket Qin <[email protected]> > >> wrote: > >> > > >> > > Hi Jark, > >> > > > >> > > Regarding the migration path, it would be useful to scrutinize the > use > >> > case > >> > > of FiledReferenceExpression and ResolvedExpressions. There are two > >> kinds > >> > of > >> > > use cases: > >> > > > >> > > 1. A ResolvedExpression is constructed by the user or connector / > >> plugin > >> > > developers. > >> > > 2. A ResolvedExpression is constructed by the framework and passed > to > >> > user > >> > > or connector / plugin developers. > >> > > > >> > > For the first case, both of the approaches provide the same > migration > >> > > experience. > >> > > > >> > > For the second case, generally speaking, introducing > >> > > NestedFieldReferenceExpression and extending > FieldReferenceExpression > >> > would > >> > > have the same impact for backwards compatibility. > >> SupportsFilterPushDown > >> > is > >> > > a special case here because understanding the filter expressions is > >> > > optional for the source implementation. In other use cases, if > >> > > understanding the reference to a nested field is a must have, the > user > >> > code > >> > > has to be changed, regardless of which approach we take to support > >> nested > >> > > fields. > >> > > > >> > > Therefore, I think we have to check each public API where the nested > >> > field > >> > > reference is exposed. If we have many public APIs where > understanding > >> > > nested fields is optional for the user / plugin / connector > >> developers, > >> > > having a separate NestedFieldReferenceExpression would have a more > >> smooth > >> > > migration. Otherwise, there seems to be no difference between the > two > >> > > approaches. > >> > > > >> > > Migration path aside, the main reason I prefer extending > >> > > FieldReferenceExpression over a new NestedFieldReferenceExpression > is > >> > > because this makes the SupportsProjectionPushDown interface simpler. > >> > > Otherwise, we have to treat it as a special case that does not match > >> the > >> > > overall API style. Or we have to introduce two different > >> > applyProjections() > >> > > methods for FieldReferenceExpression / > NestedFieldReferenceExpression > >> > > respectively. This issue further extends to implementation in > >> addition to > >> > > public API. A single FieldReferenceExpression might help simplify > the > >> > > implementation code a little bit. For example, in a recursive > >> processing > >> > of > >> > > a row with nested rows, we may not need to switch between > >> > > FieldReferenceExpression and NestedFieldReferenceExpression > depending > >> on > >> > > whether the record being processed is a top level record or nested > >> > record. > >> > > > >> > > Thanks, > >> > > > >> > > Jiangjie (Becket) Qin > >> > > > >> > > > >> > > On Tue, Aug 22, 2023 at 11:43 PM Jark Wu <[email protected]> wrote: > >> > > > >> > > > Hi Becket, > >> > > > > >> > > > I totally agree we should try to have a consistent API for a final > >> > state. > >> > > > The only concern I have mentioned is the "smooth" migration path. > >> > > > The FiledReferenceExpression is widely used in many public APIs, > >> > > > not only in the SupportsFilterPushDown. Yes, we can change every > >> > > > methods in 2-steps, but is it good to change API back and forth > for > >> > this? > >> > > > Personally, I'm fine with a separate > NestedFieldReferenceExpression > >> > > class. > >> > > > TBH, I prefer the separated way because it makes the reference > >> > expression > >> > > > more clear and concise. > >> > > > > >> > > > Best, > >> > > > Jark > >> > > > > >> > > > > >> > > > On Tue, 22 Aug 2023 at 16:53, Becket Qin <[email protected]> > >> wrote: > >> > > > > >> > > > > Thanks for the reply, Jark. > >> > > > > > >> > > > > I think it will be helpful to understand the final state we want > >> to > >> > > > > eventually achieve first, then we can discuss the steps towards > >> that > >> > > > final > >> > > > > state. > >> > > > > > >> > > > > It looks like there are two proposed end states now: > >> > > > > > >> > > > > 1. Have a separate NestedFieldReferenceExpression class; keep > >> > > > > SupportsFilterPushDown and SupportsProjectionPushDown the same. > >> It is > >> > > > just > >> > > > > a one step change. > >> > > > > - Regarding the supportsNestedFilterPushDown() method, if our > >> > > contract > >> > > > > with the connector developer today is "The implementation should > >> > ignore > >> > > > > unrecognized expressions by putting them into the remaining > >> filters, > >> > > > > instead of throwing exceptions". Then there is no need for this > >> > > method. I > >> > > > > am not sure about the current contract. We should probably make > it > >> > > clear > >> > > > in > >> > > > > the interface Java doc. > >> > > > > > >> > > > > 2. Extend the existing FiledReferenceExpression class to support > >> > nested > >> > > > > fields; SupportsFilterPushDown only has one method of > >> > > > > applyFilters(List<ResolvedExpression>); > SupportsProjectionPushDown > >> > only > >> > > > has > >> > > > > one method of applyProjections(List<FieldReferenceExpression>, > >> > > DataType). > >> > > > > It could just be two steps if we are not too obsessed with the > >> exact > >> > > > names > >> > > > > of "applyFilters" and "applyProjections". More specifically, it > >> takes > >> > > two > >> > > > > steps to achieve this final state: > >> > > > > a. introduce a new method > >> > tryApplyFilters(List<ResolvedExpression>) > >> > > > to > >> > > > > SupportsFilterPushDown, which may have FiledReferenceExpression > >> with > >> > > > nested > >> > > > > fields. The default implementation throws an exception. The > >> runtime > >> > > will > >> > > > > first call tryApplyFilters() with nested fields. In case of > >> > exception, > >> > > it > >> > > > > calls the existing applyFilters() without including the nested > >> > filters. > >> > > > > Similarly, in SupportsProjectionPushDown, introduce a > >> > > > > tryApplyProjections<List<NestedFieldReference> method returning > a > >> > > Result. > >> > > > > The Result also contains the accepted and unapplicable > >> projections. > >> > The > >> > > > > default implementation also throws an exception. Deprecate all > the > >> > > other > >> > > > > methods except tryApplyFilters() and tryApplyProjections(). > >> > > > > b. remove the deprecated methods in the next major version > >> bump. > >> > > > > > >> > > > > Now the question is putting the migration steps aside, which end > >> > state > >> > > do > >> > > > > we prefer? While the first end state is acceptable for me, > >> > personally, > >> > > I > >> > > > > prefer the latter if we are designing from scratch. It is clean, > >> > > > consistent > >> > > > > and intuitive. Given the size of Flink, keeping APIs in the same > >> > style > >> > > > over > >> > > > > time is important. The migration is also not that complicated. > >> > > > > > >> > > > > Thanks, > >> > > > > > >> > > > > Jiangjie (Becket) Qin > >> > > > > > >> > > > > > >> > > > > On Tue, Aug 22, 2023 at 2:23 PM Jark Wu <[email protected]> > wrote: > >> > > > > > >> > > > > > Hi Venkat, > >> > > > > > > >> > > > > > Thanks for the proposal. > >> > > > > > > >> > > > > > I have some minor comments about the FLIP. > >> > > > > > > >> > > > > > 1. I think we don't need to > >> > > > > > add SupportsFilterPushDown#supportsNestedFilters() method, > >> > > > > > because connectors can skip nested filters by putting them in > >> > > > > > Result#remainingFilters(). > >> > > > > > And this is backward-compatible because unknown expressions > were > >> > > added > >> > > > to > >> > > > > > the remaining filters. > >> > > > > > Planner should push predicate expressions as more as possible. > >> If > >> > we > >> > > > add > >> > > > > a > >> > > > > > flag for each new filter, > >> > > > > > the interface will be filled with lots of flags (e.g., > >> > > supportsBetween, > >> > > > > > supportsIN). > >> > > > > > > >> > > > > > 2. NestedFieldReferenceExpression#nestedFieldName should be an > >> > array > >> > > of > >> > > > > > field names? > >> > > > > > Each string represents a field name part of the field path. > Just > >> > keep > >> > > > > > aligning with `nestedFieldIndexArray`. > >> > > > > > > >> > > > > > 3. My concern about making FieldReferenceExpression support > >> nested > >> > > > fields > >> > > > > > is the compatibility. > >> > > > > > It is a public API and users/connectors are already using it. > >> > People > >> > > > > > assumed it is a top-level column > >> > > > > > reference, and applied logic on it. But that's not true now > and > >> > this > >> > > > may > >> > > > > > lead to unexpected errors. > >> > > > > > Having a separate NestedFieldReferenceExpression sounds safer > to > >> > me. > >> > > > > Mixing > >> > > > > > them in a class may > >> > > > > > confuse users what's the meaning of getFieldName() and > >> > > > getFieldIndex(). > >> > > > > > > >> > > > > > > >> > > > > > Regarding using NestedFieldReferenceExpression in > >> > > > > > SupportsProjectionPushDown, do you > >> > > > > > have any concerns @Timo Walther <[email protected]> ? > >> > > > > > > >> > > > > > Best, > >> > > > > > Jark > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > On Tue, 22 Aug 2023 at 05:55, Venkatakrishnan Sowrirajan < > >> > > > > [email protected] > >> > > > > > > > >> > > > > > wrote: > >> > > > > > > >> > > > > > > Sounds like a great suggestion, Becket. +1. Agree with > >> cleaning > >> > up > >> > > > the > >> > > > > > APIs > >> > > > > > > and making it consistent in all the pushdown APIs. > >> > > > > > > > >> > > > > > > Your suggested approach seems fine to me, unless anyone else > >> has > >> > > any > >> > > > > > other > >> > > > > > > concerns. Just have couple of clarifying questions: > >> > > > > > > > >> > > > > > > 1. Do you think we should standardize the APIs across all > the > >> > > > pushdown > >> > > > > > > supports like SupportsPartitionPushdown, > >> SupportsDynamicFiltering > >> > > etc > >> > > > > in > >> > > > > > > the end state? > >> > > > > > > > >> > > > > > > The current proposal works if we do not want to migrate > >> > > > > > > > SupportsFilterPushdown to also use > >> > NestedFieldReferenceExpression > >> > > > in > >> > > > > > the > >> > > > > > > > long term. > >> > > > > > > > > >> > > > > > > Did you mean *FieldReferenceExpression* instead of > >> > > > > > > *NestedFieldReferenceExpression*? > >> > > > > > > > >> > > > > > > 2. Extend the FieldReferenceExpression to support nested > >> fields. > >> > > > > > > > - Change the index field type from int to int[]. > >> > > > > > > > >> > > > > > > - Add a new method int[] getFieldIndexArray(). > >> > > > > > > > - Deprecate the int getFieldIndex() method, the code > >> will > >> > be > >> > > > > > removed > >> > > > > > > in > >> > > > > > > > the next major version bump. > >> > > > > > > > >> > > > > > > I assume getFieldIndex would return fieldIndexArray[0], > right? > >> > > > > > > > >> > > > > > > Thanks > >> > > > > > > Venkat > >> > > > > > > > >> > > > > > > On Fri, Aug 18, 2023 at 4:47 PM Becket Qin < > >> [email protected] > >> > > > >> > > > > wrote: > >> > > > > > > > >> > > > > > > > Thanks for the proposal, Venkata. > >> > > > > > > > > >> > > > > > > > The current proposal works if we do not want to migrate > >> > > > > > > > SupportsFilterPushdown to also use > >> > NestedFieldReferenceExpression > >> > > > in > >> > > > > > the > >> > > > > > > > long term. > >> > > > > > > > > >> > > > > > > Did you mean *FieldReferenceExpression* instead of > >> > > > > > > *NestedFieldReferenceExpression*? > >> > > > > > > > >> > > > > > > > > >> > > > > > > > Otherwise, the alternative solution briefly mentioned in > the > >> > > > rejected > >> > > > > > > > alternatives would be the following: > >> > > > > > > > Phase 1: > >> > > > > > > > 1. Introduce a supportsNestedFilters() method to the > >> > > > > > > SupportsFilterPushdown > >> > > > > > > > interface. (same as current proposal). > >> > > > > > > > 2. Extend the FieldReferenceExpression to support nested > >> > fields. > >> > > > > > > > - Change the index field type from int to int[]. > >> > > > > > > > >> > > > > > > - Add a new method int[] getFieldIndexArray(). > >> > > > > > > > - Deprecate the int getFieldIndex() method, the code > >> will > >> > be > >> > > > > > removed > >> > > > > > > in > >> > > > > > > > the next major version bump. > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > 3. In the SupportsProjectionPushDown interface > >> > > > > > > > - add a new method > >> > > > > applyProjection(List<FieldReferenceExpression>, > >> > > > > > > > DataType), with default implementation invoking > >> > > > > > applyProjection(int[][], > >> > > > > > > > DataType) > >> > > > > > > > - deprecate the current applyProjection(int[][], > >> DataType) > >> > > > method > >> > > > > > > > > >> > > > > > > > Phase 2 (in the next major version bump) > >> > > > > > > > 1. remove the deprecated methods. > >> > > > > > > > > >> > > > > > > > Phase 3 (optional) > >> > > > > > > > 1. deprecate and remove the supportsNestedFilters() / > >> > > > > > > > supportsNestedProjection() methods from the > >> > > SupportsFilterPushDown > >> > > > / > >> > > > > > > > SupportsProjectionPushDown interfaces. > >> > > > > > > > > >> > > > > > > > Personally I prefer this alternative. It takes longer to > >> finish > >> > > the > >> > > > > > work, > >> > > > > > > > but the API eventually becomes clean and consistent. But I > >> can > >> > > live > >> > > > > > with > >> > > > > > > > the current proposal. > >> > > > > > > > > >> > > > > > > > Thanks, > >> > > > > > > > > >> > > > > > > > Jiangjie (Becket) Qin > >> > > > > > > > > >> > > > > > > > On Sat, Aug 19, 2023 at 12:09 AM Venkatakrishnan > Sowrirajan > >> < > >> > > > > > > > [email protected]> wrote: > >> > > > > > > > > >> > > > > > > > > Gentle ping for reviews/feedback. > >> > > > > > > > > > >> > > > > > > > > On Tue, Aug 15, 2023, 5:37 PM Venkatakrishnan > Sowrirajan < > >> > > > > > > > [email protected] > >> > > > > > > > > > > >> > > > > > > > > wrote: > >> > > > > > > > > > >> > > > > > > > > > Hi All, > >> > > > > > > > > > > >> > > > > > > > > > I am opening this thread to discuss FLIP-356: Support > >> > Nested > >> > > > > Fields > >> > > > > > > > > > Filter Pushdown. The FLIP can be found at > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-356*3A*Support*Nested*Fields*Filter*Pushdown__;JSsrKysr!!IKRxdwAv5BmarQ!clxXJwshKpn559SAkQiieqgGe0ZduXCzUKCmYLtFIbQLmrmEEgdmuEIM8ZM1M3O_uGqOploU4ailqGpukAg$ > >> > > > > > > > > > > >> > > > > > > > > > This FLIP adds support for pushing down nested fields > >> > filters > >> > > > to > >> > > > > > the > >> > > > > > > > > > underlying TableSource. In our data lake, we find a > lot > >> of > >> > > > > datasets > >> > > > > > > > have > >> > > > > > > > > > nested fields and also user queries with filters > >> defined on > >> > > the > >> > > > > > > nested > >> > > > > > > > > > fields. This would drastically improve the performance > >> for > >> > > > those > >> > > > > > sets > >> > > > > > > > of > >> > > > > > > > > > queries. > >> > > > > > > > > > > >> > > > > > > > > > Appreciate any comments or feedback you may have on > this > >> > > > > proposal. > >> > > > > > > > > > > >> > > > > > > > > > Regards > >> > > > > > > > > > Venkata krishnan > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > > >
