Re: [EXTERNAL] Re: [DISCUSS] Column to Column filtering

2024-10-03 Thread Benny Chow
Assuming the table contained smaller and better correlated files, I think a workaround where you materialized the timestamp difference between two columns could be effective for data file pruning. So if a particular planned departure date was associated with a lot of delays and the table was parti

Re: [DISCUSS] Iceberg Materialzied Views

2024-10-01 Thread Benny Chow
uire the lineage, I would propose to move ahead > without the lineage. Especially as this seems to be a problem with the View > Spec that we can't solve now. If there is a demand to add the lineage in > the future, once the catalog-alias problem has been solved, we can still > add it

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-27 Thread Benny Chow
mer must understand the dialect anyway. In fact, >> simply parsing the SQL definition seems like a more robust and >> straightforward solution than using a lineage for every representation. I >> believe this is why Benny suggested reverting to SQL parsing, and I agree >> with

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-19 Thread Benny Chow
rage table identifier was provided as part of the > MV definition? Sounds like a not very ideal UX. Note that it also conflicts > with the spirit of requirement #3. > > Thanks, > Walaa. > > On Thu, Sep 19, 2024 at 10:02 AM Benny Chow wrote: > >> Hi Jan >> >>

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-19 Thread Benny Chow
with a namespace and a > name field, like so: > > { > > namespace: ["bronze"], > > name: "lineitem" > > } > > And require the storage table to be in the same catalog as the MV itself? > > Thanks, > > Jan > On 19.09.24 00

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-18 Thread Benny Chow
two Nessie catalogs? They can't both > be called LocalNessie. > > Thanks, > > Jan > On 14.09.24 01:23, Benny Chow wrote: > > The main reason for putting the lineage into the view is so that "another" > engine can enumerate out the tables in the view withou

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-13 Thread Benny Chow
ame" of the >> identifier for a "Spark" dialect can be different than for a "Dremio" >> dialect. >> >> The important part is that we still have a list of identifiers for each >> representation that we can use with the catalog to obtain the state

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-10 Thread Benny Chow
se you can't store the catalog names of multiple representations >>>> in the lineage. You would need to fall back to parsing the SQL for a >>>> particular representation and rebuilding the full query tree to obtain the >>>> identifiers. >>>> >>>

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-08 Thread Benny Chow
Benny, `default-catalog` is optional, while `default-namespace` is >> required. >> >> I will retract my comment on the `summary`. It indicates the engine that >> made the revision to the current view version. It doesn't really matter for >> multi-engine/repres

Re: [DISCUSS] Iceberg Materialzied Views

2024-09-06 Thread Benny Chow
Hi Steven Yes, I definitely think #2 is easier and cleaner for both reader and writer and that lineage is a separate feature altogether. There's no need to couple materialization state with view lineage. The other way to look at helping to decide between the two options is what is the most per

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-19 Thread Benny Chow
>>> For a refresh operation the query engine has to parse the SQL and >>>>>> fully expand the lineage with its children anyway. So the lineage is >>>>>> not >>>>>> strictly required. >>>>>> >>>>>

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Benny Chow
ineage record that is stored as part of the view metadata? >>> >>> No, I don't think so, I think #5 is a reasonable requirement and I think >>> this violates it. >>> >>> >>>> 2. If yes, should the lineage in the view be fully expanded

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Benny Chow
If we go with either UUID or Table Identifier + VersionID/SnapshotId in the refresh state, then this list is fully expanded already. So, to validate the freshness of a materialization, the engine doesn't even need to look at the view lineage. IMO, the view lineage is nice to have but not a necess

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-14 Thread Benny Chow
refer it because we did not want to leak the SQL identifiers to the >> storage table since SQL identifiers are view concepts and fit better with >> the view. >> >> Thanks, >> Walaa. >> >> On Thu, Aug 8, 2024 at 4:12 PM Benny Chow wrote: >> >>> Maybe a th

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-08 Thread Benny Chow
Maybe a third option is to decouple the view lineage and materialization state. The view lineage can just list out the SQL identifiers+ref... we can still decide whether this is just direct children or fully expanded. The materialization state doesn't have to depend on the view lineage (through ei

Re: Iceberg MV Refresh

2024-06-24 Thread Benny Chow
ccessed, which may not be the > case in some producer/consumer scenarios. > > Best > PF > > > > > On Fri, 21 Jun 2024 at 18:28, Benny Chow wrote: > >> Hi Dan, looks like it is pretty common across engines and sometimes part >> of the engine specific DDL op

Re: Iceberg MV Refresh

2024-06-21 Thread Benny Chow
>> Thanks Benny for bringing these issues up. I would agree with both of >> your propositions. >> >> Regarding the naming of the fields, we can go with the naming that you >> suggested. I just wanted to wait if some more people chime in with their >> opinions. >>

Re: Iceberg MV Refresh

2024-06-20 Thread Benny Chow
domain knowledge), or would it be on the user's side? In the latter case > the user would need to explicitly query the storage table directly, > correct? With a grace period I think we could push it down to the engine. > > Thanks, > Walaa. > > > On Thu, Jun 20, 2024 a

Re: Iceberg MV Refresh

2024-06-20 Thread Benny Chow
t; unlimited) >> - staleness clock starts with the first table change after refresh >> - for unmanaged (non-iceberg) tables where we don't know when the table >> changed, the staleness clock starts right after refresh >> >> Best >> Piotr >> >> >> >

Iceberg MV Refresh

2024-06-19 Thread Benny Chow
Hey Guys, Great progress on the MV spec and thanks a ton to Jan and Walaa for driving this. One of our latest achievements was that we finalized the view lineage and materialization table refresh JSON so that we can definitively and concisely describe what data is in the materialization table. R

Re: Summary of Iceberg Materialized View Meeting

2024-06-07 Thread Benny Chow
was produced by >> the update). We will follow up on that separately. >> >> Jan, do you want to reflect the lineage + state discussion in the doc >> so we can iterate on the lineage JSON structure? >> >> Thanks, >> Walaa. >> >> >> On

Re: Summary of Iceberg Materialized View Meeting

2024-06-06 Thread Benny Chow
I really enjoyed listening to the replay and hearing everyone's feedback! I'm in agreement with all 3 consensus items, especially around Dan's idea to separate the view's query tree lineage vs materialization's lineage state. I'll summarize my understanding about the distinction and add a few comm

Re: Iceberg Materialized View Meeting

2024-06-03 Thread Benny Chow
Thanks for organizing Jan. I’ll be there! Benny > On Jun 3, 2024, at 11:15 PM, Jan Kaul wrote: > >  > Hi all, > > we will have a video call to get together and discuss Iceberg Materialized > Views. The call is on Wednesday, 5 June 2024, 16:00:00 UTC (9:00 PDT) and you > can join the meeti

Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

2024-05-28 Thread Benny Chow
It's interesting to note that a tabular SQL UDF can be used to build a *parameterized* view. So, there's definitely a lot in common between UDFs and views. Thanks On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa wrote: > I think there is a disconnect about what is perceived as a "UDF". The

Re: Materialized Views: Next Steps

2024-05-17 Thread Benny Chow
1:35 PM, Walaa Eldin Moustafa wrote: Sounds good. I am assuming we agree it is not required for either snapshot or timestamp? Thanks, Walaa. On Fri, May 17, 2024 at 1:17 PM Benny Chow <btc...@gmail.com> wrote: I like Jack's suggestions to capture the ref type and value! When the ref typ

Re: Materialized Views: Next Steps

2024-05-16 Thread Benny Chow
drift). > > If we have feedback on the actual properties used in the properties model > as defined in the PR, we can have the discussion there. > > Thanks, > Walaa. > > > On Thu, May 16, 2024 at 3:22 PM Benny Chow wrote: > >> Hi Walaa >> >> I left co

Re: Materialized Views: Next Steps

2024-05-16 Thread Benny Chow
ceberg metadata fields > as engine properties) just for the lack of other cleaner options does not > sound like a good idea in both short and long term. > > Let me know your thoughts. > > Thanks, > Walaa. > > > > On Tue, May 14, 2024 at 5:12 PM Benny Chow wrote: >

Re: Materialized Views: Next Steps

2024-05-14 Thread Benny Chow
I agree with Szehon here.  I think storing the materialization lineage as a bunch of properties is brittle.  This lineage information is needed by engines to validate the staleness of a materialization and also to perform full or incremental refreshes.  There’s a lot to capture here. Maybe we shoul

Re: [Proposal] Add support for Materialized Views in Iceberg

2024-04-18 Thread Benny Chow
+1 for separate view and table objects. Walaa's Spark implementation demonstrates how little change it takes on the Iceberg APIs to start sharing MVs between engines. Thanks Benny On Thu, Apr 18, 2024 at 9:52 AM Walaa Eldin Moustafa wrote: > Hi everyone, > > I would like to make a proposal for

Re: Materialized view integration with REST spec

2024-03-25 Thread Benny Chow
Hi Manu This is Walaa's Spark implementation for option 1: https://github.com/apache/iceberg/pull/9830/files/a9e1bee3b5bf5914e5330d3b195042aea33868c9 There's no code for option 2 yet. Best Benny On Mon, Mar 25, 2024 at 12:37 AM Manu Zhang wrote: > Thanks Walaa for the summary. It's unclear to

Re: MV Query Planning Use Case

2024-03-09 Thread Benny Chow
for your message. I think the idea is to "smoothly" > (implicitly) add regular table storage over a view. > > The MV approach is right now in discussion, without consensus so far. > We plan to have document/meeting to discuss further. > > Regards > JB > > On

MV Query Planning Use Case

2024-03-07 Thread Benny Chow
Hey Everyone I've been following the MV spec and listened in on the last community sync. I'd like to chime in from a query planner point of view on how the MVs could be used. Suppose a user has a dashboard query like: *SELECT product, sum(sales) * *FROM view1 * *WHERE brand = 'X' and year = '20