Re: Proposal for Implementation of the Proposed Iceberg REST Spec - Events API
I was going to make the same suggestion :) For deployments where the complexity of an extra buffer doesn’t make sense, a blocking persistence call will make things easier for the services owner. If read events are added, a high throughput service will need the buffering probably. Though , I would be surprised if those events end up being served to callers for the /events API. For read events, I think something like OpenLineage event reporting is a better option and most callers will want to know about modifications, so filtering all the reads to find the valuable changes is a headache I would guess most service owners would rather avoid. Mike On Thu, May 22, 2025 at 12:27 AM Adnan Hemani wrote: > I haven’t thought of this in depth honestly - but I could see this being > the case. > > -Adnan > > > On May 21, 2025, at 4:39 PM, Eric Maynard > wrote: > > > > -devlist > > > > If we design by interface properly, it should be relatively easy to offer > > both a disk buffering and an always-write implementation right? > > > > On Thu, May 22, 2025 at 12:12 AM Adnan Hemani > > wrote: > > > >> Hi all, > >> > >> Thanks for sharing these thoughts. I’m also not completely sure about > how > >> much we should care about how much slower things will be if we just > make a > >> trip to the persistence on every write action. However, I’m building > this > >> feature with the intention of being able to also support read event > types > >> in the near future, if this is something that the customer is > interested in > >> enabling using the `CustomOperation` type that is defined in the Events > API > >> spec. Of course, this would need to be configured by the administrator, > as > >> maintenance of the persistence is their responsibility. > >> > >> Given that the Iceberg Events API spec has not yet merged and can still > >> see some changes, I’m planning to begin work on the disk buffering now > and > >> wait for the Events API to finalize before working on the API side of > the > >> end-to-end implementation. > >> > >> Best, > >> Adnan Hemani > >> > >>> On May 20, 2025, at 10:30 PM, Michael Collado > >> wrote: > >>> > >>> That’s super interesting. Glad this is being worked on. Personally, I > >> don’t > >>> know that the latency for writing events to a persistent storage is > >> really > >>> all that concerning. Looking at the enum of supported operations, only > >>> write operations seem to trigger the event. It’s not like every read > >>> request issues a new event. Given that the request latency here is > >>> dominated by cloud storage calls, do we really care about one extra > call > >> to > >>> Postgres? Personally, I’d skip the extra complexity of a buffer of any > >> kind > >>> and just write straight to the persistence store. > >>> > >>> Mike > >>> > >>> On Tue, May 20, 2025 at 9:31 AM Yufei Gu >> flyrain...@gmail.com>> wrote: > >>> > Looks awesome. Thanks for taking the lead! It makes sense to use a > JDBC-backed persistence layer, shared or separate. The optional > >> retention > period is a nice safeguard. > I don’t see any blockers on my side. If no one raises major concerns > >> this > week, please go ahead and start the implementation. Exciting to see > this > coming together! > > Yufei > > > On Tue, May 13, 2025 at 6:37 PM Adnan Hemani > wrote: > > > Hi all, > > > > I am raising a proposal to implement the proposed Iceberg REST > > specification for the Events API (doc < > > > > >> > https://www.google.com/url?q=https://www.google.com/url?q%3Dhttps://docs.google.com/document/d/1WtIsNGVX75-_MsQIOJhXLAWg6IbplV4-DkLllQEiFT8/edit?pli%253D1%2526tab%253Dt.0%26source%3Dgmail-imap%26ust%3D174841032700%26usg%3DAOvVaw0UNQYKbXoQ2YHVM7J0kB3l&source=gmail-imap&ust=174847561300&usg=AOvVaw38VwFquueiVwCdlo_xR2rB > > , > > GH < > >> > https://www.google.com/url?q=https://www.google.com/url?q%3Dhttps://github.com/apache/iceberg/pull/12584/files%26source%3Dgmail-imap%26ust%3D174841032700%26usg%3DAOvVaw1AvwLK402voAm_j6zy25Mn&source=gmail-imap&ust=174847561300&usg=AOvVaw2IwxUir8jCegcKC47zv1Si > >). > >> It is my > > understanding that this proposal is close and that we will be > required > >> to > > implement something very close to the current proposal in the near > future. > > > > If Polaris is to implement this API, it will likely need to be > through > >> a > > Persistence instance that the Polaris server can query instantly, as > >> this > > API will not be asynchronous. Please note, this proposal is not to > comment > > on what events we may emit today or in the future - the scope of this > > proposal is solely to discuss how we plan to implement the proposed > Events > > API. > > > > Changes to be made: > > > > Implement Event storage through the Polaris Persistence layer > > > > We will store events in a persistence instance of
Re: [Discuss] Add `location` to generic table spec
Yun's proposal (quoted below) looks acceptable to me. I see that Eric commented about multiple locations in his reply. Accounting for multiple locations upfront would make sense to me too (I believe I mentioned that earlier), but I'm also fine with a singular location for a start, if it helps with getting a POC implementation off the ground. Re: ForbiddenException - I'm not sure this exact name is correct for location overlap errors. Forbidden in HTTP/REST usually means auth or access errors. I'd suggest Conflict. Cheers, Dmitri. On Wed, May 21, 2025 at 7:32 PM yun zou wrote: > Hi All, > > Want to summarize the thread here: > > For generic tables, we will add a `location` key to help cross engine > sharing and future support for credential vending. > > Here is a description about this `location` key and corresponding > restrictions and responsibilities: > - `location`(OPTIONAL): table root location in URI format. For example: > s3:///path/to/table. > - The table root location is a location that includes all files for the > table. > - Clients (engines) are responsible to make sure all files are written > under the configured location. > - A table with multiple root locations (i.e. containing files that are > outside the configured root location) is not compliant with the current > generic table support in Polaris. > - No two tables can have the same or overlapped location, otherwise, a > ForbiddenException will be thrown on creation. > - If no location is provided, clients or users are responsible to manage > the location and location related concerns such as path conflict check etc. > - The location configuration can not be updated once the table is > created. > > This description will be added into the spec. In order to help non-API > users to discover the information easily, we will also get a site page to > describe the support > for Generic Table and key fields. > > Best Regards, > Yun > > On Mon, May 19, 2025 at 11:16 PM yun zou > wrote: > > > Hi Dmitri, > > > > " I do not think those doc comments provide enough visibility to ensure > > that the key information > > is received by users, unless they are dealing directly with the API" > > -- Yeah, I agree those information may not be visible enough for users > who > > don't directly work with APIs. > > However, I think just having one page for "location" might be a little > bit > > overkill. Given that generic table API support is > > a new catalog capabilities that Polaris added which is not IRC, I think > it > > might worth having a more general page to > > describe the Polaris Generic Table support and describe some of the > > critical fields like *location*. > > I think we should have the description in the spec also, so that things > > could be clear for API users. > > > > Please let me know what you think. > > > > Best Regards, > > Yun > > > > On Mon, May 19, 2025 at 4:22 PM Dmitri Bourlatchkov > > wrote: > > > >> I believe the Open API spec and the definition of "location" are > slightly > >> different concerns. > >> > >> The former is about the API used to obtain information about Generic > >> Tables. > >> > >> The latter is about the interpretation of that information. One can > think > >> of the location > >> value being handled / transferred beyond the immediate Polaris client, > in > >> which case > >> is loses its connection to the API, but does not lose its meaning as a > >> location of a > >> Generic Table. > >> > >> Also, I think that Open API doc comments are too low-level and too > obscure > >> for > >> people who will work with processing actual Generic Table files. I do > not > >> think > >> those doc comment provide enough visibility to ensure that the key > >> information > >> is received by users, unless they are dealing directly with the API. > >> > >> That said, if you prefer to keep the finer points about Generic Table > >> locations in the > >> Open API spec, I'd be fine with that. > >> > >> Cheers, > >> Dmitri. > >> > >> On Mon, May 19, 2025 at 6:46 PM yun zou > >> wrote: > >> > >> > Hi Dmitri, > >> > > >> > Thanks for the detailed explanation, I definitely agree we need to > call > >> out > >> > those restrictions and compliance in our Spec. > >> > > >> > As for the documentation, Polaris today already publishes the API > spec, > >> if > >> > you go to page https://polaris.apache.org/in-dev/unreleased/, > >> > and click on the Catalog API Spec, it will lead you to the published > >> Spec, > >> > which contains all description in the Spec. > >> > That basically means we have both published doc and spec code, and the > >> > single source of truth is the description in the doc. > >> > or do you think we should have an extra page for the Generic Table API > >> > spec? > >> > > >> > Best Regards, > >> > Yun > >> > > >> > On Mon, May 19, 2025 at 3:20 PM Yufei Gu > wrote: > >> > > >> > > > > >> > > > * Clients (engines) are responsible for writing files only under > the > >> > > > specified location. > >> > >
Re: [Discuss] Add `location` to generic table spec
Hi Yun, [...] I think just having one page for "location" might be a little bit overkill. Given that generic table API support is a new catalog capabilities that Polaris added which is not IRC, I think it might worth having a more general page to describe the Polaris Generic Table support and describe some of the critical fields like *location*. This is exactly what I meant: adding a doc page for the Generic Table feature as a whole. Cheers, Dmitri. On Tue, May 20, 2025 at 2:16 AM yun zou wrote: > Hi Dmitri, > > " I do not think those doc comments provide enough visibility to ensure > that the key information > is received by users, unless they are dealing directly with the API" > -- Yeah, I agree those information may not be visible enough for users who > don't directly work with APIs. > However, I think just having one page for "location" might be a little bit > overkill. Given that generic table API support is > a new catalog capabilities that Polaris added which is not IRC, I think it > might worth having a more general page to > describe the Polaris Generic Table support and describe some of the > critical fields like *location*. > I think we should have the description in the spec also, so that things > could be clear for API users. > > Please let me know what you think. > > Best Regards, > Yun > > On Mon, May 19, 2025 at 4:22 PM Dmitri Bourlatchkov > wrote: > > > I believe the Open API spec and the definition of "location" are slightly > > different concerns. > > > > The former is about the API used to obtain information about Generic > > Tables. > > > > The latter is about the interpretation of that information. One can think > > of the location > > value being handled / transferred beyond the immediate Polaris client, in > > which case > > is loses its connection to the API, but does not lose its meaning as a > > location of a > > Generic Table. > > > > Also, I think that Open API doc comments are too low-level and too > obscure > > for > > people who will work with processing actual Generic Table files. I do not > > think > > those doc comment provide enough visibility to ensure that the key > > information > > is received by users, unless they are dealing directly with the API. > > > > That said, if you prefer to keep the finer points about Generic Table > > locations in the > > Open API spec, I'd be fine with that. > > > > Cheers, > > Dmitri. > > > > On Mon, May 19, 2025 at 6:46 PM yun zou > > wrote: > > > > > Hi Dmitri, > > > > > > Thanks for the detailed explanation, I definitely agree we need to call > > out > > > those restrictions and compliance in our Spec. > > > > > > As for the documentation, Polaris today already publishes the API spec, > > if > > > you go to page https://polaris.apache.org/in-dev/unreleased/, > > > and click on the Catalog API Spec, it will lead you to the published > > Spec, > > > which contains all description in the Spec. > > > That basically means we have both published doc and spec code, and the > > > single source of truth is the description in the doc. > > > or do you think we should have an extra page for the Generic Table API > > > spec? > > > > > > Best Regards, > > > Yun > > > > > > On Mon, May 19, 2025 at 3:20 PM Yufei Gu wrote: > > > > > > > > > > > > > * Clients (engines) are responsible for writing files only under > the > > > > > specified location. > > > > > > > > It's nice to have a doc like that. But the open API spec is *the* > place > > > to > > > > define the behavior of client and server, and how they interact with > > each > > > > other. Just as we said before, spec change is recommended to have a > ML > > > > discussion. > > > > > > > > * A table, whose files exist outside the declared location, is not > > > > > compliant with the Polaris' definition for a Generic Table. > > > > > > > > I'm not sure we should go that far. "location" is an optional field. > > It's > > > > just some features like credential vending that don't work if > > "location" > > > is > > > > missing. > > > > > > > > Yufei > > > > > > > > > > > > On Mon, May 19, 2025 at 2:59 PM Dmitri Bourlatchkov < > di...@apache.org> > > > > wrote: > > > > > > > > > As I commented in my other recent email, I think by introducing a > > > > > "location" property Polaris enters the realm of table format specs. > > > > > > > > > > This is fine, from my POV, however, since Polaris is the defining > > > project > > > > > behind that property, I believe Polaris should provide a more > > > definitive > > > > > description of the meaning and intended processing of that > property. > > > > > > > > > > To repeat myself, I think the Open API spec defines only the API > for > > > > > obtaining the location. We need a place to define what this > location > > > > means. > > > > > I do not insist on calling this a "spec" for Generic Tables, but I > > > think > > > > it > > > > > deserves a separate page in Polaris docs, where it would be defined > > > with > > > > > more rigor. > > > > > > > >
Re: [Discuss] Add `location` to generic table spec
> Can we keep it simple for v1 [...] What is v1 in this context? Thanks, Dmitri. On Wed, May 21, 2025 at 8:42 PM Yufei Gu wrote: > Can we keep it simple for v1, as one location field is enough for today’s > use cases? And we can revisit multi-location support when there’s real > demand. > > The current API spec already implies that a table’s location is immutable, > there’s no “alter location” call. I’m fine leaving it implicit, but we > could add an explicit note to make that clear if it helps avoid confusion. > > Yufei > > > On Wed, May 21, 2025 at 4:36 PM Eric Maynard > wrote: > > > No two tables globally can have a location overlap? That’s a stricter > > requirement than we have for even Iceberg tables and doesn’t sound > correct. > > > > Similarly, the restriction that you can’t change location is stricter > than > > what we have for Iceberg. > > > > Finally, I’m still not sure what the problem is with having multiple > > locations. Again, we already track multiple locations for Iceberg. > > > > On Thu, May 22, 2025 at 12:32 AM yun zou > > wrote: > > > > > Hi All, > > > > > > Want to summarize the thread here: > > > > > > For generic tables, we will add a `location` key to help cross engine > > > sharing and future support for credential vending. > > > > > > Here is a description about this `location` key and corresponding > > > restrictions and responsibilities: > > > - `location`(OPTIONAL): table root location in URI format. For example: > > > s3:///path/to/table. > > > - The table root location is a location that includes all files for > the > > > table. > > > - Clients (engines) are responsible to make sure all files are > written > > > under the configured location. > > > - A table with multiple root locations (i.e. containing files that > are > > > outside the configured root location) is not compliant with the current > > > generic table support in Polaris. > > > - No two tables can have the same or overlapped location, otherwise, > a > > > ForbiddenException will be thrown on creation. > > > - If no location is provided, clients or users are responsible to > > manage > > > the location and location related concerns such as path conflict check > > etc. > > > - The location configuration can not be updated once the table is > > > created. > > > > > > This description will be added into the spec. In order to help non-API > > > users to discover the information easily, we will also get a site page > to > > > describe the support > > > for Generic Table and key fields. > > > > > > Best Regards, > > > Yun > > > > > > On Mon, May 19, 2025 at 11:16 PM yun zou > > > wrote: > > > > > > > Hi Dmitri, > > > > > > > > " I do not think those doc comments provide enough visibility to > ensure > > > > that the key information > > > > is received by users, unless they are dealing directly with the API" > > > > -- Yeah, I agree those information may not be visible enough for > users > > > who > > > > don't directly work with APIs. > > > > However, I think just having one page for "location" might be a > little > > > bit > > > > overkill. Given that generic table API support is > > > > a new catalog capabilities that Polaris added which is not IRC, I > think > > > it > > > > might worth having a more general page to > > > > describe the Polaris Generic Table support and describe some of the > > > > critical fields like *location*. > > > > I think we should have the description in the spec also, so that > things > > > > could be clear for API users. > > > > > > > > Please let me know what you think. > > > > > > > > Best Regards, > > > > Yun > > > > > > > > On Mon, May 19, 2025 at 4:22 PM Dmitri Bourlatchkov < > di...@apache.org> > > > > wrote: > > > > > > > >> I believe the Open API spec and the definition of "location" are > > > slightly > > > >> different concerns. > > > >> > > > >> The former is about the API used to obtain information about Generic > > > >> Tables. > > > >> > > > >> The latter is about the interpretation of that information. One can > > > think > > > >> of the location > > > >> value being handled / transferred beyond the immediate Polaris > client, > > > in > > > >> which case > > > >> is loses its connection to the API, but does not lose its meaning > as a > > > >> location of a > > > >> Generic Table. > > > >> > > > >> Also, I think that Open API doc comments are too low-level and too > > > obscure > > > >> for > > > >> people who will work with processing actual Generic Table files. I > do > > > not > > > >> think > > > >> those doc comment provide enough visibility to ensure that the key > > > >> information > > > >> is received by users, unless they are dealing directly with the API. > > > >> > > > >> That said, if you prefer to keep the finer points about Generic > Table > > > >> locations in the > > > >> Open API spec, I'd be fine with that. > > > >> > > > >> Cheers, > > > >> Dmitri. > > > >> > > > >> On Mon, May 19, 2025 at 6:46 PM yun zou > > > > >> wrote: > > >
[HEADSUP] Apache Polaris 0.10.0-beta-incubating rc4 prep
Hi everyone Just a quick update: I’m preparing rc4. The vote will start tomorrow. Thanks ! Regards JB
Re: [Discuss] Add `location` to generic table spec
Inlined. On Thu, May 22, 2025 at 7:48 AM Dmitri Bourlatchkov wrote: > > Can we keep it simple for v1 [...] > > What is v1 in this context? > I meant as the first iteration, sorry for the confusion. > > Thanks, > Dmitri. > > On Wed, May 21, 2025 at 8:42 PM Yufei Gu wrote: > > > Can we keep it simple for v1, as one location field is enough for today’s > > use cases? And we can revisit multi-location support when there’s real > > demand. > > > > The current API spec already implies that a table’s location is > immutable, > > there’s no “alter location” call. I’m fine leaving it implicit, but we > > could add an explicit note to make that clear if it helps avoid > confusion. > > > > Yufei > > > > > > On Wed, May 21, 2025 at 4:36 PM Eric Maynard > > wrote: > > > > > No two tables globally can have a location overlap? That’s a stricter > > > requirement than we have for even Iceberg tables and doesn’t sound > > correct. > > > > > > Similarly, the restriction that you can’t change location is stricter > > than > > > what we have for Iceberg. > > > > > > Finally, I’m still not sure what the problem is with having multiple > > > locations. Again, we already track multiple locations for Iceberg. > > > > > > On Thu, May 22, 2025 at 12:32 AM yun zou > > > wrote: > > > > > > > Hi All, > > > > > > > > Want to summarize the thread here: > > > > > > > > For generic tables, we will add a `location` key to help cross engine > > > > sharing and future support for credential vending. > > > > > > > > Here is a description about this `location` key and corresponding > > > > restrictions and responsibilities: > > > > - `location`(OPTIONAL): table root location in URI format. For > example: > > > > s3:///path/to/table. > > > > - The table root location is a location that includes all files for > > the > > > > table. > > > > - Clients (engines) are responsible to make sure all files are > > written > > > > under the configured location. > > > > - A table with multiple root locations (i.e. containing files that > > are > > > > outside the configured root location) is not compliant with the > current > > > > generic table support in Polaris. > > > > - No two tables can have the same or overlapped location, > otherwise, > > a > > > > ForbiddenException will be thrown on creation. > > > > - If no location is provided, clients or users are responsible to > > > manage > > > > the location and location related concerns such as path conflict > check > > > etc. > > > > - The location configuration can not be updated once the table is > > > > created. > > > > > > > > This description will be added into the spec. In order to help > non-API > > > > users to discover the information easily, we will also get a site > page > > to > > > > describe the support > > > > for Generic Table and key fields. > > > > > > > > Best Regards, > > > > Yun > > > > > > > > On Mon, May 19, 2025 at 11:16 PM yun zou > > > > > wrote: > > > > > > > > > Hi Dmitri, > > > > > > > > > > " I do not think those doc comments provide enough visibility to > > ensure > > > > > that the key information > > > > > is received by users, unless they are dealing directly with the > API" > > > > > -- Yeah, I agree those information may not be visible enough for > > users > > > > who > > > > > don't directly work with APIs. > > > > > However, I think just having one page for "location" might be a > > little > > > > bit > > > > > overkill. Given that generic table API support is > > > > > a new catalog capabilities that Polaris added which is not IRC, I > > think > > > > it > > > > > might worth having a more general page to > > > > > describe the Polaris Generic Table support and describe some of the > > > > > critical fields like *location*. > > > > > I think we should have the description in the spec also, so that > > things > > > > > could be clear for API users. > > > > > > > > > > Please let me know what you think. > > > > > > > > > > Best Regards, > > > > > Yun > > > > > > > > > > On Mon, May 19, 2025 at 4:22 PM Dmitri Bourlatchkov < > > di...@apache.org> > > > > > wrote: > > > > > > > > > >> I believe the Open API spec and the definition of "location" are > > > > slightly > > > > >> different concerns. > > > > >> > > > > >> The former is about the API used to obtain information about > Generic > > > > >> Tables. > > > > >> > > > > >> The latter is about the interpretation of that information. One > can > > > > think > > > > >> of the location > > > > >> value being handled / transferred beyond the immediate Polaris > > client, > > > > in > > > > >> which case > > > > >> is loses its connection to the API, but does not lose its meaning > > as a > > > > >> location of a > > > > >> Generic Table. > > > > >> > > > > >> Also, I think that Open API doc comments are too low-level and too > > > > obscure > > > > >> for > > > > >> people who will work with processing actual Generic Table files. I > > do > > > > not > > > > >> think > > > > >> those doc comment provid
Re: [DISCUSS] Standardize nullness annotations
Generally I agree on the proposal as well. Although I admit that I'm on the fence there - neither pro nor con. What I think would be quite useful is that the IDE respects the non-null-default - not sure how good jspecify's @NullMarked/NullUnmarked is being supported nowadays. There was another one in jsr305 - but jsr305 is history nowadays. IIRC it didn't work for Java modules a while back, useful to only specify that annotation once for a whole module and not for every package. WRT REST, I think we can get away with jakarta-validation. Also agree that Optional is generally fine and the preferred way. There can be legit exceptions though. On 22.05.25 18:12, Dmitri Bourlatchkov wrote: I agree with what Alex proposed (quoted below). Re: "assuming non-null by default". I believe this was intended in the spirit of code annotations, meaning that we need not use @Nonnull everywhere. I do not think this was intended to mean that code does not have to check that the value is not null if there's reason to believe it might be. That said, I think it makes sense to use @Nullable to indicate values that can be null during normal execution. I'd propose to use @Nonnull only to remove uncertainty (this should rarely be needed). As for null checks, I think it makes sense to perform them when parameters are accepted from users or external calls (REST API, config), but otherwise code can propagate values normally. If the value is nullable use @Nullable otherwise no annotation is needed. Re: Optional: My take is that it makes flowable coding easier (.map() instead of `if`), so I think it is fine to use it both in parameter and return values. I do not think we need to "force" its usage, though. Cheers, Dmitri. On Wed, May 21, 2025 at 10:02 AM Alex Dutra wrote: Hi all, A while ago, we had a discussion regarding which nullness annotations to use and whether we should consider favoring non-null by default. I would like to revive that discussion. We are currently using the `jakarta.annotation` package consistently, but the defaults are not clear: should we consider everything as non-null by default and only annotate the nullable things, or the other way around? Some classes are cluttered with both annotations, which seems unnecessary and confusing. I would personally be in favor of considering everything as non-null by default. Please let me know your thoughts. Alex -- Robert Stupp @snazy
Re: [DISCUSS] Standardize nullness annotations
I agree with what Alex proposed (quoted below). Re: "assuming non-null by default". I believe this was intended in the spirit of code annotations, meaning that we need not use @Nonnull everywhere. I do not think this was intended to mean that code does not have to check that the value is not null if there's reason to believe it might be. That said, I think it makes sense to use @Nullable to indicate values that can be null during normal execution. I'd propose to use @Nonnull only to remove uncertainty (this should rarely be needed). As for null checks, I think it makes sense to perform them when parameters are accepted from users or external calls (REST API, config), but otherwise code can propagate values normally. If the value is nullable use @Nullable otherwise no annotation is needed. Re: Optional: My take is that it makes flowable coding easier (.map() instead of `if`), so I think it is fine to use it both in parameter and return values. I do not think we need to "force" its usage, though. Cheers, Dmitri. On Wed, May 21, 2025 at 10:02 AM Alex Dutra wrote: > Hi all, > > A while ago, we had a discussion regarding which nullness annotations > to use and whether we should consider favoring non-null by default. I > would like to revive that discussion. > > We are currently using the `jakarta.annotation` package consistently, > but the defaults are not clear: should we consider everything as > non-null by default and only annotate the nullable things, or the > other way around? Some classes are cluttered with both annotations, > which seems unnecessary and confusing. > > I would personally be in favor of considering everything as non-null by > default. > > Please let me know your thoughts. > > Alex >
Re: Merge module polaris-quarkus-admin and polaris-quarkus-server
I have to disagree here. In this thread, we discuss that *Polaris ships as a binary bundle, not a pre-built Kubernetes chart.* That bundle includes both the server and the admin tool for a reason: it guarantees version compatibility and gives every user a friction-free way to bootstrap a catalog, no matter where they run. If an operator wants the leanest possible container image for production, nothing stops them from building two images—one “server-only,” one “admin-only”—from the same bundle. That split is straightforward and entirely under the deployer’s control. The admin binary adds a few megabytes, noise compared with the JRE and native libraries we ship anyway. The gain in operational simplicity outweighs that marginal footprint. If an attacker can already execute arbitrary binaries inside the server container, the cluster is compromised with or without admin-tool. They can run "rm -rf *", etc. There are multiple precedents like Spark, Kafka, ZooKeeper, Flink, and even PostgreSQL all ship “server + tools” in the same distribution tarball. SRE or operators can strip or relocate what they don’t need for their runtime images. So let’s keep the all-in-one distribution. Those who deploy into strict environments can (and should) build slimmer images, but that’s a packaging choice best left to the operator, not an upstream fragmentation of our release artifacts. Yufei On Mon, May 19, 2025 at 5:17 AM Robert Stupp wrote: > Polaris-server and the admin-tool are separate things. You deploy the > /server/ and let it run in k8s (or whatever). Bootstrapping via the > admin-tools happens rarely (once) and is rather performed from an > administrator's machine, whereas the server(s) run elsewhere. So server > and admin-tool are run in very different ways. > > The admin-tool is (at best) just overhead in a server and increases the > size of the image + container. Playing devil's advocate here, having the > admin-tool executable accessible from the server gives easy access to > the purge functionality. > > I also see issues in how users interact with a single image containing > both the server and admin-tool w/o accidentally starting another Polaris > server instance. Starting the server on an administrator's machine can > likely start running tasks, which can run into failures, because you may > not have the necessary object-stores' credentials. > > Configuration settings are pushed to the admin-tool and server in the > same way. So you're probably already using the same (k8s) values.yaml. > > Overall I'm not sure whether this approach really simplifies things or > rather introduces more complexity and potentially risks. > > > On 15.05.25 07:00, Yufei Gu wrote: > > Not at all. Thanks for the review, actually! > > > > I think it's a good idea to name it polaris-bin-{version}.tgz or > > polaris-{version}-bin.tgz, which indicates that it's a binary > > distribution, not a source distribution. Let me make the change. > > > > Yufei > > > > > > On Wed, May 14, 2025 at 2:31 PM Dmitri Bourlatchkov > > wrote: > > > >> Thanks, Yufei! I added some comments in GH despite the "draft" PR > status. I > >> hope you do not mind :) > >> > >> Overall, using one distribution archive both for the server and the > admin > >> tool makes sense to me. > >> > >> 4. Standardized the binary distribution package naming to > >> polaris-{version}.tgz and polaris-{version}.zip, following common > >> conventions used by other projects (e.g., > spark-3.5.5-bin-hadoop3.tgz). > >> > >> > >> Would it make sense to use archive names like polaris-1.0.0-bin.tgz (to > >> follow Spark's example closer)? > >> > >> Thanks, > >> Dmitri. > >> > >> On Wed, May 14, 2025 at 2:32 PM Yufei Gu wrote: > >> > >>> Hi Folk, > >>> Thanks a lot for the discussion. Here is the PR( > >>> https://github.com/apache/polaris/pull/1589) to merge the binary > >>> distribution while still keeping module polaris-quarkus-admin and > >>> polaris-quarkus-server separately. > >>> > >>> *What’s included in thePR:* > >>> > >>> 1. Introduced a new module that combines binary distribution for > both > >>> Admin Tools and Server. Please note that source and jar are still > >>> separated. Please refer to my original email for the motivation > behind > >>> this > >>> change. > >>> 2. Removed the now-redundant run-scripts module. > >>> 3. Consolidated the README to reflect the unified binary > distribution. > >>> 4. Standardized the binary distribution package naming to > >>> polaris-{version}.tgz and polaris-{version}.zip, following common > >>> conventions used by other projects (e.g., > >> spark-3.5.5-bin-hadoop3.tgz). > >>> *TODOs*: > >>> > >>> 1. Consolidate LICENSE and NOTICE files from both Admin Tools and > >>> Server. > >>> 2. Remove the distribution tasks in each of the original modules. > >>> > >>> The PR is technically ready, but I plan to wait until the 0.10 release > is > >>> finalized to avoid disrupting the release pr
Re: [DISCUSS] Standardize nullness annotations
Hi, My proposal was centered around compile-time checks and targets mostly developers and contributors. I am not questioning the usefulness of runtime checks when these make sense. Maybe an example is better than a thousand words. Let's imagine a simple getOrDefault() method. Which version do you prefer? 1. Annotate only nullable items: public String getOrDefault(@Nullable String s, String def) { return s == null ? def : Objects.requireNonNull(def); } 2. Annotate only non-null items: @Nonnull public String getOrDefault(String s, @Nonnull String def) { return s == null ? def : Objects.requireNonNull(def); } 3. Annotate everything: @Nonnull public String getOrDefault(@Nullable String s, @Nonnull String def) { return s == null ? def : Objects.requireNonNull(def); } Many places in Polaris are using option 3, which is too verbose and leads to visual fatigue. What I was suggesting to the community is to adopt option 1, that reduces the visual clutter and also assumes non-null by default. (You will notice that I added a runtime check to all three versions.) Hope that helps to clarify the discussion. Alex On Wed, May 21, 2025 at 8:54 PM Yufei Gu wrote: > > > > > In my opinion, assuming everything is nullable by default isn't the > > best approach for writing robust code. I believe a bias towards > > non-nullness leads to more reliable systems. > > I agreed with the intention, but am concerned that assuming everything is > non-nullness may discourage null-checking, which is problematic as runtime > null-checking isn't a thing. > > Yufei > > > On Wed, May 21, 2025 at 9:52 AM Alex Dutra > wrote: > > > Hi Eric, > > > > You're right that annotations don't change the bytecode at runtime. > > Their real value comes during compilation, as many static analysis > > tools use them to flag potential issues. They can even cause build > > failures depending on how you configure them. My IDE (IntelliJ) > > frequently warns me when I forget to handle a potential NPE; if you're > > not seeing similar feedback, it might be worth checking your IDE > > settings. > > > > While the annotations are primarily for compile-time checks, that > > doesn't mean we can't also incorporate runtime checks. We should aim > > to include these whenever this makes sense, for example by using > > Guava's Preconditions. This is especially useful if we can't guarantee > > that a method parameter, for instance, will never be null, because > > it's being provided by some external system. > > > > In my opinion, assuming everything is nullable by default isn't the > > best approach for writing robust code. I believe a bias towards > > non-nullness leads to more reliable systems. > > > > I am also a big fan of Optional and think we should strive to use it > > as much as possible. That said, it's not always possible, especially > > if you are implementing a third-party interface that doesn't use it. > > Using Optional in class fields and method parameters is also > > controversial: Optional was designed primarily as a signal from the > > callee to the caller, to signify: "no result". In other words, its > > main purpose is to clarify method return types. This post on Stack > > Overflow by Brian Goetz is worth reading: [1]. > > > > Thanks, > > > > Alex > > > > [1]: > > https://stackoverflow.com/questions/26327957/should-java-8-getters-return-optional-type/26328555#26328555 > > > > > > On Wed, May 21, 2025 at 4:37 PM Eric Maynard > > wrote: > > > > > > Thanks for bringing this up as I’ve been confused by this a few times. > > > > > > Before Polaris I hadn’t really encountered these annotations and I was > > > surprised to learn they don’t “do anything” — that is, there is no > > > additional safety you get at runtime when a null value is passed into a > > > parameter marked non-null. Similarly nothing enforces that you handle > > null > > > values when something is annotated as nullable. > > > > > > For that reason, I tend to assume everything is nullable regardless of > > > annotation and I would be in favor of standardizing around that > > assumption. > > > Iff something is annotated as Non-null a developer should feel safe > > > skipping a check for null, but otherwise they should handle null. > > > > > > I am a big fan of Optional and of trying to reduce the usage of null as > > > much as possible though. > > > > > > On Wed, May 21, 2025 at 3:02 PM Alex Dutra > > > > > wrote: > > > > > > > Hi all, > > > > > > > > A while ago, we had a discussion regarding which nullness annotations > > > > to use and whether we should consider favoring non-null by default. I > > > > would like to revive that discussion. > > > > > > > > We are currently using the `jakarta.annotation` package consistently, > > > > but the defaults are not clear: should we consider everything as > > > > non-null by default and only annotate the nullable things, or the > > > > other way around? Some classes are cluttered with both annotations, > > > > which seems unnecessary and co
Re: [DISCUSS] Create apache/polaris-admin-tool DockerHub repository ?
good point about Renovate! On Wed, May 21, 2025 at 4:49 AM Robert Stupp wrote: > How would tools like dependabot and Renovate distinguish work then? > > I think having separate `-unstable` repos is the safest way forward. > > > On 20.05.25 19:10, Jean-Baptiste Onofré wrote: > > Yes but we can use tag that way “by convention”. > > > > Le mar. 20 mai 2025 à 10:18, Robert Stupp a écrit : > > > > There's no notion of "snapshot" or "nightly" for image tags. > > That's why > > we're pushing for the separate repos. > > > > On 20.05.25 05:18, Jean-Baptiste Onofré wrote: > > > Yes agree. Latest would be a valid tag only for release images. > > > Snapshot images will be just … snapshot tag ;) (not latest). > > > > > > Regards > > > JB > > > > > > Le lun. 19 mai 2025 à 21:40, Dmitri Bourlatchkov > > a > > > écrit : > > > > > >> If we put nightlies in the same repo, we should be careful > > with the > > >> "latest" tag. > > >> > > >> I suppose users will expect "latest" to track only officially > > released > > >> images in that case. > > >> > > >> It might even be worth _not_ using the "latest" tag at all. > > >> > > >> Cheers, > > >> Dmitri. > > >> > > >> On Mon, May 19, 2025 at 3:24 PM Jean-Baptiste Onofré > > > > >> wrote: > > >> > > >>> I'm not super convinced by a repository dedicated to nightly > > or snapshot. > > >>> > > >>> A nightly or a SNAPSHOT is a version/tag of an image. So, I would > > >>> expect to have it in the same repository as the released images. > > >>> > > >>> For instance, you would have apache/polaris:x.y.z and > > >>> apache/polaris:x.y.z-SNAPSHOT > > >>> > > >>> Some projects do that (for instance > > >>> https://hub.docker.com/r/apache/gravitino/tags). > > >>> > > >>> I would propose to have apache/polaris and > > apache/polaris-admin-tool > > >> repos. > > >>> Thoughts ? > > >>> > > >>> Regards > > >>> JB > > >>> > > >>> On Mon, May 19, 2025 at 8:43 PM Dmitri Bourlatchkov > > > > >>> wrote: > > Using a different repo name for nightlies / unstable sounds > > good to me, > > > > Cheers, > > Dmitri. > > > > On Mon, May 19, 2025 at 12:19 PM Alex Dutra > > >>> > > wrote: > > > > > Hi all, > > > > > > To be honest my preference would be Option 4: a different > > image name > > > for everything that is "nightly" or "unstable". For example, > > > "apache/polaris-unstable" or > > "apache/polaris-admin-tool-nightly". > > > > > > My reasoning is simple: make it almost impossible for a user to > > > accidentally deploy an unstable version of the server into > > >> production, > > > by confusing an unstable tag with a production-ready one. > > > > > > Otherwise, I can also live with Option 3, especially if we > > use tags > > > that are "scary" enough to dissuade people from using them in > > > production. > > > > > > Option 1 would be really bad for DevOps workflows: for > > example, I > > > think running Kubernetes Jobs to bootstrap or purge realms > > (or doing > > > other administrative tasks) will become a common practice; > > but in > > >> this > > > case, the tool must be available as a Docker image. > > > > > > Option 2 is also bad: users expect Docker binaries to > > contain the > > >> same > > > "thing" regardless of tags, so it would be confusing to mix > > binaries > > > for the server and the tool under the same image. Not to > > mention that > > > mixing binaries would absolutely preclude the usage of the tag > > > "latest" since we wouldn't know if "latest" contains the > > server or > > >> the > > > tool. > > > > > > Thanks, > > > > > > Alex > > > > > > > > > On Mon, May 19, 2025 at 5:53 PM Robert Stupp > > wrote: > > >> Also +1 on option 3. > > >> > > >> Would also propose to push releases and snapshots to > > separate image > > >> repositories. > > >> > > >> On 19.05.25 17:46, Dmitri Bourlatchkov wrote: > > >>> Option 2 looks very confusing to me. While it can technically > > >>> work, I > > > think > > >>> most people expect the repository name to reflect the > > nature of > > >> the > > > binary, > > >>> so apache/polaris would mean "server" by default. > > >>> > > >>> I prefer option 3. > > >>> > > >>> I also think we should have an image for the admin tool > > because > > >> it > > >>> is > > >>> required for boots
Re: [DISCUSS] Redo the STDOUT batter
Sounds good! Emojis voting is fun! Yufei On Thu, May 22, 2025 at 10:59 AM Russell Spitzer wrote: > Thanks to Scott and Dmitri for two nice design proposals > > https://github.com/apache/polaris/pull/1654 > https://github.com/apache/polaris/pull/1655 > > Quoting what I wrote on 1654 > > How about we wait till 5/28 and link all alternatives in the parent issue > https://github.com/apache/polaris/issues/1653 and we can do approval > voting > with emojis. I would end the vote until Friday 5/30 and merge the winner. > > > On Sun, May 18, 2025 at 6:46 PM Yufei Gu wrote: > > > +1 to refresh it. > > > > First, we should fix the naming. I assume "polaris catalog" was used > before > > the donation to the ASF. Now, we should consistently use "Apache Polaris" > > or simply "Polaris". > > Also, I prefer the style JB suggested, it’s cleaner and makes the > > characters easier to recognize. The current design feels a bit harder to > > parse at a glance for me, maybe just me :) > > > > Yufei > > > > > > On Fri, May 16, 2025 at 10:49 PM Jean-Baptiste Onofré > > wrote: > > > > > Hi Dimitri > > > > > > In projects like Karaf, I'm using https://patorjk.com/software/taag/ > > > to create ASCII Art for banner :) > > > > > > Regards > > > JB > > > > > > On Fri, May 16, 2025 at 9:23 PM Dmitri Bourlatchkov > > > wrote: > > > > > > > > Hi All, > > > > > > > > How do people feel about refreshing the ascii art banner [1] for 1.0? > > > > > > > > Is there any lore behind the iceberg picture? I assume it's an > iceberg > > :) > > > > > > > > Cheers, > > > > Dmitri. > > > > > > > > [1] > > > > > > > > > > https://github.com/apache/polaris/blob/apache-polaris-0.10.0-beta-incubating-rc2/service/common/src/main/resources/org/apache/polaris/service/banner.txt > > > > > >
Re: [DISCUSS] Standardize nullness annotations
My preference is for option 1 below. On Thu, May 22, 2025 at 1:53 PM Alex Dutra wrote: > Hi, > > My proposal was centered around compile-time checks and targets mostly > developers and contributors. I am not questioning the usefulness of > runtime checks when these make sense. > > Maybe an example is better than a thousand words. Let's imagine a > simple getOrDefault() method. Which version do you prefer? > > 1. Annotate only nullable items: > public String getOrDefault(@Nullable String s, String def) { return s > == null ? def : Objects.requireNonNull(def); } > > 2. Annotate only non-null items: > @Nonnull public String getOrDefault(String s, @Nonnull String def) { > return s == null ? def : Objects.requireNonNull(def); } > > 3. Annotate everything: > @Nonnull public String getOrDefault(@Nullable String s, @Nonnull > String def) { return s == null ? def : Objects.requireNonNull(def); } > > Many places in Polaris are using option 3, which is too verbose and > leads to visual fatigue. What I was suggesting to the community is to > adopt option 1, that reduces the visual clutter and also assumes > non-null by default. > > (You will notice that I added a runtime check to all three versions.) > > Hope that helps to clarify the discussion. > > Alex > > On Wed, May 21, 2025 at 8:54 PM Yufei Gu wrote: > > > > > > > > In my opinion, assuming everything is nullable by default isn't the > > > best approach for writing robust code. I believe a bias towards > > > non-nullness leads to more reliable systems. > > > > I agreed with the intention, but am concerned that assuming everything is > > non-nullness may discourage null-checking, which is problematic as > runtime > > null-checking isn't a thing. > > > > Yufei > > > > > > On Wed, May 21, 2025 at 9:52 AM Alex Dutra > > > wrote: > > > > > Hi Eric, > > > > > > You're right that annotations don't change the bytecode at runtime. > > > Their real value comes during compilation, as many static analysis > > > tools use them to flag potential issues. They can even cause build > > > failures depending on how you configure them. My IDE (IntelliJ) > > > frequently warns me when I forget to handle a potential NPE; if you're > > > not seeing similar feedback, it might be worth checking your IDE > > > settings. > > > > > > While the annotations are primarily for compile-time checks, that > > > doesn't mean we can't also incorporate runtime checks. We should aim > > > to include these whenever this makes sense, for example by using > > > Guava's Preconditions. This is especially useful if we can't guarantee > > > that a method parameter, for instance, will never be null, because > > > it's being provided by some external system. > > > > > > In my opinion, assuming everything is nullable by default isn't the > > > best approach for writing robust code. I believe a bias towards > > > non-nullness leads to more reliable systems. > > > > > > I am also a big fan of Optional and think we should strive to use it > > > as much as possible. That said, it's not always possible, especially > > > if you are implementing a third-party interface that doesn't use it. > > > Using Optional in class fields and method parameters is also > > > controversial: Optional was designed primarily as a signal from the > > > callee to the caller, to signify: "no result". In other words, its > > > main purpose is to clarify method return types. This post on Stack > > > Overflow by Brian Goetz is worth reading: [1]. > > > > > > Thanks, > > > > > > Alex > > > > > > [1]: > > > > https://stackoverflow.com/questions/26327957/should-java-8-getters-return-optional-type/26328555#26328555 > > > > > > > > > On Wed, May 21, 2025 at 4:37 PM Eric Maynard > > > > wrote: > > > > > > > > Thanks for bringing this up as I’ve been confused by this a few > times. > > > > > > > > Before Polaris I hadn’t really encountered these annotations and I > was > > > > surprised to learn they don’t “do anything” — that is, there is no > > > > additional safety you get at runtime when a null value is passed > into a > > > > parameter marked non-null. Similarly nothing enforces that you handle > > > null > > > > values when something is annotated as nullable. > > > > > > > > For that reason, I tend to assume everything is nullable regardless > of > > > > annotation and I would be in favor of standardizing around that > > > assumption. > > > > Iff something is annotated as Non-null a developer should feel safe > > > > skipping a check for null, but otherwise they should handle null. > > > > > > > > I am a big fan of Optional and of trying to reduce the usage of null > as > > > > much as possible though. > > > > > > > > On Wed, May 21, 2025 at 3:02 PM Alex Dutra > > > > > > > > wrote: > > > > > > > > > Hi all, > > > > > > > > > > A while ago, we had a discussion regarding which nullness > annotations > > > > > to use and whether we should consider favoring non-null by > default. I > > > > > would like to revive that discussion. > > > > >
Re: [DISCUSS] Redo the STDOUT batter
Thanks to Scott and Dmitri for two nice design proposals https://github.com/apache/polaris/pull/1654 https://github.com/apache/polaris/pull/1655 Quoting what I wrote on 1654 How about we wait till 5/28 and link all alternatives in the parent issue https://github.com/apache/polaris/issues/1653 and we can do approval voting with emojis. I would end the vote until Friday 5/30 and merge the winner. On Sun, May 18, 2025 at 6:46 PM Yufei Gu wrote: > +1 to refresh it. > > First, we should fix the naming. I assume "polaris catalog" was used before > the donation to the ASF. Now, we should consistently use "Apache Polaris" > or simply "Polaris". > Also, I prefer the style JB suggested, it’s cleaner and makes the > characters easier to recognize. The current design feels a bit harder to > parse at a glance for me, maybe just me :) > > Yufei > > > On Fri, May 16, 2025 at 10:49 PM Jean-Baptiste Onofré > wrote: > > > Hi Dimitri > > > > In projects like Karaf, I'm using https://patorjk.com/software/taag/ > > to create ASCII Art for banner :) > > > > Regards > > JB > > > > On Fri, May 16, 2025 at 9:23 PM Dmitri Bourlatchkov > > wrote: > > > > > > Hi All, > > > > > > How do people feel about refreshing the ascii art banner [1] for 1.0? > > > > > > Is there any lore behind the iceberg picture? I assume it's an iceberg > :) > > > > > > Cheers, > > > Dmitri. > > > > > > [1] > > > > > > https://github.com/apache/polaris/blob/apache-polaris-0.10.0-beta-incubating-rc2/service/common/src/main/resources/org/apache/polaris/service/banner.txt > > >
Re: [DISCUSS] Remove realm_id metric tag
Hi again, Let's go back to the main concern: excessive memory required for metrics. To move this discussion forward, would you all agree to introduce a flag that activates the inclusion of realm IDs in the metrics? It seems like this could be a good middle ground that addresses everyone's needs. Let me know your thoughts. Alex On Wed, May 21, 2025 at 3:57 PM Robert Stupp wrote: > > Ouch! That sounds very simple to exploit and quite serious. > > On 21.05.25 15:38, Alex Dutra wrote: > > Also worth noting: http_server_requests_seconds_* metrics are generated > > even for failed requests. So it would be very easy for an attacker to forge > > HTTP requests containing invalid realms and take the server down. > > > > Alex > > > > On Wed, May 21, 2025 at 3:04 PM Robert Stupp wrote: > > > >> OOMs are bad - it's extremely hard (if not impossible) for any user to > >> figure out why that happens. Bug reports would just read like "Polaris > >> runs into an OOM - no further information available". > >> > >> IMHO the system should be stable by default, not the other way around. > >> Even if there is a way to enable such huge cardinalities, that flag > >> should really be hidden and not documented anywhere - unless there's a > >> clear disclaimer saying: "Enabling this can lead to an unresponsive > >> system and risk the stability of the system. ..." > >> > >> I propose to investigate all metrics and reduce the cardinality of all > >> of those as that can easily become a serious issue. > >> > >> On 21.05.25 11:42, Alex Dutra wrote: > >>> Hi Mike, > >>> > >>> If you have around a hundred realms or more, imho you are already in > >>> trouble. > >>> > >>> The most problematic metric is the http_server_requests_seconds_* metrics > >>> family, e.g.: > >>> > >>> > >> http_server_requests_seconds_count{application="Polaris",method="GET",outcome="CLIENT_ERROR",realm_id="POLARIS",status="404",uri="NOT_FOUND",} > >>> 1.0 > >>> > >>> Since metrics in this family already have a high cardinality potential > >>> given the number of tags it supports by default, adding one more > >> dimension > >>> to them makes things (exponentially) worse. > >>> > >>> It's very easy to demonstrate that by running a small test [1]. On my > >>> machine, the first 2 iterations (10 and 100 realms) complete, but the 3rd > >>> iteration (1000 realms) runs for about 1 minute then ends up in > >>> java.lang.OutOfMemoryError: Java heap space. > >>> > >>> That's why I advocated for removing the tag. If however you really want > >> to > >>> keep it, I'd suggest introducing a configuration flag to disable it in > >> two > >>> problematic metric families: the HTTP one shown above, and the > >> per-endpoint > >>> metrics as well. > >>> > >>> Thanks, > >>> > >>> Alex > >>> > >>> [1]: https://gist.github.com/adutra/414fe773e8727304b34e9249299c988d > >>> > >>> > >>> > >>> On Wed, May 21, 2025 at 7:35 AM Michael Collado > >>> wrote: > >>> > Hmm, we do use the realm tag in our metric publishing. I understand the > concern re: cardinality. Maybe we can support filtering metrics that > >> have > realm and support another metric without realm? > > On Mon, May 19, 2025 at 12:24 PM Dmitri Bourlatchkov > wrote: > > > Removing realm_id from metrics tags makes sense to me (to avoid high > > cardinality). > > > > If we need to have insight into load differences from realm to realm, > >> it > > might be preferable to introduce metrics dedicated to that rather than > > increasing the cardinality of every endpoint metric. > > > > Cheers, > > Dmitri. > > > > On Thu, May 15, 2025 at 3:30 PM Alex Dutra > >> > wrote: > > > >> Hi all, > >> > >> I would like to suggest removing the "realm_id" metric tag entirely. > >> > >> My concern is that this tag has the potential for high cardinality, > which > >> is generally considered a bad practice when dealing with metrics. High > >> cardinality can lead to performance issues and increased memory usage. > >> > >> Granted, the default realm resolver in Polaris is tailored for just a > >> handful of realms, but nothing prevents users from declaring hundreds > of > >> realms. > >> > >> I believe we can still effectively monitor Polaris servers without > >> this > >> specific tag, since the realm ID is also propagated in traces emitted > by > >> Polaris. Tracing is a much better fit for high-cardinality domains. > >> > >> I'm open to discussing this further; a potential alternative would be > to > >> introduce a flag to disable this specific metric tag, but I feel like > >> removing it would be a much cleaner approach. > >> > >> Let me know your thoughts. > >> > >> Thanks, > >> > >> Alex > >> > >> -- > >> Robert Stupp > >> @snazy > >> > >> > -- > Robert Stupp > @snazy >
Re: Proposal for Implementation of the Proposed Iceberg REST Spec - Events API
I haven’t thought of this in depth honestly - but I could see this being the case. -Adnan > On May 21, 2025, at 4:39 PM, Eric Maynard wrote: > > -devlist > > If we design by interface properly, it should be relatively easy to offer > both a disk buffering and an always-write implementation right? > > On Thu, May 22, 2025 at 12:12 AM Adnan Hemani > wrote: > >> Hi all, >> >> Thanks for sharing these thoughts. I’m also not completely sure about how >> much we should care about how much slower things will be if we just make a >> trip to the persistence on every write action. However, I’m building this >> feature with the intention of being able to also support read event types >> in the near future, if this is something that the customer is interested in >> enabling using the `CustomOperation` type that is defined in the Events API >> spec. Of course, this would need to be configured by the administrator, as >> maintenance of the persistence is their responsibility. >> >> Given that the Iceberg Events API spec has not yet merged and can still >> see some changes, I’m planning to begin work on the disk buffering now and >> wait for the Events API to finalize before working on the API side of the >> end-to-end implementation. >> >> Best, >> Adnan Hemani >> >>> On May 20, 2025, at 10:30 PM, Michael Collado >> wrote: >>> >>> That’s super interesting. Glad this is being worked on. Personally, I >> don’t >>> know that the latency for writing events to a persistent storage is >> really >>> all that concerning. Looking at the enum of supported operations, only >>> write operations seem to trigger the event. It’s not like every read >>> request issues a new event. Given that the request latency here is >>> dominated by cloud storage calls, do we really care about one extra call >> to >>> Postgres? Personally, I’d skip the extra complexity of a buffer of any >> kind >>> and just write straight to the persistence store. >>> >>> Mike >>> >>> On Tue, May 20, 2025 at 9:31 AM Yufei Gu > flyrain...@gmail.com>> wrote: >>> Looks awesome. Thanks for taking the lead! It makes sense to use a JDBC-backed persistence layer, shared or separate. The optional >> retention period is a nice safeguard. I don’t see any blockers on my side. If no one raises major concerns >> this week, please go ahead and start the implementation. Exciting to see this coming together! Yufei On Tue, May 13, 2025 at 6:37 PM Adnan Hemani wrote: > Hi all, > > I am raising a proposal to implement the proposed Iceberg REST > specification for the Events API (doc < > >> https://www.google.com/url?q=https://www.google.com/url?q%3Dhttps://docs.google.com/document/d/1WtIsNGVX75-_MsQIOJhXLAWg6IbplV4-DkLllQEiFT8/edit?pli%253D1%2526tab%253Dt.0%26source%3Dgmail-imap%26ust%3D174841032700%26usg%3DAOvVaw0UNQYKbXoQ2YHVM7J0kB3l&source=gmail-imap&ust=174847561300&usg=AOvVaw38VwFquueiVwCdlo_xR2rB > , > GH < >> https://www.google.com/url?q=https://www.google.com/url?q%3Dhttps://github.com/apache/iceberg/pull/12584/files%26source%3Dgmail-imap%26ust%3D174841032700%26usg%3DAOvVaw1AvwLK402voAm_j6zy25Mn&source=gmail-imap&ust=174847561300&usg=AOvVaw2IwxUir8jCegcKC47zv1Si>). >> It is my > understanding that this proposal is close and that we will be required >> to > implement something very close to the current proposal in the near future. > > If Polaris is to implement this API, it will likely need to be through >> a > Persistence instance that the Polaris server can query instantly, as >> this > API will not be asynchronous. Please note, this proposal is not to comment > on what events we may emit today or in the future - the scope of this > proposal is solely to discuss how we plan to implement the proposed Events > API. > > Changes to be made: > > Implement Event storage through the Polaris Persistence layer > > We will store events in a persistence instance of user’s choice - >> whether > they would like the events to be part of the same persistence instance >> as > their Polaris metadata or if they would like for a separate persistence > instance. Users will provide the persistence instance by configuring a JDBC > string on Polaris startup, similarly to the JDBC string we receive > currently from users for the Polaris metadata. > > For concerns regarding scaling of events in the Polaris persistence layer, > we can also implement a recommended, optional parameter for an events > retention period after which Polaris will asynchronously delete records > older than that time period. > > How to Implement Writes to the Polaris Persistence layer > > The way to implement the above proposal would be through implementation of > the `PolarisEventListener` < > >> https://www.google.com/url
Re: [Discuss] Add `location` to generic table spec
> This is a stricter requirement than we have for Iceberg tables. Are we really going to enforce this? How will we do it efficiently? If not, let's not put it in the spec. The efficiency is a good point, if we are supporting arbitrary nested namespaces, the efficiency is definitely a concern. Maybe we can restrict that for generic tables, but I think it would be good for us to stay consistent with Iceberg tables on this, since we share the namespace concept. We can exclude this from the spec. However, I do think that is the right restriction to put for both Iceberg and generic tables for better security guarantee, maybe we can do a separate discussion on this topic. >It would be trivial to add update support for generic entities. Why canonicalize this restriction in the spec? We don't, for example, currently detail a restriction around the fact that you can't change a generic table's format. Sure, we don't have to mention this in the Spec. > generic tables are a catch-all type not specific to any format (including Iceberg) Generic Table APis today have a clear separation with Iceberg table APIs. I don't think we want to close the door for that, and that is also why I think "generic" is a good name. However, if want to move on to include certain semantics for iceberg tables, for example, showing iceberg tables in list tables, there will be a repurpose of the API endpoints, and I think it would be more proper to move on for V2 spec, because people will have to use those APIs differently. > GenericTableEntity is the type I'm most likely to look to for the conversion service, which means it will indeed be used to represent Iceberg tables. For conversion, if we are converting a table to an iceberg table, and the table only has one root location, the target iceberg table will also have one root location, so I don't see a problem with this. If we are converting from an iceberg table to a target format that only supports one location, I don't see a problem also. Even with Iceberg table spec today, I believe the locations it has are : top level location, metadata.path, and data.path. I don't think that can be achieved with an array of locations also, Because it can not tell which path is for metadata, which path is for data, I don't think relying on the size and position of an array is a good idea, and that extra path information can continue be represented with generic tables using properties and top level location. Even with all those location configurations, I don't think Iceberg spec is capturing all locations a table can have, because every snapshot can potentially write into a different location, and those are not tracked anywhere by anyone today. Furthermore it might require information more than just a location, for example, it might need to be associated with the snapshot. I know Dennis was discussing a multi-location spec for Iceberg, but the information needed seems more complicated than just a list of locations. Table with multiple location support seems a bigger topic that requires much more thought to me, again I am not saying we shouldn't support it in the future, but I think we should put more thought into how tables with multiple locations work before we start supporting those. > The multi-location support in Polaris seems not very well also, the overlap check and credential vending seems all done with one location Sorry, i think i misread the caller of the code for the overlap check. Dennis mentioned that we only use one location for credential, but it might be for something else. Best Regards, Yun On Thu, May 22, 2025 at 3:08 PM Eric Maynard wrote: > > i meant no two tables under the same catalog can have the same location > > This is a stricter requirement than we have for Iceberg tables. Are we > really going to enforce this? How will we do it efficiently? If not, let's > not put it in the spec. > > > we do not have any update support > > It would be trivial to add update support for generic entities. Why > canonicalize this restriction in the spec? We don't, for example, currently > detail a restriction around the fact that you can't change a generic > table's format. > > > generic tables are designed for non-Iceberg tables today, > > I don't actually think this is true. There's nothing about generic tables > that make them more useful for Delta tables than Iceberg tables, for > example. On the contrary, I initially proposed the name "generic" in part > to capture that generic tables are a catch-all type not specific to any > format (including Iceberg). More practically, GenericTableEntity is the > type I'm most likely to look to for the conversion service, which means it > will indeed be used to represent Iceberg tables. > > > The multi-location support in Polaris seems not very well also, the > overlap check and credential vending seems all done with one location > > This is not true. >
Re: [DISCUSS] Remove realm_id metric tag
My immediate reaction is that a configuration flag would be good enough for us. Our setup prevents requests without a valid realm, so we're not subject to the DDOS attack you mentioned, but we should address that in the default realm context resolver as well. I can do some further investigation to see if it's something we could get rid of entirely (we also use log-based metrics), but I like the slow approach of adding a configuration flag, disabling it by default, then removing functionality as a best practice, anyway. Mike On Thu, May 22, 2025 at 11:39 AM Alex Dutra wrote: > Hi again, > > Let's go back to the main concern: excessive memory required for metrics. > > To move this discussion forward, would you all agree to introduce a > flag that activates the inclusion of realm IDs in the metrics? > > It seems like this could be a good middle ground that addresses > everyone's needs. > > Let me know your thoughts. > > Alex > > > On Wed, May 21, 2025 at 3:57 PM Robert Stupp wrote: > > > > Ouch! That sounds very simple to exploit and quite serious. > > > > On 21.05.25 15:38, Alex Dutra wrote: > > > Also worth noting: http_server_requests_seconds_* metrics are generated > > > even for failed requests. So it would be very easy for an attacker to > forge > > > HTTP requests containing invalid realms and take the server down. > > > > > > Alex > > > > > > On Wed, May 21, 2025 at 3:04 PM Robert Stupp wrote: > > > > > >> OOMs are bad - it's extremely hard (if not impossible) for any user to > > >> figure out why that happens. Bug reports would just read like "Polaris > > >> runs into an OOM - no further information available". > > >> > > >> IMHO the system should be stable by default, not the other way around. > > >> Even if there is a way to enable such huge cardinalities, that flag > > >> should really be hidden and not documented anywhere - unless there's a > > >> clear disclaimer saying: "Enabling this can lead to an unresponsive > > >> system and risk the stability of the system. ..." > > >> > > >> I propose to investigate all metrics and reduce the cardinality of all > > >> of those as that can easily become a serious issue. > > >> > > >> On 21.05.25 11:42, Alex Dutra wrote: > > >>> Hi Mike, > > >>> > > >>> If you have around a hundred realms or more, imho you are already in > > >>> trouble. > > >>> > > >>> The most problematic metric is the http_server_requests_seconds_* > metrics > > >>> family, e.g.: > > >>> > > >>> > > >> > http_server_requests_seconds_count{application="Polaris",method="GET",outcome="CLIENT_ERROR",realm_id="POLARIS",status="404",uri="NOT_FOUND",} > > >>> 1.0 > > >>> > > >>> Since metrics in this family already have a high cardinality > potential > > >>> given the number of tags it supports by default, adding one more > > >> dimension > > >>> to them makes things (exponentially) worse. > > >>> > > >>> It's very easy to demonstrate that by running a small test [1]. On my > > >>> machine, the first 2 iterations (10 and 100 realms) complete, but > the 3rd > > >>> iteration (1000 realms) runs for about 1 minute then ends up in > > >>> java.lang.OutOfMemoryError: Java heap space. > > >>> > > >>> That's why I advocated for removing the tag. If however you really > want > > >> to > > >>> keep it, I'd suggest introducing a configuration flag to disable it > in > > >> two > > >>> problematic metric families: the HTTP one shown above, and the > > >> per-endpoint > > >>> metrics as well. > > >>> > > >>> Thanks, > > >>> > > >>> Alex > > >>> > > >>> [1]: https://gist.github.com/adutra/414fe773e8727304b34e9249299c988d > > >>> > > >>> > > >>> > > >>> On Wed, May 21, 2025 at 7:35 AM Michael Collado < > collado.m...@gmail.com> > > >>> wrote: > > >>> > > Hmm, we do use the realm tag in our metric publishing. I understand > the > > concern re: cardinality. Maybe we can support filtering metrics that > > >> have > > realm and support another metric without realm? > > > > On Mon, May 19, 2025 at 12:24 PM Dmitri Bourlatchkov < > di...@apache.org> > > wrote: > > > > > Removing realm_id from metrics tags makes sense to me (to avoid > high > > > cardinality). > > > > > > If we need to have insight into load differences from realm to > realm, > > >> it > > > might be preferable to introduce metrics dedicated to that rather > than > > > increasing the cardinality of every endpoint metric. > > > > > > Cheers, > > > Dmitri. > > > > > > On Thu, May 15, 2025 at 3:30 PM Alex Dutra > > >> > > wrote: > > > > > >> Hi all, > > >> > > >> I would like to suggest removing the "realm_id" metric tag > entirely. > > >> > > >> My concern is that this tag has the potential for high > cardinality, > > which > > >> is generally considered a bad practice when dealing with metrics. > High > > >> cardinality can lead to performance issues and increased memory > usage. > > >> > > >> Granted, the default re
Re: [DISCUSS] Create apache/polaris-admin-tool DockerHub repository ?
Personally, I’m unsure about what good a nightly Docker image build helps with. If we’re backing up to the same Docker Hub repository, I think it could make some sense as long as we manage the “latest” tag properly to only ever tag releases since the proposed effort and maintenance overhead is low - but if we are going through the effort to create a whole new nightly/snapshot repo, what does this help us achieve? Do we see a proper use case for customers to ever really need that? On the question that started this thread about the Polaris Admin Tool, I’m onboard with Option 3 - let’s separate out the Admin Tool and the Server. We should be careful to ensure that these move together in terms of versioning - but I don’t see that as a big enough concern to refute this option. Best, Adnan Hemani > On May 22, 2025, at 12:12 PM, Dmitri Bourlatchkov wrote: > > good point about Renovate! > > On Wed, May 21, 2025 at 4:49 AM Robert Stupp wrote: > >> How would tools like dependabot and Renovate distinguish work then? >> >> I think having separate `-unstable` repos is the safest way forward. >> >> >> On 20.05.25 19:10, Jean-Baptiste Onofré wrote: >>> Yes but we can use tag that way “by convention”. >>> >>> Le mar. 20 mai 2025 à 10:18, Robert Stupp a écrit : >>> >>>There's no notion of "snapshot" or "nightly" for image tags. >>>That's why >>>we're pushing for the separate repos. >>> >>>On 20.05.25 05:18, Jean-Baptiste Onofré wrote: Yes agree. Latest would be a valid tag only for release images. Snapshot images will be just … snapshot tag ;) (not latest). Regards JB Le lun. 19 mai 2025 à 21:40, Dmitri Bourlatchkov >>> a écrit : > If we put nightlies in the same repo, we should be careful >>>with the > "latest" tag. > > I suppose users will expect "latest" to track only officially >>>released > images in that case. > > It might even be worth _not_ using the "latest" tag at all. > > Cheers, > Dmitri. > > On Mon, May 19, 2025 at 3:24 PM Jean-Baptiste Onofré >>> > wrote: > >> I'm not super convinced by a repository dedicated to nightly >>>or snapshot. >> >> A nightly or a SNAPSHOT is a version/tag of an image. So, I would >> expect to have it in the same repository as the released images. >> >> For instance, you would have apache/polaris:x.y.z and >> apache/polaris:x.y.z-SNAPSHOT >> >> Some projects do that (for instance >> https://www.google.com/url?q=https://hub.docker.com/r/apache/gravitino/tags&source=gmail-imap&ust=174854596300&usg=AOvVaw32_izMPTTmU8H9LY8DFkqU). >> >> I would propose to have apache/polaris and >>>apache/polaris-admin-tool > repos. >> Thoughts ? >> >> Regards >> JB >> >> On Mon, May 19, 2025 at 8:43 PM Dmitri Bourlatchkov >>> >> wrote: >>> Using a different repo name for nightlies / unstable sounds >>>good to me, >>> >>> Cheers, >>> Dmitri. >>> >>> On Mon, May 19, 2025 at 12:19 PM Alex Dutra >> >>> wrote: >>> Hi all, To be honest my preference would be Option 4: a different >>>image name for everything that is "nightly" or "unstable". For example, "apache/polaris-unstable" or >>>"apache/polaris-admin-tool-nightly". My reasoning is simple: make it almost impossible for a user to accidentally deploy an unstable version of the server into > production, by confusing an unstable tag with a production-ready one. Otherwise, I can also live with Option 3, especially if we >>>use tags that are "scary" enough to dissuade people from using them in production. Option 1 would be really bad for DevOps workflows: for >>>example, I think running Kubernetes Jobs to bootstrap or purge realms >>>(or doing other administrative tasks) will become a common practice; >>>but in > this case, the tool must be available as a Docker image. Option 2 is also bad: users expect Docker binaries to >>>contain the > same "thing" regardless of tags, so it would be confusing to mix >>>binaries for the server and the tool under the same image. Not to >>>mention that mixing binaries would absolutely preclude the usage of the tag "latest" since we wouldn't know if "latest" contains the >>>server or > the tool. Thanks, Alex On Mon, May 19, 2025 at 5:53 PM Robert Stupp >>> wrote: > Also +1 on option 3. > > Would also propose to push releases and snapshots to >>>separate image > repositories. > > On 19.05.25 17:46, Dmitri Bourlatchkov wr
Re: [DISCUSS] Create apache/polaris-admin-tool DockerHub repository ?
Hi Let's split the discussion in two parts. This thread is about polaris-admin-tool DockerHub repo. We have a consensus on option 3, so I will create a DockerHub repo for admin-tool. I will start a separate thread about nightly images. Thanks everyone! Regards JB On Fri, May 23, 2025 at 2:34 AM Adnan Hemani wrote: > > Personally, I’m unsure about what good a nightly Docker image build helps > with. If we’re backing up to the same Docker Hub repository, I think it could > make some sense as long as we manage the “latest” tag properly to only ever > tag releases since the proposed effort and maintenance overhead is low - but > if we are going through the effort to create a whole new nightly/snapshot > repo, what does this help us achieve? Do we see a proper use case for > customers to ever really need that? > > On the question that started this thread about the Polaris Admin Tool, I’m > onboard with Option 3 - let’s separate out the Admin Tool and the Server. We > should be careful to ensure that these move together in terms of versioning - > but I don’t see that as a big enough concern to refute this option. > > Best, > Adnan Hemani > > > On May 22, 2025, at 12:12 PM, Dmitri Bourlatchkov wrote: > > > > good point about Renovate! > > > > On Wed, May 21, 2025 at 4:49 AM Robert Stupp wrote: > > > >> How would tools like dependabot and Renovate distinguish work then? > >> > >> I think having separate `-unstable` repos is the safest way forward. > >> > >> > >> On 20.05.25 19:10, Jean-Baptiste Onofré wrote: > >>> Yes but we can use tag that way “by convention”. > >>> > >>> Le mar. 20 mai 2025 à 10:18, Robert Stupp a écrit : > >>> > >>>There's no notion of "snapshot" or "nightly" for image tags. > >>>That's why > >>>we're pushing for the separate repos. > >>> > >>>On 20.05.25 05:18, Jean-Baptiste Onofré wrote: > Yes agree. Latest would be a valid tag only for release images. > Snapshot images will be just … snapshot tag ;) (not latest). > > Regards > JB > > Le lun. 19 mai 2025 à 21:40, Dmitri Bourlatchkov > >>> a > écrit : > > > If we put nightlies in the same repo, we should be careful > >>>with the > > "latest" tag. > > > > I suppose users will expect "latest" to track only officially > >>>released > > images in that case. > > > > It might even be worth _not_ using the "latest" tag at all. > > > > Cheers, > > Dmitri. > > > > On Mon, May 19, 2025 at 3:24 PM Jean-Baptiste Onofré > >>> > > wrote: > > > >> I'm not super convinced by a repository dedicated to nightly > >>>or snapshot. > >> > >> A nightly or a SNAPSHOT is a version/tag of an image. So, I would > >> expect to have it in the same repository as the released images. > >> > >> For instance, you would have apache/polaris:x.y.z and > >> apache/polaris:x.y.z-SNAPSHOT > >> > >> Some projects do that (for instance > >> https://www.google.com/url?q=https://hub.docker.com/r/apache/gravitino/tags&source=gmail-imap&ust=174854596300&usg=AOvVaw32_izMPTTmU8H9LY8DFkqU). > >> > >> I would propose to have apache/polaris and > >>>apache/polaris-admin-tool > > repos. > >> Thoughts ? > >> > >> Regards > >> JB > >> > >> On Mon, May 19, 2025 at 8:43 PM Dmitri Bourlatchkov > >>> > >> wrote: > >>> Using a different repo name for nightlies / unstable sounds > >>>good to me, > >>> > >>> Cheers, > >>> Dmitri. > >>> > >>> On Mon, May 19, 2025 at 12:19 PM Alex Dutra > >> > >>> wrote: > >>> > Hi all, > > To be honest my preference would be Option 4: a different > >>>image name > for everything that is "nightly" or "unstable". For example, > "apache/polaris-unstable" or > >>>"apache/polaris-admin-tool-nightly". > > My reasoning is simple: make it almost impossible for a user to > accidentally deploy an unstable version of the server into > > production, > by confusing an unstable tag with a production-ready one. > > Otherwise, I can also live with Option 3, especially if we > >>>use tags > that are "scary" enough to dissuade people from using them in > production. > > Option 1 would be really bad for DevOps workflows: for > >>>example, I > think running Kubernetes Jobs to bootstrap or purge realms > >>>(or doing > other administrative tasks) will become a common practice; > >>>but in > > this > case, the tool must be available as a Docker image. > > Option 2 is also bad: users expect Docker binaries to > >>>contain the > > same > "thing" regardless of tags, so it would be confusing to mix > >>>binaries > for the server and the tool under the same image. Not to > >
Re: [Discuss] Add `location` to generic table spec
Hi Eric, "No two tables globally can have a location overlap?" -- let me correct that, i meant no two tables under the same catalog can have the same location, which I think should be the right thing to do. Otherwise, we might have a vended credential for a table that can be used to access multiple tables under the same catalog. "restriction that you can’t change location is stricter than what we have for Iceberg". This is a current limitation of our Generic table support since we do not have any update support. This restriction can be removed with proper support later. As for multiple locations, since generic tables are designed for non-Iceberg tables today, supporting multiple locations seems unnecessary at this moment. As Yufei mentioned, I think we should start simple and evolve with use cases. The multi-location support in Polaris seems not very well also, the overlap check and credential vending seems all done with one location. Once we have good support for Iceberg, I think it would be very easy for us to generalize it to other table formats if necessary. Best Regards, Yun On Thu, May 22, 2025 at 9:30 AM Yufei Gu wrote: > Inlined. > > On Thu, May 22, 2025 at 7:48 AM Dmitri Bourlatchkov > wrote: > > > > Can we keep it simple for v1 [...] > > > > What is v1 in this context? > > > I meant as the first iteration, sorry for the confusion. > > > > > Thanks, > > Dmitri. > > > > On Wed, May 21, 2025 at 8:42 PM Yufei Gu wrote: > > > > > Can we keep it simple for v1, as one location field is enough for > today’s > > > use cases? And we can revisit multi-location support when there’s real > > > demand. > > > > > > The current API spec already implies that a table’s location is > > immutable, > > > there’s no “alter location” call. I’m fine leaving it implicit, but we > > > could add an explicit note to make that clear if it helps avoid > > confusion. > > > > > > Yufei > > > > > > > > > On Wed, May 21, 2025 at 4:36 PM Eric Maynard > > > > wrote: > > > > > > > No two tables globally can have a location overlap? That’s a stricter > > > > requirement than we have for even Iceberg tables and doesn’t sound > > > correct. > > > > > > > > Similarly, the restriction that you can’t change location is stricter > > > than > > > > what we have for Iceberg. > > > > > > > > Finally, I’m still not sure what the problem is with having multiple > > > > locations. Again, we already track multiple locations for Iceberg. > > > > > > > > On Thu, May 22, 2025 at 12:32 AM yun zou > > > > > wrote: > > > > > > > > > Hi All, > > > > > > > > > > Want to summarize the thread here: > > > > > > > > > > For generic tables, we will add a `location` key to help cross > engine > > > > > sharing and future support for credential vending. > > > > > > > > > > Here is a description about this `location` key and corresponding > > > > > restrictions and responsibilities: > > > > > - `location`(OPTIONAL): table root location in URI format. For > > example: > > > > > s3:///path/to/table. > > > > > - The table root location is a location that includes all files > for > > > the > > > > > table. > > > > > - Clients (engines) are responsible to make sure all files are > > > written > > > > > under the configured location. > > > > > - A table with multiple root locations (i.e. containing files > that > > > are > > > > > outside the configured root location) is not compliant with the > > current > > > > > generic table support in Polaris. > > > > > - No two tables can have the same or overlapped location, > > otherwise, > > > a > > > > > ForbiddenException will be thrown on creation. > > > > > - If no location is provided, clients or users are responsible to > > > > manage > > > > > the location and location related concerns such as path conflict > > check > > > > etc. > > > > > - The location configuration can not be updated once the table is > > > > > created. > > > > > > > > > > This description will be added into the spec. In order to help > > non-API > > > > > users to discover the information easily, we will also get a site > > page > > > to > > > > > describe the support > > > > > for Generic Table and key fields. > > > > > > > > > > Best Regards, > > > > > Yun > > > > > > > > > > On Mon, May 19, 2025 at 11:16 PM yun zou < > yunzou.colost...@gmail.com > > > > > > > > wrote: > > > > > > > > > > > Hi Dmitri, > > > > > > > > > > > > " I do not think those doc comments provide enough visibility to > > > ensure > > > > > > that the key information > > > > > > is received by users, unless they are dealing directly with the > > API" > > > > > > -- Yeah, I agree those information may not be visible enough for > > > users > > > > > who > > > > > > don't directly work with APIs. > > > > > > However, I think just having one page for "location" might be a > > > little > > > > > bit > > > > > > overkill. Given that generic table API support is > > > > > > a new catalog capabilities that Polaris added which is not IRC, I > > > think > > > >
Re: [Discuss] Add `location` to generic table spec
> i meant no two tables under the same catalog can have the same location This is a stricter requirement than we have for Iceberg tables. Are we really going to enforce this? How will we do it efficiently? If not, let's not put it in the spec. > we do not have any update support It would be trivial to add update support for generic entities. Why canonicalize this restriction in the spec? We don't, for example, currently detail a restriction around the fact that you can't change a generic table's format. > generic tables are designed for non-Iceberg tables today, I don't actually think this is true. There's nothing about generic tables that make them more useful for Delta tables than Iceberg tables, for example. On the contrary, I initially proposed the name "generic" in part to capture that generic tables are a catch-all type not specific to any format (including Iceberg). More practically, GenericTableEntity is the type I'm most likely to look to for the conversion service, which means it will indeed be used to represent Iceberg tables. > The multi-location support in Polaris seems not very well also, the overlap check and credential vending seems all done with one location This is not true.