Just some thoughts about "SELECT vs DESCRIBE": If a catalog can distinguish these privileges, it can opt to return the manifest list pointer only, if the caller has the SELECT privilege.

For INSERT/UPDATE/DELETE/TRUNCATE - well, that is really tricky for the reasons how writes happen in Iceberg. Especially for DELETEs, which can be a "delete files" + "write new files" or "just" appending delete-files (merge on read). It becomes even trickier if the engine does not use SQL but for example "raw" Spark operations. I've got no real idea how to map those to a SQL oriented privilege model.

Eventually there's no way around "trust" between the engine and the catalog. Establishing "trust" in a secure way is not that easy IMO.


On 02.07.24 06:30, Jack Ye wrote:
Thanks Dennis for the detailed analysis and suggestions! Here are a few questions and comments I have:

> Consider expanding the set of privilege definitions to be type-specific

I like this! It seems like it solves the problem about inheritance and future grants as you said. I will think a bit more about it, update the doc accordingly, and see what others think.

> we could introduce separate privileges TABLE_READ_DATA vs TABLE_READ_PROPERTIES

In my definition in the doc, anything above table's data files is considered metadata, and TABLE_DESCRIBE governs all the access. There could be more fine-grained DESCRIBE that could be introduced, like TABLE_DESCRIBE_PROPERTIES, TABLE_DESCRIBE_HISTORY,  TABLE_DESCRIBE_PARTITION. But once we get into that level, things might start to overlap. What if the user has TABLE_DESCRIBE_MANIFEST, but not TABLE_DESCRIBE_PARTITION? Do we show partial information about the manifest and remove partition information? I don't have a good solution to that yet, what do you think?

> since "loadTable" is what the Catalog server sees, but then the engine could be satisfied with just the JSON metadata or might be intending to just crack open manifest files to select some aggregate statistics, or might be going all the way to Parquet files.

My personal solution to this is to add a request context, which was prototyped in https://github.com/apache/iceberg/pull/10359. With this, an engine can describe the privileges needed when requesting table metadata. The prerequisite is that the catalog trusts the information passed by the engine through some authZ mechanism, and the engine uses the defined privileges here in the context. For example, if the engine requests table metadata for a DELETE, then the request will loadTable(table_name, context={privilege=DELETE}). Would that be something feasible to solve the concern?

> mapping INSERT/DELETE/UPDATE all to TABLE_WRITE_DATA since at least for now, from the Catalog's perspective, any deletes require being able to write new manifests, and anything that can do inserts by writing new manifests can also effectively "delete" data in the newest snapshot.

Yes I agree the privileges to insert, delete and update seems redundant given the writer can commit whatever manifest list eventually. I think some systems have a similar concept of just MODIFY privilege.

But what if it is used under the fine-grained metadata commit proposal? (https://docs.google.com/document/d/1OG68EtPxLWvNBJACQwcMrRYuGJCnQas8_LSruTRcHG8/edit) Then in that case an insert would result in a different action type in UpdateTable compared to update and delete. It seems like we should try to reach a consensus on the general direction of this proposal first.

-Jack






On Fri, Jun 28, 2024 at 8:53 PM Dennis Huo <huoi...@gmail.com> wrote:

    +1, Thanks Jack and team for getting the discussion started with
    this proposal!

    Much of this is well aligned with what we noticed when
    implementing RBAC for Polaris Catalog, namely that even if a more
    complicated User/Role structure exists outside of the catalog,
    that it's necessary to be able to express some common building
    blocks around "grantee" roles/principals and scoping/definitions
    of grants/privileges to make RBAC enforcement work well and be
    more standardized across engines.

    Your suggestions about initially trying to avoid known problems
    with things like "OWNER" privileges and problems depending on the
    "grantor" in grant records definitely seem like good ideas.

    One thing that came up when trying to distill catalog-enforceable
    privileges in Polaris was that by the nature of Iceberg's metadata
    model, traditional SQL-style privileges ran into rough edges when
    it came to distinguishing e.g. SELECT vs DESCRIBE, or UPDATE vs
    INSERT vs DELETE, since "loadTable" is what the Catalog server
    sees, but then the engine could be satisfied with just the JSON
    metadata or might be intending to just crack open manifest files
    to select some aggregate statistics, or might be going all the way
    to Parquet files.

    One way to address this is if we're willing to make privilege
    definitions more closely reflect the implementation semantics,
    e.g. mapping INSERT/DELETE/UPDATE all to TABLE_WRITE_DATA since at
    least for now, from the Catalog's perspective, any deletes require
    being able to write new manifests, and anything that can do
    inserts by writing new manifests can also effectively "delete"
    data in the newest snapshot.

    It also seems like there's a relationship between having more
    type-specific privileges, the ability to have unambiguous
    hierarchical grants (e.g. granting TABLE_READ_DATA on a namespace
    to inherit the privilege in all child tables), and also having a
    way to express storage-credential-vending privileges under the
    same model.

    A few suggestions relating to this:

      * Consider expanding the set of privilege definitions to be
        type-specific (beyond inferring the type-privilege from the
        object on which a privilege is granted). Maybe there should
        still be a common convention for all the "pure CRUDL"
        operations, but then types might have some additional
        type-specific privileges too
          o Example: NAMESPACE_CREATE, NAMESPACE_READ_PROPERTIES,
            NAMESPACE_WRITE_PROPERTIES, NAMESPACE_DROP,  NAMESPACE_LIST
      * Allow/define a convention for inheriting grants in the
        securable object hierarchy -- though it makes sense to also
        allow for non-inheritance if an implementation wants to keep
        the model simple, if we do have type-specific privileges, it
        at least mitigates one of the listed concerns about accidental
        privileges.
          o For example, if the privilege is only DESCRIBE, then
            granting DESCRIBE on a namespace isn't clear whether it
            should also confer DESCRIBE on tables/views underneath it.
            But we could say NAMESPACE_READ_PROPERTIES on a namespace
            doesn't mean any kind of TABLE/VIEW privileges, while
            TABLE_READ_PROPERTIES granted on a namespace would more
            clearly mean to inherit the ability to read table
            properties underneath that namespace.
          o Hierarchical grants probably also address some of the same
            use cases that people might otherwise address with FUTURE
            GRANTS, and for some scenarios FUTURE GRANTS might be the
            more complex or error-prone alternative
      * To handle the concept of Catalog-based storage-credential
        vending, we could introduce separate privileges
        TABLE_READ_DATA vs TABLE_READ_PROPERTIES and the mutate
        counter parts TABLE_WRITE_DATA vs TABLE_WRITE_PROPERTIES.
        Implementation-wise it could just mean
        TABLE_READ_DATA/TABLE_WRITE_DATA enable receiving
        appropriately-scoped storage credentials (e.g. read-only
        subscoped session token for TABLE_READ_DATA) in things like
        loadTable and createTable(stage-create=true). Whereas
        TABLE_READ_PROPERTIES/TABLE_WRITE_PROPERTIES would only enable
        whatever the REST Catalog server is able to handle directly in
        the REST request/response.

    Would love to hear anyone's thoughts on these areas.

    Cheers,
    Dennis Huo


    On 2024/06/08 19:12:10 Walaa Eldin Moustafa wrote:
    > Thanks Jack and team for working on this proposal. I went over
    it and it is
    > very well written. I particularly like:
    >
    > (1) The fact that it is adopting the SQL standard and adjusting
    some of its
    > semantics to fit the Iceberg model.
    >
    > (2) It includes views from v1. Views are a very important tool
    for policy
    > enforcement. We have built a dynamic privacy and compliance
    enforcement
    > catalog extension at LinkedIn using views [1], and one of the main
    > improvements to that catalog extension would be securable view
    objects.
    > Admittedly, it might require further improvements to compute
    engines to
    > implement the permissions, but having an Iceberg spec would be
    the first
    > step.
    >
    > Looking forward to the next steps of the proposal discussion and
    adoption.
    >
    > [1]
    >
    
https://www.slideshare.net/slideshow/viewshift-hassle-free-dynamic-policy-enforcement-for-every-data-lake/269577447
    >
    > Thanks,
    > Walaa.
    >
    >
    > On Thu, May 30, 2024 at 10:35 PM Jack Ye <ye...@gmail.com> wrote:
    >
    > > Hi everyone,
    > >
    > > Me and a few colleagues at AWS would like to discuss a new
    proposal for
    > > supporting securable objects in the Iceberg REST catalog spec.
    > >
    > > Here is our proposal in Google doc:
    > >
    
https://docs.google.com/document/d/1KmIDbPuN6IYF0nWs9ostXIB9F4b8iH3zZO0hjgs1lm4/edit
    > >
    > > And here is the corresponding GitHub issue:
    > > https://github.com/apache/iceberg/issues/10407
    > >
    > > I will also paste the intro here for an overview. There are 2
    main reasons
    > > for us to look into this area and draft this proposal:
    > >
    > > *IRC lacks clear guidelines on access management requirements:*
    > >
    > > This is feedback we heard frequently when interviewing AWS
    customers using
    > > Iceberg and considering building an IRC. Today Iceberg objects
    (namespaces,
    > > tables, views) are not securable within the Iceberg catalog
    itself, and
    > > need to be secured using an auxiliary system. This means that an
    > > organization building an IRC service needs to wrap many important
    > > operations into custom-built APIs for downstream users to
    consume (e.g. an
    > > API to grant Iceberg table access on S3 needs to grant
    corresponding IAM
    > > users/roles the right S3 policy or ACL setting). Huge amount
    of effort
    > > needs to be spent to figure out what are the missing APIs in
    IRC to satisfy
    > > enterprise level data warehouse access management requirements.
    > >
    > > There are some IRC products that offer vendor-specific APIs
    outside IRC to
    > > perform those operations, but this means that users are
    locked-in to this
    > > vendor’s securable object management system when using the IRC
    solution,
    > > and do not have the true freedom to easily switch to another
    solution if it
    > > offers better price-performance.
    > >
    > > We understand that Iceberg is not a security product, and it
    is not the
    > > best interest of the community to dive too deep into
    security-related
    > > domains. However, we believe that *we should at least offer
    the right
    > > interfaces and set the right standards for how Iceberg catalog
    expresses
    > > securable objects and how Iceberg catalog users interact with
    those objects*,
    > > such that (1) users that would like to build IRC can have a
    clear guideline
    > > of what API constract to implement for managing access to
    objects in IRC,
    > > and (2) users that are on one IRC product do not need to be
    locked-in due
    > > to access management aspects.
    > >
    > > Would really appreciate any feedback on this topic and proposal!
    > >
    > > Best,
    > > Jack Ye
    > >
    >

--
Robert Stupp
@snazy

Reply via email to