Looks good to me! Should we get a PR up to add it to the OpenAPI spec?

On Wed, Jan 31, 2024 at 10:16 AM Jack Ye <yezhao...@gmail.com> wrote:
> Sounds good. I don't really have any strong opinions here. So it looks like
> we are landing on this?
>
> *PreplanTable: POST /v1/namespaces/ns/tables/t/preplan*
> { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"] }
>
> { "plan-tasks": [ { ... }, { ... } ] } // opaque object
>
> *PlanTable w/o a plan task: POST /v1/namespaces/ns/tables/t/plan*
> { "filter": {"type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"] }
>
> { "file-scan-tasks": [ { ... }, { ... } ] } // FileScanTask OpenAPI model
>
> *PlanTable w/ a plan task: POST /v1/namespaces/ns/tables/t/plan*
> { "filter": {"type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"], "plan-task": { ... } }
>
> { "file-scan-tasks": [ { ... }, { ... } ] }
>
> -Jack
>
> On Wed, Jan 31, 2024 at 10:08 AM Ryan Blue <b...@tabular.io> wrote:
>
>> I agree with Dan. I'd rather have two endpoints instead of needing an
>> option that changes the behavior entirely in the same route. I don't think
>> that a `preplan` route would be too bad.
>>
>> On Wed, Jan 31, 2024 at 9:51 AM Daniel Weeks <dwe...@apache.org> wrote:
>>
>>> I agree with the opaque tokens.
>>>
>>> However, I'm concerned we're overloading the endpoint to perform two
>>> distinctly different operations: distribute a plan and scan a plan.
>>>
>>> Changing the task-type then changes the behavior and the result. I feel
>>> it would be more straightforward to separate the distribute and scan
>>> endpoints. Then clients can call the scan directly if they do not know how
>>> to distribute, and the behavior is clear from the REST specification.
>>>
>>> -Dan
>>>
>>> On Tue, Jan 30, 2024 at 9:09 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> +1 for having the opaque plan tasks, that's probably the most flexible
>>>> way forward. And let's call them *plan tasks* going forward to
>>>> standardize the terminology.
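If the shape above is what lands in the spec, the two-endpoint flow could be sketched roughly as below. This is illustrative only: the field names come from the JSON examples in the thread, and the HTTP transport is omitted entirely, so the functions only build the request bodies a client would POST.

```python
# Sketch of the request bodies for the converged PreplanTable / PlanTable
# endpoints discussed above. Not the final OpenAPI spec.

def preplan_request(filter_expr, select):
    # Body for POST /v1/namespaces/{ns}/tables/{t}/preplan
    return {"filter": filter_expr, "select": select}

def plan_request(filter_expr, select, plan_task=None):
    # Body for POST /v1/namespaces/{ns}/tables/{t}/plan.
    # Without a plan-task the service plans the whole scan; with one, it
    # resolves a single opaque plan task into file scan tasks.
    body = {"filter": filter_expr, "select": select}
    if plan_task is not None:
        body["plan-task"] = plan_task  # opaque JSON, sent back as-is
    return body

in_filter = {"type": "in", "term": "x", "values": [1, 2, 3]}
print(preplan_request(in_filter, ["x", "a.b"]))
print(plan_request(in_filter, ["x", "a.b"], plan_task={"token": "..."}))
```

Note that the opacity of the plan task shows up naturally here: the client never inspects `plan_task`, it only echoes it back.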
>>>> I think the name of the APIs can be determined based on the actual API
>>>> shape. For example, if we centralize these 2 plan and pre-plan actions to a
>>>> single API endpoint but just request different task types:
>>>>
>>>> *pre-plan: POST /v1/namespaces/ns/tables/t/plan*
>>>> { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"], "task-type": "plan-task" }
>>>>
>>>> { "plan-tasks": [ { ... }, { ... } ] }
>>>>
>>>> *plan without a plan-task: POST /v1/namespaces/ns/tables/t/plan*
>>>> { "filter": {"type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"], "task-type": "file-scan-task" } // file-scan-task should be the default type
>>>>
>>>> { "file-scan-tasks": [ { ... }, { ... } ] }
>>>>
>>>> *plan with a plan-task: POST /v1/namespaces/ns/tables/t/plan*
>>>> { "filter": {"type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"], "task-type": "file-scan-task", "plan-task": { ... } }
>>>>
>>>> { "file-scan-tasks": [...] }
>>>>
>>>> In this model, we just have a single API, and we can call it something
>>>> like PlanTable or PlanTableScan.
>>>>
>>>> What do you think?
>>>>
>>>> -Jack
>>>>
>>>> On Mon, Jan 29, 2024 at 6:17 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>>>>
>>>>> But to move forward, I think we should go with the option that
>>>>>> preserves flexibility. I think the spec should state that plan tasks (if we
>>>>>> call them that) are a JSON object that should be sent as-is back to the
>>>>>> REST service to be used.
>>>>>
>>>>> +1 for this.
>>>>>
>>>>> > One more thing that I would also change is that I don't think the
>>>>> "plan" and "scan" endpoints make much sense. We refer to the "scan" portion
>>>>> of this as "planFiles" in the reference implementation, and "scan" is used
>>>>> for actually reading data.
To be less confusing, I think that file scan
>>>>> tasks should be returned by a "plan" endpoint and the manifest plan tasks
>>>>> (or shards) should be returned by a "pre-plan" endpoint. Does anyone else
>>>>> like the names "pre-plan" and "plan" better?
>>>>>
>>>>> I agree that "scan" may be quite confusing since it's actually
>>>>> planning the file scan. Another option I can propose is: "plan" ->
>>>>> "plan-table-scan", "scan" -> "plan-file-scan"
>>>>>
>>>>> On Tue, Jan 30, 2024 at 9:03 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>>> As you noted, the main point we still need to decide on is whether to
>>>>>> have a standard "shard" definition (e.g. manifest plan task) or to allow it
>>>>>> to be opaque and specific to catalogs implementing the protocol. I've not
>>>>>> replied because I keep coming back to this decision and I'm not sure
>>>>>> whether the advantage is being clear about how it works (being explicit) or
>>>>>> allowing implementations to differ (opaque). I'm skeptical that there will
>>>>>> be other strategies.
>>>>>>
>>>>>> But to move forward, I think we should go with the option that
>>>>>> preserves flexibility. I think the spec should state that plan tasks (if we
>>>>>> call them that) are a JSON object that should be sent as-is back to the
>>>>>> REST service to be used.
>>>>>>
>>>>>> One more thing that I would also change is that I don't think the
>>>>>> "plan" and "scan" endpoints make much sense. We refer to the "scan" portion
>>>>>> of this as "planFiles" in the reference implementation, and "scan" is used
>>>>>> for actually reading data. To be less confusing, I think that file scan
>>>>>> tasks should be returned by a "plan" endpoint and the manifest plan tasks
>>>>>> (or shards) should be returned by a "pre-plan" endpoint. Does anyone else
>>>>>> like the names "pre-plan" and "plan" better?
>>>>>> Ryan
>>>>>>
>>>>>> On Mon, Jan 29, 2024 at 12:02 PM Chertara, Rahil
>>>>>> <rcher...@amazon.com.invalid> wrote:
>>>>>>
>>>>>>> Hi all, hope everyone is doing well.
>>>>>>>
>>>>>>> I wanted to revive the discussion around the REST table scan API work.
>>>>>>> For a refresher, here is the original proposal:
>>>>>>> https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h
>>>>>>> as well as the PR: https://github.com/apache/iceberg/pull/9252
>>>>>>>
>>>>>>> From the last messages on the thread, I believe Ryan and Jack were
>>>>>>> in favor of having two distinct API endpoints, /plan and /scan, as well as a
>>>>>>> stricter JSON definition for the "shard"; here is an example below from
>>>>>>> what was discussed.
>>>>>>>
>>>>>>> *POST /v1/namespaces/ns/tables/t/plan*
>>>>>>> { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"]}
>>>>>>>
>>>>>>> { "manifest-plan-tasks": [
>>>>>>>   { "start": 0, "length": 1000, "manifest": { "path": "s3://some/manifest.avro", ...}, "delete-manifests": [...] },
>>>>>>>   { ... }
>>>>>>> ]}
>>>>>>>
>>>>>>> *POST /v1/namespaces/ns/tables/t/scan*
>>>>>>> { "filter": {"type": "in", "term": "x", "values": [1, 2, 3] },
>>>>>>>   "select": ["x", "a.b"],
>>>>>>>   "manifest-plan-task": { "start": 0, "length": 1000, "manifest": { "path": "s3://some/manifest.avro", ...}, "delete-manifests": [...] } }
>>>>>>>
>>>>>>> { "file-scan-tasks": [...] }
>>>>>>>
>>>>>>> *POST /v1/namespaces/ns/tables/t/scan*
>>>>>>> { "filter": {"type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"]}
>>>>>>>
>>>>>>> { "file-scan-tasks": [...
} >>>>>>> >>>>>>> >>>>>>> >>>>>>> However IIRC Micah and Renjie had some concerns around this >>>>>>> stricter structure as this can make it harder to evolve in the future, >>>>>>> as >>>>>>> well as some potential scalability challenges for larger tables that >>>>>>> have >>>>>>> many manifest files. (Feel free to expand further on the concerns if my >>>>>>> understanding is incorrect). >>>>>>> >>>>>>> >>>>>>> >>>>>>> Would appreciate if the community can leave any more >>>>>>> thoughts/feedback on this thread, as well as on the google doc, and the >>>>>>> PR. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Regards, >>>>>>> Rahil Chertara >>>>>>> >>>>>>> >>>>>>> >>>>>>> *From: *Renjie Liu <liurenjie2...@gmail.com> >>>>>>> *Reply-To: *"dev@iceberg.apache.org" <dev@iceberg.apache.org> >>>>>>> *Date: *Thursday, December 21, 2023 at 10:35 PM >>>>>>> *To: *"dev@iceberg.apache.org" <dev@iceberg.apache.org> >>>>>>> *Subject: *RE: [EXTERNAL] Proposal for REST APIs for Iceberg table >>>>>>> scans >>>>>>> >>>>>>> >>>>>>> >>>>>>> *CAUTION*: This email originated from outside of the organization. >>>>>>> Do not click links or open attachments unless you can confirm the sender >>>>>>> and know the content is safe. >>>>>>> >>>>>>> >>>>>>> >>>>>>> I share the same concern with Micah. The shard detail should be >>>>>>> implementation details of the server, rather than exposing directly to >>>>>>> the >>>>>>> client. If the goal is to make things stateless, we just need to attach >>>>>>> a >>>>>>> snapshot id + shard id, then a determined algorithm is supposed to give >>>>>>> the >>>>>>> same result. Also another concern is for huge analytics tables, we may >>>>>>> have >>>>>>> a lot of manifest files, which may lead to large traffic from the rest >>>>>>> server. >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, Dec 21, 2023 at 7:41 AM Micah Kornfield < >>>>>>> emkornfi...@gmail.com> wrote: >>>>>>> >>>>>>> Also +1 for having a more strict definition of the shard. 
Having >>>>>>> arbitrary JSON was basically what we experimented with a string shard >>>>>>> ID, >>>>>>> and we ended up with something very similar to the manifest plan task >>>>>>> you >>>>>>> describe in the serialized ID string. >>>>>>> >>>>>>> >>>>>>> >>>>>>> IIUC the proposal correctly, I'd actually be -0.0 on the stricter >>>>>>> structure. I think forcing a contract where it isn't strictly necessary >>>>>>> makes it harder to evolve the system in the future. For example it >>>>>>> makes >>>>>>> it harder to address potential scalability problems in a transparent way >>>>>>> (e.g. extreme edge cases in cardinality between manifest files and >>>>>>> delete >>>>>>> files). >>>>>>> >>>>>>> >>>>>>> >>>>>>> It also seems like it might overly constrain implementations (it is >>>>>>> not clear we should need to compute the mapping between delete file >>>>>>> manifests to data file manifests up front to start planning). >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Tue, Dec 19, 2023 at 2:10 PM Jack Ye <yezhao...@gmail.com> wrote: >>>>>>> >>>>>>> +1 for having /plan and /scan, sounds like a good idea to separate >>>>>>> those 2 distinct actions. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Also +1 for having a more strict definition of the shard. Having >>>>>>> arbitrary JSON was basically what we experimented with a string shard >>>>>>> ID, >>>>>>> and we ended up with something very similar to the manifest plan task >>>>>>> you >>>>>>> describe in the serialized ID string. >>>>>>> >>>>>>> >>>>>>> >>>>>>> So sounds like we are converging to the following APIs: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> *POST /v1/namespaces/ns/tables/t/plan *{ "filter": { "type": "in", >>>>>>> "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"]} >>>>>>> >>>>>>> { "manifest-plan-tasks": [ >>>>>>> { "start": 0, "length": 1000, "manifest": { "path": >>>>>>> "s3://some/manifest.avro", ...}, "delete-manifests": [...] }, >>>>>>> { ... 
} >>>>>>> ]} >>>>>>> >>>>>>> >>>>>>> *POST /v1/namespaces/ns/tables/t/scan *{ "filter": {"type": "in", >>>>>>> "term": "x", "values": [1, 2, 3] }, >>>>>>> >>>>>>> "select": ["x", "a.b"], >>>>>>> "manifest-plan-task": { "start": 0, "length": 1000, "manifest": { >>>>>>> "path": "s3://some/manifest.avro", ...}, "delete-manifests": [...] } } >>>>>>> >>>>>>> { "file-scan-tasks": [...] } >>>>>>> >>>>>>> >>>>>>> *POST /v1/namespaces/ns/tables/t/scan *{ "filter": {"type": "in", >>>>>>> "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"]} >>>>>>> >>>>>>> >>>>>>> { "file-scan-tasks": [...] } >>>>>>> >>>>>>> >>>>>>> >>>>>>> If this sounds good overall, we can update the prototype to have >>>>>>> more detailed discussions in code. >>>>>>> >>>>>>> >>>>>>> >>>>>>> -Jack >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, Dec 14, 2023 at 6:10 PM Ryan Blue <b...@tabular.io> wrote: >>>>>>> >>>>>>> The tasks might look something like this: >>>>>>> >>>>>>> >>>>>>> >>>>>>> CombinedPlanTask >>>>>>> >>>>>>> - List<ManifestPlanTask> >>>>>>> >>>>>>> >>>>>>> >>>>>>> ManifestPlanTask >>>>>>> >>>>>>> - int start >>>>>>> >>>>>>> - int length >>>>>>> >>>>>>> - ManifestFile dataManifest >>>>>>> >>>>>>> - List<ManifestFile> deleteManifests >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, Dec 14, 2023 at 4:07 PM Ryan Blue <b...@tabular.io> wrote: >>>>>>> >>>>>>> Seems like that track has expired (This Internet-Draft will expire >>>>>>> on 13 May 2022) >>>>>>> >>>>>>> Yeah, looks like we should just use POST. That’s too bad. QUERY >>>>>>> seems like a good idea to me. >>>>>>> >>>>>>> Distinguish planning using shard or not >>>>>>> >>>>>>> I think this was a mistake on my part. I was still thinking that we >>>>>>> would have a different endpoint for first-level planning to produce >>>>>>> shards >>>>>>> and the route to actually get files. 
Since both are POST requests with >>>>>>> the >>>>>>> same path (/v1/namespaces/ns/tables/t/scans) that no longer works. >>>>>>> What about /v1/namespaces/ns/tables/t/scan and >>>>>>> /v1/namespaces/ns/tables/t/plan? The latter could use some variant >>>>>>> of planFiles since that’s what we are wrapping in the Java API. >>>>>>> >>>>>>> Necessity of scan ID >>>>>>> >>>>>>> Yes, I agree. If you have shard IDs then you don’t really need a >>>>>>> scan ID. You could always have one internally but send it as part of the >>>>>>> shard ID. >>>>>>> >>>>>>> Shape of shard payload >>>>>>> >>>>>>> I think we have 2 general options depending on how strict we want to >>>>>>> be. >>>>>>> >>>>>>> 1. Require a standard shard definition >>>>>>> 2. Allow arbitrary JSON and leave it to the service >>>>>>> >>>>>>> I lean toward the first option, which would be a data manifest and >>>>>>> the associated delete manifests for the partition. We could also extend >>>>>>> that to a group of manifests, each with a list of delete manifests. And >>>>>>> we >>>>>>> could also allow splitting to ensure tasks don’t get too large with big >>>>>>> files. This all looks basically like FileScanTask, but with manifests >>>>>>> and >>>>>>> delete manifests. >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Dec 13, 2023 at 4:39 PM Jack Ye <yezhao...@gmail.com> wrote: >>>>>>> >>>>>>> Seems like that track has expired (This Internet-Draft will expire >>>>>>> on 13 May 2022), not sure how these RFCs are managed, but it does not >>>>>>> seem >>>>>>> hopeful to have this verb in. I think people are mostly using POST for >>>>>>> this >>>>>>> use case already. >>>>>>> >>>>>>> >>>>>>> >>>>>>> But overall I think we are in agreement with the general direction. 
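The "standard shard definition" option here matches the ManifestPlanTask listing Ryan gave further up the thread. Rendered as simple data models it might look like this sketch; the Python dataclass form and the reduction of ManifestFile to a bare path are my simplifications for illustration, not the proposed spec model.

```python
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class ManifestPlanTask:
    # A byte range within one data manifest, plus the delete manifests
    # that may apply to rows produced from that range.
    start: int
    length: int
    manifest: str  # data manifest path (ManifestFile simplified to a path)
    delete_manifests: List[str] = field(default_factory=list)

@dataclass
class CombinedPlanTask:
    # A group of manifest plan tasks that one worker can process together.
    tasks: List[ManifestPlanTask] = field(default_factory=list)

task = ManifestPlanTask(0, 1000, "s3://some/manifest.avro",
                        ["s3://some/deletes.avro"])
print(asdict(task))
```

Serializing via `asdict` shows why this "all looks basically like FileScanTask, but with manifests": it is the same range-plus-deletes shape, one level up.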
>>>>>>> A few detail discussions: >>>>>>> >>>>>>> >>>>>>> >>>>>>> *Distinguish planning using shard or not* >>>>>>> >>>>>>> Maybe we should add a query parameter like *distributed=true* to >>>>>>> distinguish your first and third case, since they are now sharing the >>>>>>> same >>>>>>> signature. If the requester wants to use distributed planning, then some >>>>>>> sharding strategy is provided as a response for the requester to send >>>>>>> more >>>>>>> detailed requests. >>>>>>> >>>>>>> *Necessity of scan ID* >>>>>>> In this approach, is scan ID still required? Because the shard >>>>>>> payload already fully describes the information to retrieve, it seems >>>>>>> like >>>>>>> we can just drop the *scan-id* query parameter in the second case. >>>>>>> Seems like it's kept for the case if we still want to persist some >>>>>>> state, >>>>>>> but it seems like we can make a stateless style fully working. >>>>>>> >>>>>>> *Shape of shard payload* >>>>>>> What do you think is necessary information of the shard payload? It >>>>>>> seems like we need at least the location of the manifests, plus the >>>>>>> delete >>>>>>> manifests or delete files associated with the manifests. I like the >>>>>>> idea of >>>>>>> making it a "shard task" that is similar to a file scan task, and it >>>>>>> might >>>>>>> allow us to return a mixture of both types of tasks, so we can have >>>>>>> better >>>>>>> control of the response size. >>>>>>> >>>>>>> -Jack >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Dec 13, 2023 at 3:50 PM Ryan Blue <b...@tabular.io> wrote: >>>>>>> >>>>>>> I just changed it to POST after looking into support for the QUERY >>>>>>> method. It's a new HTTP method for cases like this where you don't want >>>>>>> to >>>>>>> pass everything through query params. Here's the QUERY method RFC >>>>>>> <https://www.ietf.org/archive/id/draft-ietf-httpbis-safe-method-w-body-02.html>, >>>>>>> but I guess it isn't finalized yet? 
>>>>>>> >>>>>>> >>>>>>> >>>>>>> Just read them like you would a POST request that doesn't actually >>>>>>> create anything. >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Dec 13, 2023 at 3:45 PM Jack Ye <yezhao...@gmail.com> wrote: >>>>>>> >>>>>>> Thanks, the Gist explains a lot of things. This is actually very >>>>>>> close to our way of implementing the shard ID, we were defining the >>>>>>> shard >>>>>>> ID as a string, and the string content is actually something similar to >>>>>>> the >>>>>>> information of the JSON payload you showed, so we can persist minimum >>>>>>> information in storage. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Just one clarification needed for your Gist: >>>>>>> >>>>>>> >>>>>>> >>>>>>> > QUERY /v1/namespaces/ns/tables/t/scans?scan-id=1 >>>>>>> >>>>>>> >>>>>>> >>>>>>> > { "shard": { "id": 1, "manifests": ["C"] }, "filter": {"type": >>>>>>> "in", "term": "x", "values": [1, 2, 3] } } >>>>>>> >>>>>>> >>>>>>> >>>>>>> > >>>>>>> >>>>>>> > { "file-scan-tasks": [...] } >>>>>>> >>>>>>> >>>>>>> >>>>>>> Here, what does this QUERY verb mean? Is that a GET? If it's GET, we >>>>>>> cannot have a request body. That's actually why we expressed that as an >>>>>>> ID >>>>>>> string, since we can put it as a query parameter. >>>>>>> >>>>>>> >>>>>>> >>>>>>> -Jack >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Dec 13, 2023 at 3:25 PM Ryan Blue <b...@tabular.io> wrote: >>>>>>> >>>>>>> Jack, >>>>>>> >>>>>>> It sounds like what I’m proposing isn’t quite clear because your >>>>>>> initial response was arguing for a sharding capability. I agree that >>>>>>> sharding is a good idea. I’m less confident about two points: >>>>>>> >>>>>>> 1. Requiring that the service is stateful. As Renjie pointed >>>>>>> out, that makes it harder to scale the service. >>>>>>> 2. 
The need for both pagination *and* sharding as separate >>>>>>> things >>>>>>> >>>>>>> And I also think that Fokko has a good point about trying to keep >>>>>>> things simple and not requiring the CreateScan endpoint. >>>>>>> >>>>>>> For the first point, I’m proposing that we still have a CreateScan >>>>>>> endpoint, but instead of sending only a list of shard IDs it can also >>>>>>> send >>>>>>> either a standard shard “task” or an optional JSON definition. Let’s >>>>>>> assume >>>>>>> we can send arbitrary JSON for an example. Say I have a table with 4 >>>>>>> manifests, A through D and that C and D match some query filter. >>>>>>> When I call the CreateScan endpoint, the service would send back >>>>>>> tasks with that information: {"id": 1, "manifests": ["C"]}, {"id": >>>>>>> 2, "manifests": ["D"]}. By sending what the shards mean (the >>>>>>> manifests to read), my service can be stateless: any node can get a >>>>>>> request >>>>>>> for shard 1, read manifest C, and send back the resulting data >>>>>>> files. >>>>>>> >>>>>>> I don’t see much of an argument against doing this *in principle*. >>>>>>> It gives you the flexibility to store state if you choose or to send >>>>>>> state >>>>>>> to the client for it to pass back when calling the GetTasks >>>>>>> endpoint. There is a practical problem, which is that it’s annoying to >>>>>>> send >>>>>>> a GET request with a JSON payload because you can’t send a request body. >>>>>>> It’s probably obvious, but I’m also not a REST purist so I’d be fine >>>>>>> using >>>>>>> POST or QUERY for this. It would look something like this Gist >>>>>>> <https://gist.github.com/rdblue/d2b65bd2ad20f85ee9d04ccf19ac8aba>. >>>>>>> >>>>>>> In your last reply, you also asked whether a stateless service is a >>>>>>> goal. I don’t think that it is, but if we can make simple changes to the >>>>>>> spec to allow more flexibility on the server side, I think that’s a good >>>>>>> direction. 
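Ryan's A-through-D example can be sketched end to end. The manifest contents below are made up, and a real service would derive the matching manifests from the filter; the point is only the control flow: because each shard names its manifests, any stateless node can resolve it.

```python
# Stateless sharding sketch following the A-D example above: CreateScan
# returns shards that carry everything needed to resolve them later.

MANIFESTS = {"A": ["a1.parquet"], "B": ["b1.parquet"],
             "C": ["c1.parquet", "c2.parquet"], "D": ["d1.parquet"]}

def create_scan(matching_manifests):
    # One shard per matching manifest; the shard body IS the state.
    return [{"id": i + 1, "manifests": [m]}
            for i, m in enumerate(matching_manifests)]

def get_tasks(shard):
    # Any node can serve this request: the shard names its manifests,
    # so no server-side lookup of shard state is needed.
    return [f for m in shard["manifests"] for f in MANIFESTS[m]]

shards = create_scan(["C", "D"])  # the filter matched C and D
print(shards)       # [{'id': 1, 'manifests': ['C']}, {'id': 2, 'manifests': ['D']}]
print(get_tasks(shards[0]))      # ['c1.parquet', 'c2.parquet']
```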
You also asked about a reference implementation and I >>>>>>> consider >>>>>>> CatalogHandlers >>>>>>> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java> >>>>>>> to be that reference. It does everything except for the work done by >>>>>>> your >>>>>>> choice of web application framework. It isn’t stateless, but it only >>>>>>> relies >>>>>>> on a Catalog implementation for persistence. >>>>>>> >>>>>>> For the second point, I don’t understand why we need both sharding >>>>>>> and pagination. That is, if we have a protocol that allows sharding, >>>>>>> why is >>>>>>> pagination also needed? From my naive perspective on how sharding would >>>>>>> work, we should be able to use metadata from the manifest list to limit >>>>>>> the >>>>>>> potential number of data files in a given shard. As long as we can limit >>>>>>> the size of a shard to produce more, pagination seems like unnecessary >>>>>>> complication. >>>>>>> >>>>>>> Lastly, for Fokko’s point, I think another easy extension to the >>>>>>> proposal is to support a direct call to GetTasks. There’s a >>>>>>> trade-off here, but if you’re already sending the original filter along >>>>>>> with the request (in order to filter records from manifest C for >>>>>>> instance) then the request is already something the protocol can >>>>>>> express. >>>>>>> There’s an objection concerning resource consumption on the service and >>>>>>> creating responses that are too large or take too long, but we can get >>>>>>> around that by responding with a code that instructs the client to use >>>>>>> the >>>>>>> CreateScan API like 413 (Payload too large). I think that would >>>>>>> allow simple clients to function for all but really large tables. The >>>>>>> gist >>>>>>> above also shows what this might look like. 
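The 413 fallback described in the last paragraph might look like the following sketch. The paths and response shapes follow the thread's examples; the transport is a stand-in callable rather than a real HTTP client, and `fake_post` is a hypothetical service for demonstration.

```python
# Sketch of the fallback idea: a simple client asks for file scan tasks
# directly and only switches to distributed planning when the service
# answers 413 (Payload Too Large).

def scan_with_fallback(post, filter_expr):
    status, body = post("/scan", {"filter": filter_expr})
    if status == 413:
        # Table too large for one response: pre-plan, then resolve each
        # shard with its own request.
        _, plan = post("/plan", {"filter": filter_expr})
        tasks = []
        for shard in plan["manifest-plan-tasks"]:
            _, part = post("/scan", {"filter": filter_expr,
                                     "manifest-plan-task": shard})
            tasks.extend(part["file-scan-tasks"])
        return tasks
    return body["file-scan-tasks"]

def fake_post(path, body):
    # Hypothetical service: too large for direct scans, two shards.
    if path == "/plan":
        return 200, {"manifest-plan-tasks": [{"id": 1}, {"id": 2}]}
    if "manifest-plan-task" in body:
        return 200, {"file-scan-tasks": [f"task-{body['manifest-plan-task']['id']}"]}
    return 413, None

print(scan_with_fallback(fake_post, {"type": "in"}))  # ['task-1', 'task-2']
```

The appeal of this shape is that small clients never need to learn the pre-plan protocol until a service actually refuses a direct scan.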
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Wed, Dec 13, 2023 at 11:53 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>
>>>>>>> The current proposal definitely makes the server stateful. In our
>>>>>>> prototype we used other components like DynamoDB to keep track of state.
>>>>>>> If keeping it stateless is a tenet, we can definitely make the proposal
>>>>>>> closer to that direction. Maybe one thing to make sure of is: is this a
>>>>>>> core tenet of the REST spec? Today we do not even have an official
>>>>>>> reference implementation of the REST server, so I feel it is hard to say
>>>>>>> what the core tenets are. Maybe we should create one?
>>>>>>>
>>>>>>> Pagination is a common issue in the REST spec. We also see similar
>>>>>>> limitations with other APIs like GetTables and GetNamespaces. When a
>>>>>>> catalog has many namespaces and tables it suffers from the same issue. It
>>>>>>> is also not ideal for use cases like web browsers, since typically you
>>>>>>> display a small page of results and do not need the full list immediately.
>>>>>>> So I feel we cannot really avoid some state being kept for those use cases.
>>>>>>>
>>>>>>> A chunked response might be a good way to work around it. We also
>>>>>>> thought about using HTTP/2. However, these options seem to be not very
>>>>>>> compatible with OpenAPI. We can do some further research in this domain;
>>>>>>> we would really appreciate it if anyone with more insight and experience
>>>>>>> with OpenAPI can provide some suggestions.
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>> On Tue, Dec 12, 2023 at 6:21 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi, Rahil and Jack:
>>>>>>>
>>>>>>> Thanks for raising this.
>>>>>>>
>>>>>>> My question is that the pagination and sharding will make the REST
>>>>>>> server stateful, e.g.
a sequence of calls is required to go to the same
>>>>>>> server. In this case, how do we ensure the scalability of the REST server?
>>>>>>>
>>>>>>> On Wed, Dec 13, 2023 at 4:09 AM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>>
>>>>>>> Hey Rahil and Jack,
>>>>>>>
>>>>>>> Thanks for bringing this up. Ryan and I also discussed this briefly
>>>>>>> in the early days of PyIceberg and it would have helped a lot in the speed
>>>>>>> of development. We went for the traditional approach because that would
>>>>>>> also support all the other catalogs, but now that the REST catalog is
>>>>>>> taking off, I think it still makes a lot of sense to get it in.
>>>>>>>
>>>>>>> I do share the concern raised by Ryan around the concepts of shards
>>>>>>> and pagination. For PyIceberg (but also for Go, Rust, and DuckDB), which
>>>>>>> are living in a single process today, the concept of shards doesn't add
>>>>>>> value. I see your concern with long-running jobs, but for the
>>>>>>> non-distributed cases, it will add additional complexity.
>>>>>>>
>>>>>>> Some suggestions that come to mind:
>>>>>>>
>>>>>>> - Stream the tasks directly back using a chunked response, reducing
>>>>>>>   the latency to the first task. This would also solve things with the
>>>>>>>   pagination. The only downside I can think of is having delete files,
>>>>>>>   where you first need to make sure there are deletes relevant to the
>>>>>>>   task; this might increase latency to the first task.
>>>>>>> - Making the sharding optional. If you want to shard, you call the
>>>>>>>   CreateScan first and then call the GetScanTask with the IDs. If you
>>>>>>>   don't want to shard, you omit the shard parameter and fetch the tasks
>>>>>>>   directly (here we also need to replace the scan string with the full
>>>>>>>   column/expression/snapshot-id etc).
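The chunked-response suggestion could be sketched as newline-delimited JSON: the server emits one task per line as planning produces it, and the client decodes each line as it arrives. This is only one plausible wire format, and the task source here is a plain iterable standing in for lazy planning.

```python
import json

# Streaming-tasks sketch: one file scan task per NDJSON line, so the
# client sees its first task before planning finishes.

def stream_tasks(tasks):
    # Server side: yield a chunk per task as planning produces it.
    for t in tasks:
        yield json.dumps(t) + "\n"

def read_tasks(chunks):
    # Client side: decode each line as soon as it arrives.
    for line in chunks:
        yield json.loads(line)

first = next(read_tasks(stream_tasks([{"file": "a.parquet"},
                                      {"file": "b.parquet"}])))
print(first)  # {'file': 'a.parquet'}
```

The latency benefit comes from the generators never materializing the full plan; the delete-file caveat Fokko raises would show up as a delay before the first `yield`.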
>>>>>>> >>>>>>> Looking forward to discussing this tomorrow in the community sync >>>>>>> <https://iceberg.apache.org/community/#iceberg-community-events>! >>>>>>> >>>>>>> >>>>>>> >>>>>>> Kind regards, >>>>>>> >>>>>>> Fokko >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> Op ma 11 dec 2023 om 19:05 schreef Jack Ye <yezhao...@gmail.com>: >>>>>>> >>>>>>> Hi Ryan, thanks for the feedback! >>>>>>> >>>>>>> >>>>>>> >>>>>>> I was a part of this design discussion internally and can provide >>>>>>> more details. One reason for separating the CreateScan operation was to >>>>>>> make the API asynchronous and thus keep HTTP communications short. >>>>>>> Consider >>>>>>> the case where we only have GetScanTasks API, and there is no shard >>>>>>> specified. It might take tens of seconds, or even minutes to read >>>>>>> through >>>>>>> all the manifest list and manifests before being able to return >>>>>>> anything. >>>>>>> This means the HTTP connection has to remain open during that period, >>>>>>> which >>>>>>> is not really a good practice in general (consider connection failure, >>>>>>> load >>>>>>> balancer and proxy load, etc.). And when we shift the API to >>>>>>> asynchronous, >>>>>>> it basically becomes something like the proposal, where a stateful ID is >>>>>>> generated to be able to immediately return back to the client, and the >>>>>>> client get results by referencing the ID. So in our current prototype >>>>>>> implementation we are actually keeping this ID and the whole REST >>>>>>> service >>>>>>> is stateful. >>>>>>> >>>>>>> >>>>>>> >>>>>>> There were some thoughts we had about the possibility to define a >>>>>>> "shard ID generator" protocol: basically the client agrees with the >>>>>>> service >>>>>>> a way to deterministically generate shard IDs, and service uses it to >>>>>>> create shards. That sounds like what you are suggesting here, and it >>>>>>> pushes >>>>>>> the responsibility to the client side to determine the parallelism. 
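The asynchronous pattern Jack describes (return an ID immediately so the HTTP call stays short, then fetch results by ID) can be sketched with an in-memory store standing in for the stateful component (e.g. the DynamoDB table mentioned above). The class and method names here are illustrative, not the prototype's actual API.

```python
import itertools

# Async CreateScan sketch: the create call returns at once with an ID;
# the client polls GetScanTasks until planning completes.

class AsyncScanService:
    def __init__(self):
        self._ids = itertools.count(1)
        self._scans = {}  # stand-in for a persistent state store

    def create_scan(self, filter_expr):
        # Return immediately; planning would continue in the background.
        scan_id = next(self._ids)
        self._scans[scan_id] = {"status": "planning", "tasks": []}
        return scan_id

    def finish_planning(self, scan_id, tasks):
        # Invoked by the background planner when it is done.
        self._scans[scan_id] = {"status": "done", "tasks": tasks}

    def get_scan_tasks(self, scan_id):
        # Client polls this until status is "done".
        return self._scans[scan_id]

svc = AsyncScanService()
sid = svc.create_scan({"type": "in", "term": "x", "values": [1, 2, 3]})
print(svc.get_scan_tasks(sid)["status"])   # planning
svc.finish_planning(sid, [{"file": "a.parquet"}])
print(svc.get_scan_tasks(sid)["status"])   # done
```

This makes the trade-off in the thread concrete: the connection stays short, but the ID forces the service (or a shared store) to hold state between the two calls.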
But >>>>>>> in >>>>>>> some bad cases (e.g. there are many delete files and we need to read all >>>>>>> those in each shard to apply filters), it seems like there might still >>>>>>> be >>>>>>> the long open connection issue above. What is your thought on that? >>>>>>> >>>>>>> >>>>>>> >>>>>>> -Jack >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Sun, Dec 10, 2023 at 10:27 AM Ryan Blue <b...@tabular.io> wrote: >>>>>>> >>>>>>> Rahil, thanks for working on this. It has some really good ideas >>>>>>> that we hadn't considered before like a way for the service to plan how >>>>>>> to >>>>>>> break up the work of scan planning. I really like that idea because it >>>>>>> makes it much easier for the service to keep memory consumption low >>>>>>> across >>>>>>> requests. >>>>>>> >>>>>>> >>>>>>> >>>>>>> My primary feedback is that I think it's a little too complicated >>>>>>> (with both sharding and pagination) and could be modified slightly so >>>>>>> that >>>>>>> the service doesn't need to be stateful. If the service isn't >>>>>>> necessarily >>>>>>> stateful then it should be easier to build implementations. >>>>>>> >>>>>>> >>>>>>> >>>>>>> To make it possible for the service to be stateless, I'm proposing >>>>>>> that rather than creating shard IDs that are tracked by the service, the >>>>>>> information for a shard can be sent to the client. My assumption here is >>>>>>> that most implementations would create shards by reading the manifest >>>>>>> list, >>>>>>> filtering on partition ranges, and creating a shard for some reasonable >>>>>>> size of manifest content. For example, if a table has 100MB of metadata >>>>>>> in >>>>>>> 25 manifests that are about 4 MB each, then it might create 9 shards >>>>>>> with >>>>>>> 1-4 manifests each. 
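The 100 MB / 25-manifest arithmetic above suggests a simple grouping heuristic: pack manifests greedily until a byte budget is reached. This is one plausible strategy, not anything the proposal specifies; sizes and the budget below are illustrative.

```python
# Shard-sizing sketch: group manifests under a metadata byte budget so
# each shard covers a bounded amount of manifest content.

def plan_shards(manifest_sizes, budget_bytes):
    shards, current, used = [], [], 0
    for name, size in manifest_sizes:
        if current and used + size > budget_bytes:
            shards.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        shards.append(current)
    return shards

# 25 manifests of ~4 MB with a 12 MB budget -> 3 manifests per shard,
# i.e. 9 shards, matching the example in the thread.
manifests = [(f"m{i}", 4_000_000) for i in range(25)]
shards = plan_shards(manifests, 12_000_000)
print(len(shards))   # 9
print(shards[0])     # ['m0', 'm1', 'm2']
```

A real implementation would also filter on partition ranges from the manifest list before packing, as the preceding paragraph describes.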
The service could send those shards to the client >>>>>>> as a >>>>>>> list of manifests to read and the client could send the shard >>>>>>> information >>>>>>> back to the service to get the data files in each shard (along with the >>>>>>> original filter). >>>>>>> >>>>>>> >>>>>>> >>>>>>> There's a slight trade-off that the protocol needs to define how to >>>>>>> break the work into shards. I'm interested in hearing if that would work >>>>>>> with how you were planning on building the service on your end. Another >>>>>>> option is to let the service send back arbitrary JSON that would get >>>>>>> returned for each shard. Either way, I like that this would make it so >>>>>>> the >>>>>>> service doesn't need to persist anything. We could also make it so that >>>>>>> small tables don't require multiple requests. For example, a client >>>>>>> could >>>>>>> call the route to get file tasks with just a filter. >>>>>>> >>>>>>> >>>>>>> >>>>>>> What do you think? >>>>>>> >>>>>>> >>>>>>> >>>>>>> Ryan >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Dec 8, 2023 at 10:41 AM Chertara, Rahil >>>>>>> <rcher...@amazon.com.invalid> wrote: >>>>>>> >>>>>>> Hi all, >>>>>>> >>>>>>> My name is Rahil Chertara, and I’m a part of the Iceberg team at >>>>>>> Amazon EMR and Athena. I’m reaching out to share a proposal for a new >>>>>>> Scan >>>>>>> API that will be utilized by the RESTCatalog. The process for table scan >>>>>>> planning is currently done within client engines such as Apache Spark. >>>>>>> By >>>>>>> moving scan functionality to the RESTCatalog, we can integrate Iceberg >>>>>>> table scans with external services, which can lead to several benefits. >>>>>>> >>>>>>> For example, we can leverage caching and indexes on the server side >>>>>>> to improve planning performance. Furthermore, by moving this scan logic >>>>>>> to >>>>>>> the RESTCatalog, non-JVM engines can integrate more easily. This all >>>>>>> can be >>>>>>> found in the detailed proposal below. 
Feel free to comment, and add your >>>>>>> suggestions . >>>>>>> >>>>>>> Detailed proposal: >>>>>>> https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h >>>>>>> >>>>>>> Github POC: https://github.com/apache/iceberg/pull/9252 >>>>>>> >>>>>>> Regards, >>>>>>> >>>>>>> Rahil Chertara >>>>>>> Amazon EMR & Athena >>>>>>> rcher...@amazon.com >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> Ryan Blue >>>>>>> >>>>>>> Tabular >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> Ryan Blue >>>>>>> >>>>>>> Tabular >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> Ryan Blue >>>>>>> >>>>>>> Tabular >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> Ryan Blue >>>>>>> >>>>>>> Tabular >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> Ryan Blue >>>>>>> >>>>>>> Tabular >>>>>>> >>>>>>> >>>>>> >>>>>> -- >>>>>> Ryan Blue >>>>>> Tabular >>>>>> >>>>> >> >> -- >> Ryan Blue >> Tabular >> > -- Ryan Blue Tabular