Looks good to me! Should we get a PR up to add it to the OpenAPI spec?

On Wed, Jan 31, 2024 at 10:16 AM Jack Ye <yezhao...@gmail.com> wrote:
> Sounds good. I don't really have any strong opinions here. So it looks like
> we are landing on this?
>
> *PreplanTable: POST /v1/namespaces/ns/tables/t/preplan*
> { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"] }
>
> { "plan-tasks": [ { ... }, { ... } ] } // opaque object
>
> *PlanTable w/o a plan task: POST /v1/namespaces/ns/tables/t/plan*
> { "filter": {"type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"] }
>
> { "file-scan-tasks": [ { ... }, { ... } ] } // FileScanTask OpenAPI model
>
> *PlanTable w/ a plan task: POST /v1/namespaces/ns/tables/t/plan*
> { "filter": {"type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"], "plan-task": { ... } }
>
> { "file-scan-tasks": [ { ... }, { ... } ] }
>
> -Jack
>
> On Wed, Jan 31, 2024 at 10:08 AM Ryan Blue <b...@tabular.io> wrote:
>
>> I agree with Dan. I'd rather have two endpoints instead of needing an
>> option that changes the behavior entirely in the same route. I don't think
>> that a `preplan` route would be too bad.
>>
>> On Wed, Jan 31, 2024 at 9:51 AM Daniel Weeks <dwe...@apache.org> wrote:
>>
>>> I agree with the opaque tokens.
>>>
>>> However, I'm concerned we're overloading the endpoint to perform two
>>> distinctly different operations: distribute a plan and scan a plan.
>>>
>>> Changing the task-type then changes the behavior and the result. I feel
>>> it would be more straightforward to separate the distribute and scan
>>> endpoints. Then clients can call the scan directly if they do not know how
>>> to distribute, and the behavior is clear from the REST specification.
>>>
>>> -Dan
>>>
>>> On Tue, Jan 30, 2024 at 9:09 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> +1 for having the opaque plan tasks, that's probably the most flexible
>>>> way forward. And let's call them *plan tasks* going forward to
>>>> standardize the terminology.
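If the shape above is what lands in the spec, the two-endpoint flow could be sketched roughly as below. This is illustrative only: the field names come from the JSON examples in the thread, and the HTTP transport is omitted entirely, so the functions only build the request bodies a client would POST.

```python
# Sketch of the request bodies for the converged PreplanTable / PlanTable
# endpoints discussed above. Not the final OpenAPI spec.

def preplan_request(filter_expr, select):
    # Body for POST /v1/namespaces/{ns}/tables/{t}/preplan
    return {"filter": filter_expr, "select": select}

def plan_request(filter_expr, select, plan_task=None):
    # Body for POST /v1/namespaces/{ns}/tables/{t}/plan.
    # Without a plan-task the service plans the whole scan; with one, it
    # resolves a single opaque plan task into file scan tasks.
    body = {"filter": filter_expr, "select": select}
    if plan_task is not None:
        body["plan-task"] = plan_task  # opaque JSON, sent back as-is
    return body

in_filter = {"type": "in", "term": "x", "values": [1, 2, 3]}
print(preplan_request(in_filter, ["x", "a.b"]))
print(plan_request(in_filter, ["x", "a.b"], plan_task={"token": "..."}))
```

Note that the opacity of the plan task shows up naturally here: the client never inspects `plan_task`, it only echoes it back.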
>>>> I think the name of the APIs can be determined based on the actual API
>>>> shape. For example, if we centralize these 2 plan and pre-plan actions to a
>>>> single API endpoint but just request different task types:
>>>>
>>>> *pre-plan: POST /v1/namespaces/ns/tables/t/plan*
>>>> { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"], "task-type": "plan-task" }
>>>>
>>>> { "plan-tasks": [ { ... }, { ... } ] }
>>>>
>>>> *plan without a plan-task: POST /v1/namespaces/ns/tables/t/plan*
>>>> { "filter": {"type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"], "task-type": "file-scan-task" } // file-scan-task should be the default type
>>>>
>>>> { "file-scan-tasks": [ { ... }, { ... } ] }
>>>>
>>>> *plan with a plan-task: POST /v1/namespaces/ns/tables/t/plan*
>>>> { "filter": {"type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"], "task-type": "file-scan-task", "plan-task": { ... } }
>>>>
>>>> { "file-scan-tasks": [...] }
>>>>
>>>> In this model, we just have a single API, and we can call it something
>>>> like PlanTable or PlanTableScan.
>>>>
>>>> What do you think?
>>>>
>>>> -Jack
>>>>
>>>> On Mon, Jan 29, 2024 at 6:17 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>>>>
>>>>> But to move forward, I think we should go with the option that
>>>>>> preserves flexibility. I think the spec should state that plan tasks (if we
>>>>>> call them that) are a JSON object that should be sent as-is back to the
>>>>>> REST service to be used.
>>>>>
>>>>> +1 for this.
>>>>>
>>>>> > One more thing that I would also change is that I don't think the
>>>>> "plan" and "scan" endpoints make much sense. We refer to the "scan" portion
>>>>> of this as "planFiles" in the reference implementation, and "scan" is used
>>>>> for actually reading data.
To be less confusing, I think that file scan
>>>>> tasks should be returned by a "plan" endpoint and the manifest plan tasks
>>>>> (or shards) should be returned by a "pre-plan" endpoint. Does anyone else
>>>>> like the names "pre-plan" and "plan" better?
>>>>>
>>>>> I agree that "scan" may be quite confusing since it's actually
>>>>> planning the file scan. Another option I can propose is: "plan" ->
>>>>> "plan-table-scan", "scan" -> "plan-file-scan"
>>>>>
>>>>> On Tue, Jan 30, 2024 at 9:03 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>>> As you noted, the main point we still need to decide on is whether to
>>>>>> have a standard "shard" definition (e.g. manifest plan task) or to allow it
>>>>>> to be opaque and specific to catalogs implementing the protocol. I've not
>>>>>> replied because I keep coming back to this decision and I'm not sure
>>>>>> whether the advantage is being clear about how it works (being explicit) or
>>>>>> allowing implementations to differ (opaque). I'm skeptical that there will
>>>>>> be other strategies.
>>>>>>
>>>>>> But to move forward, I think we should go with the option that
>>>>>> preserves flexibility. I think the spec should state that plan tasks (if we
>>>>>> call them that) are a JSON object that should be sent as-is back to the
>>>>>> REST service to be used.
>>>>>>
>>>>>> One more thing that I would also change is that I don't think the
>>>>>> "plan" and "scan" endpoints make much sense. We refer to the "scan" portion
>>>>>> of this as "planFiles" in the reference implementation, and "scan" is used
>>>>>> for actually reading data. To be less confusing, I think that file scan
>>>>>> tasks should be returned by a "plan" endpoint and the manifest plan tasks
>>>>>> (or shards) should be returned by a "pre-plan" endpoint. Does anyone else
>>>>>> like the names "pre-plan" and "plan" better?
>>>>>> Ryan
>>>>>>
>>>>>> On Mon, Jan 29, 2024 at 12:02 PM Chertara, Rahil
>>>>>> <rcher...@amazon.com.invalid> wrote:
>>>>>>
>>>>>>> Hi all, hope everyone is doing well.
>>>>>>>
>>>>>>> I wanted to revive the discussion around the REST table scan API work.
>>>>>>> For a refresher, here is the original proposal:
>>>>>>> https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h
>>>>>>> as well as the PR: https://github.com/apache/iceberg/pull/9252
>>>>>>>
>>>>>>> From the last messages on the thread, I believe Ryan and Jack were
>>>>>>> in favor of having two distinct API endpoints, /plan and /scan, as well as a
>>>>>>> stricter JSON definition for the "shard"; here is an example below from
>>>>>>> what was discussed.
>>>>>>>
>>>>>>> *POST /v1/namespaces/ns/tables/t/plan*
>>>>>>> { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"]}
>>>>>>>
>>>>>>> { "manifest-plan-tasks": [
>>>>>>>   { "start": 0, "length": 1000, "manifest": { "path": "s3://some/manifest.avro", ...}, "delete-manifests": [...] },
>>>>>>>   { ... }
>>>>>>> ]}
>>>>>>>
>>>>>>> *POST /v1/namespaces/ns/tables/t/scan*
>>>>>>> { "filter": {"type": "in", "term": "x", "values": [1, 2, 3] },
>>>>>>>   "select": ["x", "a.b"],
>>>>>>>   "manifest-plan-task": { "start": 0, "length": 1000, "manifest": { "path": "s3://some/manifest.avro", ...}, "delete-manifests": [...] } }
>>>>>>>
>>>>>>> { "file-scan-tasks": [...] }
>>>>>>>
>>>>>>> *POST /v1/namespaces/ns/tables/t/scan*
>>>>>>> { "filter": {"type": "in", "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"]}
>>>>>>>
>>>>>>> { "file-scan-tasks": [...
} >>>>>>> >>>>>>> >>>>>>> >>>>>>> However IIRC Micah and Renjie had some concerns around this >>>>>>> stricter structure as this can make it harder to evolve in the future, >>>>>>> as >>>>>>> well as some potential scalability challenges for larger tables that >>>>>>> have >>>>>>> many manifest files. (Feel free to expand further on the concerns if my >>>>>>> understanding is incorrect). >>>>>>> >>>>>>> >>>>>>> >>>>>>> Would appreciate if the community can leave any more >>>>>>> thoughts/feedback on this thread, as well as on the google doc, and the >>>>>>> PR. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Regards, >>>>>>> Rahil Chertara >>>>>>> >>>>>>> >>>>>>> >>>>>>> *From: *Renjie Liu <liurenjie2...@gmail.com> >>>>>>> *Reply-To: *"dev@iceberg.apache.org" <dev@iceberg.apache.org> >>>>>>> *Date: *Thursday, December 21, 2023 at 10:35 PM >>>>>>> *To: *"dev@iceberg.apache.org" <dev@iceberg.apache.org> >>>>>>> *Subject: *RE: [EXTERNAL] Proposal for REST APIs for Iceberg table >>>>>>> scans >>>>>>> >>>>>>> >>>>>>> >>>>>>> *CAUTION*: This email originated from outside of the organization. >>>>>>> Do not click links or open attachments unless you can confirm the sender >>>>>>> and know the content is safe. >>>>>>> >>>>>>> >>>>>>> >>>>>>> I share the same concern with Micah. The shard detail should be >>>>>>> implementation details of the server, rather than exposing directly to >>>>>>> the >>>>>>> client. If the goal is to make things stateless, we just need to attach >>>>>>> a >>>>>>> snapshot id + shard id, then a determined algorithm is supposed to give >>>>>>> the >>>>>>> same result. Also another concern is for huge analytics tables, we may >>>>>>> have >>>>>>> a lot of manifest files, which may lead to large traffic from the rest >>>>>>> server. >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, Dec 21, 2023 at 7:41 AM Micah Kornfield < >>>>>>> emkornfi...@gmail.com> wrote: >>>>>>> >>>>>>> Also +1 for having a more strict definition of the shard. 
Having >>>>>>> arbitrary JSON was basically what we experimented with a string shard >>>>>>> ID, >>>>>>> and we ended up with something very similar to the manifest plan task >>>>>>> you >>>>>>> describe in the serialized ID string. >>>>>>> >>>>>>> >>>>>>> >>>>>>> IIUC the proposal correctly, I'd actually be -0.0 on the stricter >>>>>>> structure. I think forcing a contract where it isn't strictly necessary >>>>>>> makes it harder to evolve the system in the future. For example it >>>>>>> makes >>>>>>> it harder to address potential scalability problems in a transparent way >>>>>>> (e.g. extreme edge cases in cardinality between manifest files and >>>>>>> delete >>>>>>> files). >>>>>>> >>>>>>> >>>>>>> >>>>>>> It also seems like it might overly constrain implementations (it is >>>>>>> not clear we should need to compute the mapping between delete file >>>>>>> manifests to data file manifests up front to start planning). >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Tue, Dec 19, 2023 at 2:10 PM Jack Ye <yezhao...@gmail.com> wrote: >>>>>>> >>>>>>> +1 for having /plan and /scan, sounds like a good idea to separate >>>>>>> those 2 distinct actions. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Also +1 for having a more strict definition of the shard. Having >>>>>>> arbitrary JSON was basically what we experimented with a string shard >>>>>>> ID, >>>>>>> and we ended up with something very similar to the manifest plan task >>>>>>> you >>>>>>> describe in the serialized ID string. >>>>>>> >>>>>>> >>>>>>> >>>>>>> So sounds like we are converging to the following APIs: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> *POST /v1/namespaces/ns/tables/t/plan *{ "filter": { "type": "in", >>>>>>> "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"]} >>>>>>> >>>>>>> { "manifest-plan-tasks": [ >>>>>>> { "start": 0, "length": 1000, "manifest": { "path": >>>>>>> "s3://some/manifest.avro", ...}, "delete-manifests": [...] }, >>>>>>> { ... 
} >>>>>>> ]} >>>>>>> >>>>>>> >>>>>>> *POST /v1/namespaces/ns/tables/t/scan *{ "filter": {"type": "in", >>>>>>> "term": "x", "values": [1, 2, 3] }, >>>>>>> >>>>>>> "select": ["x", "a.b"], >>>>>>> "manifest-plan-task": { "start": 0, "length": 1000, "manifest": { >>>>>>> "path": "s3://some/manifest.avro", ...}, "delete-manifests": [...] } } >>>>>>> >>>>>>> { "file-scan-tasks": [...] } >>>>>>> >>>>>>> >>>>>>> *POST /v1/namespaces/ns/tables/t/scan *{ "filter": {"type": "in", >>>>>>> "term": "x", "values": [1, 2, 3] }, "select": ["x", "a.b"]} >>>>>>> >>>>>>> >>>>>>> { "file-scan-tasks": [...] } >>>>>>> >>>>>>> >>>>>>> >>>>>>> If this sounds good overall, we can update the prototype to have >>>>>>> more detailed discussions in code. >>>>>>> >>>>>>> >>>>>>> >>>>>>> -Jack >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, Dec 14, 2023 at 6:10 PM Ryan Blue <b...@tabular.io> wrote: >>>>>>> >>>>>>> The tasks might look something like this: >>>>>>> >>>>>>> >>>>>>> >>>>>>> CombinedPlanTask >>>>>>> >>>>>>> - List<ManifestPlanTask> >>>>>>> >>>>>>> >>>>>>> >>>>>>> ManifestPlanTask >>>>>>> >>>>>>> - int start >>>>>>> >>>>>>> - int length >>>>>>> >>>>>>> - ManifestFile dataManifest >>>>>>> >>>>>>> - List<ManifestFile> deleteManifests >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, Dec 14, 2023 at 4:07 PM Ryan Blue <b...@tabular.io> wrote: >>>>>>> >>>>>>> Seems like that track has expired (This Internet-Draft will expire >>>>>>> on 13 May 2022) >>>>>>> >>>>>>> Yeah, looks like we should just use POST. That’s too bad. QUERY >>>>>>> seems like a good idea to me. >>>>>>> >>>>>>> Distinguish planning using shard or not >>>>>>> >>>>>>> I think this was a mistake on my part. I was still thinking that we >>>>>>> would have a different endpoint for first-level planning to produce >>>>>>> shards >>>>>>> and the route to actually get files. 
Since both are POST requests with >>>>>>> the >>>>>>> same path (/v1/namespaces/ns/tables/t/scans) that no longer works. >>>>>>> What about /v1/namespaces/ns/tables/t/scan and >>>>>>> /v1/namespaces/ns/tables/t/plan? The latter could use some variant >>>>>>> of planFiles since that’s what we are wrapping in the Java API. >>>>>>> >>>>>>> Necessity of scan ID >>>>>>> >>>>>>> Yes, I agree. If you have shard IDs then you don’t really need a >>>>>>> scan ID. You could always have one internally but send it as part of the >>>>>>> shard ID. >>>>>>> >>>>>>> Shape of shard payload >>>>>>> >>>>>>> I think we have 2 general options depending on how strict we want to >>>>>>> be. >>>>>>> >>>>>>> 1. Require a standard shard definition >>>>>>> 2. Allow arbitrary JSON and leave it to the service >>>>>>> >>>>>>> I lean toward the first option, which would be a data manifest and >>>>>>> the associated delete manifests for the partition. We could also extend >>>>>>> that to a group of manifests, each with a list of delete manifests. And >>>>>>> we >>>>>>> could also allow splitting to ensure tasks don’t get too large with big >>>>>>> files. This all looks basically like FileScanTask, but with manifests >>>>>>> and >>>>>>> delete manifests. >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Dec 13, 2023 at 4:39 PM Jack Ye <yezhao...@gmail.com> wrote: >>>>>>> >>>>>>> Seems like that track has expired (This Internet-Draft will expire >>>>>>> on 13 May 2022), not sure how these RFCs are managed, but it does not >>>>>>> seem >>>>>>> hopeful to have this verb in. I think people are mostly using POST for >>>>>>> this >>>>>>> use case already. >>>>>>> >>>>>>> >>>>>>> >>>>>>> But overall I think we are in agreement with the general direction. 
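The "standard shard definition" option here matches the ManifestPlanTask listing Ryan gave further up the thread. Rendered as simple data models it might look like this sketch; the Python dataclass form and the reduction of ManifestFile to a bare path are my simplifications for illustration, not the proposed spec model.

```python
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class ManifestPlanTask:
    # A byte range within one data manifest, plus the delete manifests
    # that may apply to rows produced from that range.
    start: int
    length: int
    manifest: str  # data manifest path (ManifestFile simplified to a path)
    delete_manifests: List[str] = field(default_factory=list)

@dataclass
class CombinedPlanTask:
    # A group of manifest plan tasks that one worker can process together.
    tasks: List[ManifestPlanTask] = field(default_factory=list)

task = ManifestPlanTask(0, 1000, "s3://some/manifest.avro",
                        ["s3://some/deletes.avro"])
print(asdict(task))
```

Serializing via `asdict` shows why this "all looks basically like FileScanTask, but with manifests": it is the same range-plus-deletes shape, one level up.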
>>>>>>> A few detail discussions: >>>>>>> >>>>>>> >>>>>>> >>>>>>> *Distinguish planning using shard or not* >>>>>>> >>>>>>> Maybe we should add a query parameter like *distributed=true* to >>>>>>> distinguish your first and third case, since they are now sharing the >>>>>>> same >>>>>>> signature. If the requester wants to use distributed planning, then some >>>>>>> sharding strategy is provided as a response for the requester to send >>>>>>> more >>>>>>> detailed requests. >>>>>>> >>>>>>> *Necessity of scan ID* >>>>>>> In this approach, is scan ID still required? Because the shard >>>>>>> payload already fully describes the information to retrieve, it seems >>>>>>> like >>>>>>> we can just drop the *scan-id* query parameter in the second case. >>>>>>> Seems like it's kept for the case if we still want to persist some >>>>>>> state, >>>>>>> but it seems like we can make a stateless style fully working. >>>>>>> >>>>>>> *Shape of shard payload* >>>>>>> What do you think is necessary information of the shard payload? It >>>>>>> seems like we need at least the location of the manifests, plus the >>>>>>> delete >>>>>>> manifests or delete files associated with the manifests. I like the >>>>>>> idea of >>>>>>> making it a "shard task" that is similar to a file scan task, and it >>>>>>> might >>>>>>> allow us to return a mixture of both types of tasks, so we can have >>>>>>> better >>>>>>> control of the response size. >>>>>>> >>>>>>> -Jack >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Dec 13, 2023 at 3:50 PM Ryan Blue <b...@tabular.io> wrote: >>>>>>> >>>>>>> I just changed it to POST after looking into support for the QUERY >>>>>>> method. It's a new HTTP method for cases like this where you don't want >>>>>>> to >>>>>>> pass everything through query params. Here's the QUERY method RFC >>>>>>> <https://www.ietf.org/archive/id/draft-ietf-httpbis-safe-method-w-body-02.html>, >>>>>>> but I guess it isn't finalized yet? 
>>>>>>> >>>>>>> >>>>>>> >>>>>>> Just read them like you would a POST request that doesn't actually >>>>>>> create anything. >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Dec 13, 2023 at 3:45 PM Jack Ye <yezhao...@gmail.com> wrote: >>>>>>> >>>>>>> Thanks, the Gist explains a lot of things. This is actually very >>>>>>> close to our way of implementing the shard ID, we were defining the >>>>>>> shard >>>>>>> ID as a string, and the string content is actually something similar to >>>>>>> the >>>>>>> information of the JSON payload you showed, so we can persist minimum >>>>>>> information in storage. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Just one clarification needed for your Gist: >>>>>>> >>>>>>> >>>>>>> >>>>>>> > QUERY /v1/namespaces/ns/tables/t/scans?scan-id=1 >>>>>>> >>>>>>> >>>>>>> >>>>>>> > { "shard": { "id": 1, "manifests": ["C"] }, "filter": {"type": >>>>>>> "in", "term": "x", "values": [1, 2, 3] } } >>>>>>> >>>>>>> >>>>>>> >>>>>>> > >>>>>>> >>>>>>> > { "file-scan-tasks": [...] } >>>>>>> >>>>>>> >>>>>>> >>>>>>> Here, what does this QUERY verb mean? Is that a GET? If it's GET, we >>>>>>> cannot have a request body. That's actually why we expressed that as an >>>>>>> ID >>>>>>> string, since we can put it as a query parameter. >>>>>>> >>>>>>> >>>>>>> >>>>>>> -Jack >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Dec 13, 2023 at 3:25 PM Ryan Blue <b...@tabular.io> wrote: >>>>>>> >>>>>>> Jack, >>>>>>> >>>>>>> It sounds like what I’m proposing isn’t quite clear because your >>>>>>> initial response was arguing for a sharding capability. I agree that >>>>>>> sharding is a good idea. I’m less confident about two points: >>>>>>> >>>>>>> 1. Requiring that the service is stateful. As Renjie pointed >>>>>>> out, that makes it harder to scale the service. >>>>>>> 2. 
The need for both pagination *and* sharding as separate >>>>>>> things >>>>>>> >>>>>>> And I also think that Fokko has a good point about trying to keep >>>>>>> things simple and not requiring the CreateScan endpoint. >>>>>>> >>>>>>> For the first point, I’m proposing that we still have a CreateScan >>>>>>> endpoint, but instead of sending only a list of shard IDs it can also >>>>>>> send >>>>>>> either a standard shard “task” or an optional JSON definition. Let’s >>>>>>> assume >>>>>>> we can send arbitrary JSON for an example. Say I have a table with 4 >>>>>>> manifests, A through D and that C and D match some query filter. >>>>>>> When I call the CreateScan endpoint, the service would send back >>>>>>> tasks with that information: {"id": 1, "manifests": ["C"]}, {"id": >>>>>>> 2, "manifests": ["D"]}. By sending what the shards mean (the >>>>>>> manifests to read), my service can be stateless: any node can get a >>>>>>> request >>>>>>> for shard 1, read manifest C, and send back the resulting data >>>>>>> files. >>>>>>> >>>>>>> I don’t see much of an argument against doing this *in principle*. >>>>>>> It gives you the flexibility to store state if you choose or to send >>>>>>> state >>>>>>> to the client for it to pass back when calling the GetTasks >>>>>>> endpoint. There is a practical problem, which is that it’s annoying to >>>>>>> send >>>>>>> a GET request with a JSON payload because you can’t send a request body. >>>>>>> It’s probably obvious, but I’m also not a REST purist so I’d be fine >>>>>>> using >>>>>>> POST or QUERY for this. It would look something like this Gist >>>>>>> <https://gist.github.com/rdblue/d2b65bd2ad20f85ee9d04ccf19ac8aba>. >>>>>>> >>>>>>> In your last reply, you also asked whether a stateless service is a >>>>>>> goal. I don’t think that it is, but if we can make simple changes to the >>>>>>> spec to allow more flexibility on the server side, I think that’s a good >>>>>>> direction. 
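Ryan's A-through-D example can be sketched end to end. The manifest contents below are made up, and a real service would derive the matching manifests from the filter; the point is only the control flow: because each shard names its manifests, any stateless node can resolve it.

```python
# Stateless sharding sketch following the A-D example above: CreateScan
# returns shards that carry everything needed to resolve them later.

MANIFESTS = {"A": ["a1.parquet"], "B": ["b1.parquet"],
             "C": ["c1.parquet", "c2.parquet"], "D": ["d1.parquet"]}

def create_scan(matching_manifests):
    # One shard per matching manifest; the shard body IS the state.
    return [{"id": i + 1, "manifests": [m]}
            for i, m in enumerate(matching_manifests)]

def get_tasks(shard):
    # Any node can serve this request: the shard names its manifests,
    # so no server-side lookup of shard state is needed.
    return [f for m in shard["manifests"] for f in MANIFESTS[m]]

shards = create_scan(["C", "D"])  # the filter matched C and D
print(shards)       # [{'id': 1, 'manifests': ['C']}, {'id': 2, 'manifests': ['D']}]
print(get_tasks(shards[0]))      # ['c1.parquet', 'c2.parquet']
```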
You also asked about a reference implementation and I >>>>>>> consider >>>>>>> CatalogHandlers >>>>>>> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java> >>>>>>> to be that reference. It does everything except for the work done by >>>>>>> your >>>>>>> choice of web application framework. It isn’t stateless, but it only >>>>>>> relies >>>>>>> on a Catalog implementation for persistence. >>>>>>> >>>>>>> For the second point, I don’t understand why we need both sharding >>>>>>> and pagination. That is, if we have a protocol that allows sharding, >>>>>>> why is >>>>>>> pagination also needed? From my naive perspective on how sharding would >>>>>>> work, we should be able to use metadata from the manifest list to limit >>>>>>> the >>>>>>> potential number of data files in a given shard. As long as we can limit >>>>>>> the size of a shard to produce more, pagination seems like unnecessary >>>>>>> complication. >>>>>>> >>>>>>> Lastly, for Fokko’s point, I think another easy extension to the >>>>>>> proposal is to support a direct call to GetTasks. There’s a >>>>>>> trade-off here, but if you’re already sending the original filter along >>>>>>> with the request (in order to filter records from manifest C for >>>>>>> instance) then the request is already something the protocol can >>>>>>> express. >>>>>>> There’s an objection concerning resource consumption on the service and >>>>>>> creating responses that are too large or take too long, but we can get >>>>>>> around that by responding with a code that instructs the client to use >>>>>>> the >>>>>>> CreateScan API like 413 (Payload too large). I think that would >>>>>>> allow simple clients to function for all but really large tables. The >>>>>>> gist >>>>>>> above also shows what this might look like. 
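The 413 fallback described in the last paragraph might look like the following sketch. The paths and response shapes follow the thread's examples; the transport is a stand-in callable rather than a real HTTP client, and `fake_post` is a hypothetical service for demonstration.

```python
# Sketch of the fallback idea: a simple client asks for file scan tasks
# directly and only switches to distributed planning when the service
# answers 413 (Payload Too Large).

def scan_with_fallback(post, filter_expr):
    status, body = post("/scan", {"filter": filter_expr})
    if status == 413:
        # Table too large for one response: pre-plan, then resolve each
        # shard with its own request.
        _, plan = post("/plan", {"filter": filter_expr})
        tasks = []
        for shard in plan["manifest-plan-tasks"]:
            _, part = post("/scan", {"filter": filter_expr,
                                     "manifest-plan-task": shard})
            tasks.extend(part["file-scan-tasks"])
        return tasks
    return body["file-scan-tasks"]

def fake_post(path, body):
    # Hypothetical service: too large for direct scans, two shards.
    if path == "/plan":
        return 200, {"manifest-plan-tasks": [{"id": 1}, {"id": 2}]}
    if "manifest-plan-task" in body:
        return 200, {"file-scan-tasks": [f"task-{body['manifest-plan-task']['id']}"]}
    return 413, None

print(scan_with_fallback(fake_post, {"type": "in"}))  # ['task-1', 'task-2']
```

The appeal of this shape is that small clients never need to learn the pre-plan protocol until a service actually refuses a direct scan.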
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Wed, Dec 13, 2023 at 11:53 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>
>>>>>>> The current proposal definitely makes the server stateful. In our
>>>>>>> prototype we used other components like DynamoDB to keep track of state.
>>>>>>> If keeping it stateless is a tenet, we can definitely make the proposal
>>>>>>> closer to that direction. Maybe one thing to make sure of is: is this a
>>>>>>> core tenet of the REST spec? Today we do not even have an official
>>>>>>> reference implementation of the REST server, so I feel it is hard to say
>>>>>>> what the core tenets are. Maybe we should create one?
>>>>>>>
>>>>>>> Pagination is a common issue in the REST spec. We also see similar
>>>>>>> limitations with other APIs like GetTables and GetNamespaces. When a
>>>>>>> catalog has many namespaces and tables it suffers from the same issue. It
>>>>>>> is also not ideal for use cases like web browsers, since typically you
>>>>>>> display a small page of results and do not need the full list immediately.
>>>>>>> So I feel we cannot really avoid some state being kept for those use cases.
>>>>>>>
>>>>>>> A chunked response might be a good way to work around it. We also
>>>>>>> thought about using HTTP/2. However, these options seem to be not very
>>>>>>> compatible with OpenAPI. We can do some further research in this domain;
>>>>>>> we would really appreciate it if anyone with more insight and experience
>>>>>>> with OpenAPI can provide some suggestions.
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>> On Tue, Dec 12, 2023 at 6:21 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi, Rahil and Jack:
>>>>>>>
>>>>>>> Thanks for raising this.
>>>>>>>
>>>>>>> My question is that the pagination and sharding will make the REST
>>>>>>> server stateful, e.g.
a sequence of calls is required to go to the same
>>>>>>> server. In this case, how do we ensure the scalability of the REST server?
>>>>>>>
>>>>>>> On Wed, Dec 13, 2023 at 4:09 AM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>>
>>>>>>> Hey Rahil and Jack,
>>>>>>>
>>>>>>> Thanks for bringing this up. Ryan and I also discussed this briefly
>>>>>>> in the early days of PyIceberg and it would have helped a lot in the speed
>>>>>>> of development. We went for the traditional approach because that would
>>>>>>> also support all the other catalogs, but now that the REST catalog is
>>>>>>> taking off, I think it still makes a lot of sense to get it in.
>>>>>>>
>>>>>>> I do share the concern raised by Ryan around the concepts of shards
>>>>>>> and pagination. For PyIceberg (but also for Go, Rust, and DuckDB), which
>>>>>>> are living in a single process today, the concept of shards doesn't add
>>>>>>> value. I see your concern with long-running jobs, but for the
>>>>>>> non-distributed cases, it will add additional complexity.
>>>>>>>
>>>>>>> Some suggestions that come to mind:
>>>>>>>
>>>>>>> - Stream the tasks directly back using a chunked response, reducing
>>>>>>>   the latency to the first task. This would also solve things with the
>>>>>>>   pagination. The only downside I can think of is having delete files,
>>>>>>>   where you first need to make sure there are deletes relevant to the
>>>>>>>   task; this might increase latency to the first task.
>>>>>>> - Making the sharding optional. If you want to shard, you call the
>>>>>>>   CreateScan first and then call the GetScanTask with the IDs. If you
>>>>>>>   don't want to shard, you omit the shard parameter and fetch the tasks
>>>>>>>   directly (here we also need to replace the scan string with the full
>>>>>>>   column/expression/snapshot-id etc).
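The chunked-response suggestion could be sketched as newline-delimited JSON: the server emits one task per line as planning produces it, and the client decodes each line as it arrives. This is only one plausible wire format, and the task source here is a plain iterable standing in for lazy planning.

```python
import json

# Streaming-tasks sketch: one file scan task per NDJSON line, so the
# client sees its first task before planning finishes.

def stream_tasks(tasks):
    # Server side: yield a chunk per task as planning produces it.
    for t in tasks:
        yield json.dumps(t) + "\n"

def read_tasks(chunks):
    # Client side: decode each line as soon as it arrives.
    for line in chunks:
        yield json.loads(line)

first = next(read_tasks(stream_tasks([{"file": "a.parquet"},
                                      {"file": "b.parquet"}])))
print(first)  # {'file': 'a.parquet'}
```

The latency benefit comes from the generators never materializing the full plan; the delete-file caveat Fokko raises would show up as a delay before the first `yield`.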
>>>>>>> >>>>>>> Looking forward to discussing this tomorrow in the community sync >>>>>>> <https://iceberg.apache.org/community/#iceberg-community-events>! >>>>>>> >>>>>>> >>>>>>> >>>>>>> Kind regards, >>>>>>> >>>>>>> Fokko >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> Op ma 11 dec 2023 om 19:05 schreef Jack Ye <yezhao...@gmail.com>: >>>>>>> >>>>>>> Hi Ryan, thanks for the feedback! >>>>>>> >>>>>>> >>>>>>> >>>>>>> I was a part of this design discussion internally and can provide >>>>>>> more details. One reason for separating the CreateScan operation was to >>>>>>> make the API asynchronous and thus keep HTTP communications short. >>>>>>> Consider >>>>>>> the case where we only have GetScanTasks API, and there is no shard >>>>>>> specified. It might take tens of seconds, or even minutes to read >>>>>>> through >>>>>>> all the manifest list and manifests before being able to return >>>>>>> anything. >>>>>>> This means the HTTP connection has to remain open during that period, >>>>>>> which >>>>>>> is not really a good practice in general (consider connection failure, >>>>>>> load >>>>>>> balancer and proxy load, etc.). And when we shift the API to >>>>>>> asynchronous, >>>>>>> it basically becomes something like the proposal, where a stateful ID is >>>>>>> generated to be able to immediately return back to the client, and the >>>>>>> client get results by referencing the ID. So in our current prototype >>>>>>> implementation we are actually keeping this ID and the whole REST >>>>>>> service >>>>>>> is stateful. >>>>>>> >>>>>>> >>>>>>> >>>>>>> There were some thoughts we had about the possibility to define a >>>>>>> "shard ID generator" protocol: basically the client agrees with the >>>>>>> service >>>>>>> a way to deterministically generate shard IDs, and service uses it to >>>>>>> create shards. That sounds like what you are suggesting here, and it >>>>>>> pushes >>>>>>> the responsibility to the client side to determine the parallelism. 
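The asynchronous pattern Jack describes (return an ID immediately so the HTTP call stays short, then fetch results by ID) can be sketched with an in-memory store standing in for the stateful component (e.g. the DynamoDB table mentioned above). The class and method names here are illustrative, not the prototype's actual API.

```python
import itertools

# Async CreateScan sketch: the create call returns at once with an ID;
# the client polls GetScanTasks until planning completes.

class AsyncScanService:
    def __init__(self):
        self._ids = itertools.count(1)
        self._scans = {}  # stand-in for a persistent state store

    def create_scan(self, filter_expr):
        # Return immediately; planning would continue in the background.
        scan_id = next(self._ids)
        self._scans[scan_id] = {"status": "planning", "tasks": []}
        return scan_id

    def finish_planning(self, scan_id, tasks):
        # Invoked by the background planner when it is done.
        self._scans[scan_id] = {"status": "done", "tasks": tasks}

    def get_scan_tasks(self, scan_id):
        # Client polls this until status is "done".
        return self._scans[scan_id]

svc = AsyncScanService()
sid = svc.create_scan({"type": "in", "term": "x", "values": [1, 2, 3]})
print(svc.get_scan_tasks(sid)["status"])   # planning
svc.finish_planning(sid, [{"file": "a.parquet"}])
print(svc.get_scan_tasks(sid)["status"])   # done
```

This makes the trade-off in the thread concrete: the connection stays short, but the ID forces the service (or a shared store) to hold state between the two calls.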
But >>>>>>> in >>>>>>> some bad cases (e.g. there are many delete files and we need to read all >>>>>>> those in each shard to apply filters), it seems like there might still >>>>>>> be >>>>>>> the long open connection issue above. What is your thought on that? >>>>>>> >>>>>>> >>>>>>> >>>>>>> -Jack >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Sun, Dec 10, 2023 at 10:27 AM Ryan Blue <b...@tabular.io> wrote: >>>>>>> >>>>>>> Rahil, thanks for working on this. It has some really good ideas >>>>>>> that we hadn't considered before like a way for the service to plan how >>>>>>> to >>>>>>> break up the work of scan planning. I really like that idea because it >>>>>>> makes it much easier for the service to keep memory consumption low >>>>>>> across >>>>>>> requests. >>>>>>> >>>>>>> >>>>>>> >>>>>>> My primary feedback is that I think it's a little too complicated >>>>>>> (with both sharding and pagination) and could be modified slightly so >>>>>>> that >>>>>>> the service doesn't need to be stateful. If the service isn't >>>>>>> necessarily >>>>>>> stateful then it should be easier to build implementations. >>>>>>> >>>>>>> >>>>>>> >>>>>>> To make it possible for the service to be stateless, I'm proposing >>>>>>> that rather than creating shard IDs that are tracked by the service, the >>>>>>> information for a shard can be sent to the client. My assumption here is >>>>>>> that most implementations would create shards by reading the manifest >>>>>>> list, >>>>>>> filtering on partition ranges, and creating a shard for some reasonable >>>>>>> size of manifest content. For example, if a table has 100MB of metadata >>>>>>> in >>>>>>> 25 manifests that are about 4 MB each, then it might create 9 shards >>>>>>> with >>>>>>> 1-4 manifests each. 
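The 100 MB / 25-manifest arithmetic above suggests a simple grouping heuristic: pack manifests greedily until a byte budget is reached. This is one plausible strategy, not anything the proposal specifies; sizes and the budget below are illustrative.

```python
# Shard-sizing sketch: group manifests under a metadata byte budget so
# each shard covers a bounded amount of manifest content.

def plan_shards(manifest_sizes, budget_bytes):
    shards, current, used = [], [], 0
    for name, size in manifest_sizes:
        if current and used + size > budget_bytes:
            shards.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        shards.append(current)
    return shards

# 25 manifests of ~4 MB with a 12 MB budget -> 3 manifests per shard,
# i.e. 9 shards, matching the example in the thread.
manifests = [(f"m{i}", 4_000_000) for i in range(25)]
shards = plan_shards(manifests, 12_000_000)
print(len(shards))   # 9
print(shards[0])     # ['m0', 'm1', 'm2']
```

A real implementation would also filter on partition ranges from the manifest list before packing, as the preceding paragraph describes.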
The service could send those shards to the client >>>>>>> as a >>>>>>> list of manifests to read and the client could send the shard >>>>>>> information >>>>>>> back to the service to get the data files in each shard (along with the >>>>>>> original filter). >>>>>>> >>>>>>> >>>>>>> >>>>>>> There's a slight trade-off that the protocol needs to define how to >>>>>>> break the work into shards. I'm interested in hearing if that would work >>>>>>> with how you were planning on building the service on your end. Another >>>>>>> option is to let the service send back arbitrary JSON that would get >>>>>>> returned for each shard. Either way, I like that this would make it so >>>>>>> the >>>>>>> service doesn't need to persist anything. We could also make it so that >>>>>>> small tables don't require multiple requests. For example, a client >>>>>>> could >>>>>>> call the route to get file tasks with just a filter. >>>>>>> >>>>>>> >>>>>>> >>>>>>> What do you think? >>>>>>> >>>>>>> >>>>>>> >>>>>>> Ryan >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Dec 8, 2023 at 10:41 AM Chertara, Rahil >>>>>>> <rcher...@amazon.com.invalid> wrote: >>>>>>> >>>>>>> Hi all, >>>>>>> >>>>>>> My name is Rahil Chertara, and I’m a part of the Iceberg team at >>>>>>> Amazon EMR and Athena. I’m reaching out to share a proposal for a new >>>>>>> Scan >>>>>>> API that will be utilized by the RESTCatalog. The process for table scan >>>>>>> planning is currently done within client engines such as Apache Spark. >>>>>>> By >>>>>>> moving scan functionality to the RESTCatalog, we can integrate Iceberg >>>>>>> table scans with external services, which can lead to several benefits. >>>>>>> >>>>>>> For example, we can leverage caching and indexes on the server side >>>>>>> to improve planning performance. Furthermore, by moving this scan logic >>>>>>> to >>>>>>> the RESTCatalog, non-JVM engines can integrate more easily. This all >>>>>>> can be >>>>>>> found in the detailed proposal below. 
Feel free to comment, and add your >>>>>>> suggestions . >>>>>>> >>>>>>> Detailed proposal: >>>>>>> https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h >>>>>>> >>>>>>> Github POC: https://github.com/apache/iceberg/pull/9252 >>>>>>> >>>>>>> Regards, >>>>>>> >>>>>>> Rahil Chertara >>>>>>> Amazon EMR & Athena >>>>>>> rcher...@amazon.com >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> Ryan Blue >>>>>>> >>>>>>> Tabular >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> Ryan Blue >>>>>>> >>>>>>> Tabular >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> Ryan Blue >>>>>>> >>>>>>> Tabular >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> Ryan Blue >>>>>>> >>>>>>> Tabular >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> Ryan Blue >>>>>>> >>>>>>> Tabular >>>>>>> >>>>>>> >>>>>> >>>>>> -- >>>>>> Ryan Blue >>>>>> Tabular >>>>>> >>>>> >> >> -- >> Ryan Blue >> Tabular >> > -- Ryan Blue Tabular