I share the same concern as Micah. Shard details should be an
implementation detail of the server rather than exposed directly to the
client. If the goal is to make things stateless, we just need to attach a
snapshot id + shard id, and a deterministic algorithm should then produce
the same result on any node. Another concern is that for huge analytics
tables we may have a lot of manifest files, which may lead to heavy
traffic from the REST server.
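
To illustrate the stateless idea, here is a minimal sketch (my own, not
from the proposal; the round-robin assignment and the shardCount parameter
are assumptions) of how any server node could resolve a shard
deterministically from a snapshot:

import java.util.ArrayList;
import java.util.List;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

// Any node that loads the same snapshot computes the same assignment,
// because manifest order in the manifest list is fixed per snapshot.
static List<ManifestFile> manifestsForShard(
    Table table, long snapshotId, int shardId, int shardCount) {
  Snapshot snapshot = table.snapshot(snapshotId);
  List<ManifestFile> manifests = snapshot.dataManifests(table.io());
  List<ManifestFile> assigned = new ArrayList<>();
  for (int i = 0; i < manifests.size(); i++) {
    if (i % shardCount == shardId) { // deterministic round-robin
      assigned.add(manifests.get(i));
    }
  }
  return assigned;
}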

On Thu, Dec 21, 2023 at 7:41 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

>> Also +1 for having a stricter definition of the shard. Allowing arbitrary
>> JSON is basically what we experimented with using a string shard ID, and
>> we ended up with something very similar to the manifest plan task you
>> describe, encoded in the serialized ID string.
>
>
> If I understand the proposal correctly, I'd actually be -0.0 on the
> stricter structure.  I think forcing a contract where it isn't strictly
> necessary makes it harder to evolve the system in the future.  For example,
> it makes it harder to address potential scalability problems in a
> transparent way (e.g. extreme edge cases in cardinality between manifest
> files and delete files).
>
> It also seems like it might overly constrain implementations (it is not
> clear that we need to compute the mapping from delete file manifests to
> data file manifests up front in order to start planning).
>
> On Tue, Dec 19, 2023 at 2:10 PM Jack Ye <yezhao...@gmail.com> wrote:
>
>> +1 for having /plan and /scan; it sounds like a good idea to separate
>> those two distinct actions.
>>
>> Also +1 for having a stricter definition of the shard. Allowing arbitrary
>> JSON is basically what we experimented with using a string shard ID, and
>> we ended up with something very similar to the manifest plan task you
>> describe, encoded in the serialized ID string.
>>
>> So sounds like we are converging to the following APIs:
>>
>>
>> *POST /v1/namespaces/ns/tables/t/plan*
>>
>> { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] },
>>   "select": ["x", "a.b"] }
>>
>> { "manifest-plan-tasks": [
>>   { "start": 0, "length": 1000,
>>     "manifest": { "path": "s3://some/manifest.avro", ... },
>>     "delete-manifests": [...] },
>>   { ... }
>> ] }
>>
>>
>> *POST /v1/namespaces/ns/tables/t/scan*
>>
>> { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] },
>>   "select": ["x", "a.b"],
>>   "manifest-plan-task": { "start": 0, "length": 1000,
>>     "manifest": { "path": "s3://some/manifest.avro", ... },
>>     "delete-manifests": [...] } }
>>
>> { "file-scan-tasks": [...] }
>>
>>
>> *POST /v1/namespaces/ns/tables/t/scan*
>>
>> { "filter": { "type": "in", "term": "x", "values": [1, 2, 3] },
>>   "select": ["x", "a.b"] }
>>
>> { "file-scan-tasks": [...] }
>>
>>
>> If this sounds good overall, we can update the prototype to have more
>> detailed discussions in code.
>>
>> -Jack
>>
>>
>>
>>
>>
>> On Thu, Dec 14, 2023 at 6:10 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> The tasks might look something like this:
>>>
>>> CombinedPlanTask
>>> - List<ManifestPlanTask>
>>>
>>> ManifestPlanTask
>>> - int start
>>> - int length
>>> - ManifestFile dataManifest
>>> - List<ManifestFile> deleteManifests
>>>
>>> On Thu, Dec 14, 2023 at 4:07 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> Seems like that draft has expired (This Internet-Draft will expire on
>>>> 13 May 2022)
>>>>
>>>> Yeah, looks like we should just use POST. That’s too bad. QUERY seems
>>>> like a good idea to me.
>>>>
>>>> Distinguishing planning with or without a shard
>>>>
>>>> I think this was a mistake on my part. I was still thinking that we
>>>> would have a different endpoint for first-level planning to produce shards
>>>> and another route to actually get files. Since both are POST requests with
>>>> the same path (/v1/namespaces/ns/tables/t/scans), that no longer works.
>>>> What about /v1/namespaces/ns/tables/t/scan and
>>>> /v1/namespaces/ns/tables/t/plan? The latter could use some variant of
>>>> planFiles since that’s what we are wrapping in the Java API.
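>>>>
>>>> For reference, here is roughly the Java API being wrapped (a sketch;
>>>> the filter and projection are just the running example, and table is
>>>> an already-loaded Table):
>>>>
>>>> import org.apache.iceberg.FileScanTask;
>>>> import org.apache.iceberg.expressions.Expressions;
>>>> import org.apache.iceberg.io.CloseableIterable;
>>>>
>>>> // Plan file-scan tasks for x IN (1, 2, 3), projecting x and a.b.
>>>> try (CloseableIterable<FileScanTask> tasks =
>>>>     table.newScan()
>>>>         .filter(Expressions.in("x", 1, 2, 3))
>>>>         .select("x", "a.b")
>>>>         .planFiles()) {
>>>>   tasks.forEach(task -> System.out.println(task.file().path()));
>>>> }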
>>>>
>>>> Necessity of scan ID
>>>>
>>>> Yes, I agree. If you have shard IDs then you don’t really need a scan
>>>> ID. You could always have one internally but send it as part of the shard
>>>> ID.
>>>>
>>>> Shape of shard payload
>>>>
>>>> I think we have 2 general options depending on how strict we want to be.
>>>>
>>>>    1. Require a standard shard definition
>>>>    2. Allow arbitrary JSON and leave it to the service
>>>>
>>>> I lean toward the first option, which would be a data manifest and the
>>>> associated delete manifests for the partition. We could also extend that to
>>>> a group of manifests, each with a list of delete manifests. And we could
>>>> also allow splitting to ensure tasks don’t get too large with big files.
>>>> This all looks basically like FileScanTask, but with manifests and delete
>>>> manifests.
>>>>
>>>> On Wed, Dec 13, 2023 at 4:39 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> Seems like that draft has expired (This Internet-Draft will expire on
>>>>> 13 May 2022). I am not sure how these RFCs are managed, but it does not
>>>>> seem likely that this verb will make it in. I think people are mostly
>>>>> using POST for this use case already.
>>>>>
>>>>> But overall I think we are in agreement with the general direction. A
>>>>> few detail discussions:
>>>>>
>>>>> *Distinguishing planning with or without a shard*
>>>>> Maybe we should add a query parameter like *distributed=true* to
>>>>> distinguish your first and third cases, since they now share the same
>>>>> signature. If the requester wants to use distributed planning, the
>>>>> response provides a sharding strategy that the requester then uses to
>>>>> send more detailed requests.
>>>>>
>>>>> *Necessity of scan ID*
>>>>> In this approach, is a scan ID still required? Because the shard payload
>>>>> already fully describes the information to retrieve, it seems like we can
>>>>> just drop the *scan-id* query parameter in the second case. It seems to be
>>>>> kept in case we still want to persist some state, but it seems like a fully
>>>>> stateless style could work.
>>>>>
>>>>> *Shape of shard payload*
>>>>> What do you think is the necessary information for the shard payload? It
>>>>> seems like we need at least the location of the manifests, plus the delete
>>>>> manifests or delete files associated with them. I like the idea of making
>>>>> it a "shard task" that is similar to a file scan task; it might allow us
>>>>> to return a mixture of both types of tasks, so we can have better control
>>>>> of the response size.
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Wed, Dec 13, 2023 at 3:50 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>>> I just changed it to POST after looking into support for the QUERY
>>>>>> method. It's a new HTTP method for cases like this, where you don't
>>>>>> want to pass everything through query params. Here's the QUERY method RFC
>>>>>> <https://www.ietf.org/archive/id/draft-ietf-httpbis-safe-method-w-body-02.html>,
>>>>>> but I guess it isn't finalized yet?
>>>>>>
>>>>>> Just read them like you would a POST request that doesn't actually
>>>>>> create anything.
>>>>>>
>>>>>> On Wed, Dec 13, 2023 at 3:45 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks, the Gist explains a lot. This is actually very close to how
>>>>>>> we implemented the shard ID: we defined it as a string whose content
>>>>>>> is similar to the information in the JSON payload you showed, so we
>>>>>>> can persist minimal information in storage.
>>>>>>>
>>>>>>> Just one clarification needed for your Gist:
>>>>>>>
>>>>>>> > QUERY /v1/namespaces/ns/tables/t/scans?scan-id=1
>>>>>>> > { "shard": { "id": 1, "manifests": ["C"] }, "filter": { "type":
>>>>>>> > "in", "term": "x", "values": [1, 2, 3] } }
>>>>>>> >
>>>>>>> > { "file-scan-tasks": [...] }
>>>>>>>
>>>>>>> Here, what does this QUERY verb mean? Is that a GET? If it's GET, we
>>>>>>> cannot have a request body. That's actually why we expressed that as
>>>>>>> an ID string, since we can put it as a query parameter.
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Dec 13, 2023 at 3:25 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> Jack,
>>>>>>>>
>>>>>>>> It sounds like what I’m proposing isn’t quite clear because your
>>>>>>>> initial response was arguing for a sharding capability. I agree that
>>>>>>>> sharding is a good idea. I’m less confident about two points:
>>>>>>>>
>>>>>>>>    1. Requiring that the service is stateful. As Renjie pointed
>>>>>>>>    out, that makes it harder to scale the service.
>>>>>>>>    2. The need for both pagination *and* sharding as separate
>>>>>>>>    things
>>>>>>>>
>>>>>>>> And I also think that Fokko has a good point about trying to keep
>>>>>>>> things simple and not requiring the CreateScan endpoint.
>>>>>>>>
>>>>>>>> For the first point, I’m proposing that we still have a CreateScan
>>>>>>>> endpoint, but instead of sending only a list of shard IDs it can also
>>>>>>>> send either a standard shard “task” or an optional JSON definition.
>>>>>>>> Let’s assume we can send arbitrary JSON for an example. Say I have a
>>>>>>>> table with 4 manifests, A through D, and that C and D match some query
>>>>>>>> filter. When I call the CreateScan endpoint, the service would send
>>>>>>>> back tasks with that information: {"id": 1, "manifests": ["C"]},
>>>>>>>> {"id": 2, "manifests": ["D"]}. By sending what the shards mean (the
>>>>>>>> manifests to read), my service can be stateless: any node can get a
>>>>>>>> request for shard 1, read manifest C, and send back the resulting
>>>>>>>> data files.
>>>>>>>>
>>>>>>>> I don’t see much of an argument against doing this *in principle*.
>>>>>>>> It gives you the flexibility to store state if you choose, or to send
>>>>>>>> state to the client for it to pass back when calling the GetTasks
>>>>>>>> endpoint. There is a practical problem, which is that it’s annoying to
>>>>>>>> send a GET request with a JSON payload because you can’t send a request
>>>>>>>> body. It’s probably obvious, but I’m also not a REST purist, so I’d be
>>>>>>>> fine using POST or QUERY for this. It would look something like this Gist
>>>>>>>> <https://gist.github.com/rdblue/d2b65bd2ad20f85ee9d04ccf19ac8aba>.
>>>>>>>>
>>>>>>>> In your last reply, you also asked whether a stateless service is a
>>>>>>>> goal. I don’t think that it is, but if we can make simple changes to
>>>>>>>> the spec to allow more flexibility on the server side, I think that’s
>>>>>>>> a good direction. You also asked about a reference implementation, and
>>>>>>>> I consider CatalogHandlers
>>>>>>>> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java>
>>>>>>>> to be that reference. It does everything except for the work done by
>>>>>>>> your choice of web application framework. It isn’t stateless, but it
>>>>>>>> only relies on a Catalog implementation for persistence.
>>>>>>>>
>>>>>>>> For the second point, I don’t understand why we need both sharding
>>>>>>>> and pagination. That is, if we have a protocol that allows sharding,
>>>>>>>> why is pagination also needed? From my naive perspective on how
>>>>>>>> sharding would work, we should be able to use metadata from the
>>>>>>>> manifest list to limit the potential number of data files in a given
>>>>>>>> shard. As long as we can limit the size of each shard by producing
>>>>>>>> more of them, pagination seems like an unnecessary complication.
>>>>>>>>
>>>>>>>> Lastly, for Fokko’s point, I think another easy extension to the
>>>>>>>> proposal is to support a direct call to GetTasks. There’s a
>>>>>>>> trade-off here, but if you’re already sending the original filter
>>>>>>>> along with the request (in order to filter records from manifest C,
>>>>>>>> for instance), then the request is already something the protocol can
>>>>>>>> express. There’s an objection concerning resource consumption on the
>>>>>>>> service and creating responses that are too large or take too long,
>>>>>>>> but we can get around that by responding with a code like 413
>>>>>>>> (Payload Too Large) that instructs the client to use the CreateScan
>>>>>>>> API. I think that would allow simple clients to function for all but
>>>>>>>> really large tables. The gist above also shows what this might look
>>>>>>>> like.
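>>>>>>>>
>>>>>>>> As a rough sketch of that fallback on the client side (hypothetical
>>>>>>>> names: baseUrl, filterJson, planDistributed, and consumeFileScanTasks
>>>>>>>> are placeholders, and the endpoint shape is just the running example):
>>>>>>>>
>>>>>>>> import java.net.URI;
>>>>>>>> import java.net.http.HttpClient;
>>>>>>>> import java.net.http.HttpRequest;
>>>>>>>> import java.net.http.HttpResponse;
>>>>>>>>
>>>>>>>> HttpClient client = HttpClient.newHttpClient();
>>>>>>>> HttpRequest request = HttpRequest.newBuilder(
>>>>>>>>         URI.create(baseUrl + "/v1/namespaces/ns/tables/t/scan"))
>>>>>>>>     .header("Content-Type", "application/json")
>>>>>>>>     .POST(HttpRequest.BodyPublishers.ofString(filterJson))
>>>>>>>>     .build();
>>>>>>>> HttpResponse<String> response =
>>>>>>>>     client.send(request, HttpResponse.BodyHandlers.ofString());
>>>>>>>> if (response.statusCode() == 413) {
>>>>>>>>   // Too large for one response: fall back to CreateScan + shards.
>>>>>>>>   planDistributed(baseUrl, filterJson);
>>>>>>>> } else {
>>>>>>>>   consumeFileScanTasks(response.body());
>>>>>>>> }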
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Wed, Dec 13, 2023 at 11:53 AM Jack Ye <yezhao...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The current proposal definitely makes the server stateful. In our
>>>>>>>>> prototype we used other components like DynamoDB to keep track of
>>>>>>>>> state. If keeping it stateless is a tenet, we can definitely move the
>>>>>>>>> proposal closer to that direction. Maybe one thing to make sure of
>>>>>>>>> is: is this a core tenet of the REST spec? Today we do not even have
>>>>>>>>> an official reference implementation of the REST server, so I feel it
>>>>>>>>> is hard to say what the core tenets are. Maybe we should create one?
>>>>>>>>>
>>>>>>>>> Pagination is a common concern in the REST spec. We also see similar
>>>>>>>>> limitations with other APIs like GetTables and GetNamespaces. When a
>>>>>>>>> catalog has many namespaces and tables, it suffers from the same
>>>>>>>>> issue. It is also not ideal for use cases like web browsers, since
>>>>>>>>> typically you display a small page of results and do not need the
>>>>>>>>> full list immediately. So I feel we cannot really avoid keeping some
>>>>>>>>> state for those use cases.
>>>>>>>>>
>>>>>>>>> A chunked response might be a good way to work around it. We also
>>>>>>>>> thought about using HTTP/2. However, these options do not seem very
>>>>>>>>> compatible with OpenAPI. We can do some further research in this
>>>>>>>>> domain; we would really appreciate suggestions from anyone with more
>>>>>>>>> insight and experience with OpenAPI.
>>>>>>>>>
>>>>>>>>> -Jack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Dec 12, 2023 at 6:21 PM Renjie Liu <
>>>>>>>>> liurenjie2...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Rahil and Jack:
>>>>>>>>>>
>>>>>>>>>> Thanks for raising this.
>>>>>>>>>>
>>>>>>>>>> My concern is that pagination and sharding will make the REST
>>>>>>>>>> server stateful, i.e., a sequence of calls is required to go to the
>>>>>>>>>> same server. In that case, how do we ensure the scalability of the
>>>>>>>>>> REST server?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 13, 2023 at 4:09 AM Fokko Driesprong <
>>>>>>>>>> fo...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Rahil and Jack,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for bringing this up. Ryan and I also discussed this
>>>>>>>>>>> briefly in the early days of PyIceberg, and it would have helped a
>>>>>>>>>>> lot with the speed of development. We went for the traditional
>>>>>>>>>>> approach because that would also support all the other catalogs,
>>>>>>>>>>> but now that the REST catalog is taking off, I think it still makes
>>>>>>>>>>> a lot of sense to get it in.
>>>>>>>>>>>
>>>>>>>>>>> I do share the concern Ryan raised around the concepts of shards
>>>>>>>>>>> and pagination. For PyIceberg (but also for Go, Rust, and DuckDB),
>>>>>>>>>>> which live in a single process today, the concept of shards doesn't
>>>>>>>>>>> add value. I see your concern with long-running jobs, but for the
>>>>>>>>>>> non-distributed cases, it will add additional complexity.
>>>>>>>>>>>
>>>>>>>>>>> Some suggestions that come to mind:
>>>>>>>>>>>
>>>>>>>>>>>    - Stream the tasks directly back using a chunked response,
>>>>>>>>>>>    reducing the latency to the first task (see the sketch after
>>>>>>>>>>>    this list). This would also solve the pagination question. The
>>>>>>>>>>>    only downside I can think of is delete files, where you first
>>>>>>>>>>>    need to make sure there are deletes relevant to the task; this
>>>>>>>>>>>    might increase latency to the first task.
>>>>>>>>>>>    - Making the sharding optional. If you want to shard, you call
>>>>>>>>>>>    CreateScan first and then call GetScanTask with the IDs. If you
>>>>>>>>>>>    don't want to shard, you omit the shard parameter and fetch the
>>>>>>>>>>>    tasks directly (here we also need to replace the scan string
>>>>>>>>>>>    with the full column/expression/snapshot-id, etc.).
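>>>>>>>>>>>
>>>>>>>>>>> A minimal sketch of the streaming idea, assuming the service
>>>>>>>>>>> returns newline-delimited JSON tasks (request here is a POST like
>>>>>>>>>>> the scan examples above, and process is a placeholder):
>>>>>>>>>>>
>>>>>>>>>>> import java.net.http.HttpClient;
>>>>>>>>>>> import java.net.http.HttpResponse;
>>>>>>>>>>>
>>>>>>>>>>> HttpClient client = HttpClient.newHttpClient();
>>>>>>>>>>> // ofLines() exposes the chunked body as a lazy stream of lines, so
>>>>>>>>>>> // the client can start on the first task before planning finishes
>>>>>>>>>>> // on the server.
>>>>>>>>>>> client.send(request, HttpResponse.BodyHandlers.ofLines())
>>>>>>>>>>>     .body()
>>>>>>>>>>>     .forEach(taskJson -> process(taskJson));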
>>>>>>>>>>>
>>>>>>>>>>> Looking forward to discussing this tomorrow in the community
>>>>>>>>>>> sync
>>>>>>>>>>> <https://iceberg.apache.org/community/#iceberg-community-events>
>>>>>>>>>>> !
>>>>>>>>>>>
>>>>>>>>>>> Kind regards,
>>>>>>>>>>> Fokko
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Op ma 11 dec 2023 om 19:05 schreef Jack Ye <yezhao...@gmail.com
>>>>>>>>>>> >:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Ryan, thanks for the feedback!
>>>>>>>>>>>>
>>>>>>>>>>>> I was a part of this design discussion internally and can provide
>>>>>>>>>>>> more details. One reason for separating the CreateScan operation
>>>>>>>>>>>> was to make the API asynchronous and thus keep HTTP communications
>>>>>>>>>>>> short. Consider the case where we only have the GetScanTasks API
>>>>>>>>>>>> and no shard is specified. It might take tens of seconds, or even
>>>>>>>>>>>> minutes, to read through the manifest list and all the manifests
>>>>>>>>>>>> before being able to return anything. This means the HTTP
>>>>>>>>>>>> connection has to remain open during that period, which is not
>>>>>>>>>>>> really a good practice in general (consider connection failure,
>>>>>>>>>>>> load balancer and proxy load, etc.). And when we make the API
>>>>>>>>>>>> asynchronous, it basically becomes something like the proposal,
>>>>>>>>>>>> where a stateful ID is generated so the service can return to the
>>>>>>>>>>>> client immediately, and the client gets results by referencing the
>>>>>>>>>>>> ID. So in our current prototype implementation we are actually
>>>>>>>>>>>> keeping this ID, and the whole REST service is stateful.
>>>>>>>>>>>>
>>>>>>>>>>>> We had some thoughts about the possibility of defining a "shard
>>>>>>>>>>>> ID generator" protocol: basically, the client and the service
>>>>>>>>>>>> agree on a way to deterministically generate shard IDs, and the
>>>>>>>>>>>> service uses them to create shards. That sounds like what you are
>>>>>>>>>>>> suggesting here, and it pushes the responsibility for determining
>>>>>>>>>>>> the parallelism to the client side. But in some bad cases (e.g.
>>>>>>>>>>>> there are many delete files and we need to read all of those in
>>>>>>>>>>>> each shard to apply filters), it seems like there might still be
>>>>>>>>>>>> the long-open-connection issue above. What is your thought on
>>>>>>>>>>>> that?
>>>>>>>>>>>>
>>>>>>>>>>>> -Jack
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Dec 10, 2023 at 10:27 AM Ryan Blue <b...@tabular.io>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Rahil, thanks for working on this. It has some really good
>>>>>>>>>>>>> ideas that we hadn't considered before, like a way for the
>>>>>>>>>>>>> service to plan how to break up the work of scan planning. I
>>>>>>>>>>>>> really like that idea because it makes it much easier for the
>>>>>>>>>>>>> service to keep memory consumption low across requests.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My primary feedback is that I think it's a little too complicated
>>>>>>>>>>>>> (with both sharding and pagination) and could be modified slightly
>>>>>>>>>>>>> so that the service doesn't need to be stateful. If the service
>>>>>>>>>>>>> isn't necessarily stateful, then it should be easier to build
>>>>>>>>>>>>> implementations.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To make it possible for the service to be stateless, I'm
>>>>>>>>>>>>> proposing that rather than creating shard IDs that are tracked by
>>>>>>>>>>>>> the service, the information for a shard can be sent to the
>>>>>>>>>>>>> client. My assumption here is that most implementations would
>>>>>>>>>>>>> create shards by reading the manifest list, filtering on
>>>>>>>>>>>>> partition ranges, and creating a shard for some reasonable size
>>>>>>>>>>>>> of manifest content. For example, if a table has 100MB of
>>>>>>>>>>>>> metadata in 25 manifests that are about 4 MB each, then it might
>>>>>>>>>>>>> create 9 shards with 1-4 manifests each. The service could send
>>>>>>>>>>>>> those shards to the client as a list of manifests to read, and
>>>>>>>>>>>>> the client could send the shard information back to the service
>>>>>>>>>>>>> to get the data files in each shard (along with the original
>>>>>>>>>>>>> filter).
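>>>>>>>>>>>>>
>>>>>>>>>>>>> As a minimal sketch of that kind of shard creation (my own; the
>>>>>>>>>>>>> greedy packing and the targetBytes threshold are assumptions, not
>>>>>>>>>>>>> part of the proposal):
>>>>>>>>>>>>>
>>>>>>>>>>>>> import java.util.ArrayList;
>>>>>>>>>>>>> import java.util.List;
>>>>>>>>>>>>> import org.apache.iceberg.ManifestFile;
>>>>>>>>>>>>>
>>>>>>>>>>>>> // Greedily pack data manifests into shards of roughly bounded
>>>>>>>>>>>>> // metadata size, using lengths recorded in the manifest list.
>>>>>>>>>>>>> static List<List<ManifestFile>> planShards(
>>>>>>>>>>>>>     List<ManifestFile> manifests, long targetBytes) {
>>>>>>>>>>>>>   List<List<ManifestFile>> shards = new ArrayList<>();
>>>>>>>>>>>>>   List<ManifestFile> current = new ArrayList<>();
>>>>>>>>>>>>>   long currentBytes = 0;
>>>>>>>>>>>>>   for (ManifestFile manifest : manifests) {
>>>>>>>>>>>>>     if (!current.isEmpty()
>>>>>>>>>>>>>         && currentBytes + manifest.length() > targetBytes) {
>>>>>>>>>>>>>       shards.add(current);
>>>>>>>>>>>>>       current = new ArrayList<>();
>>>>>>>>>>>>>       currentBytes = 0;
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>     current.add(manifest);
>>>>>>>>>>>>>     currentBytes += manifest.length();
>>>>>>>>>>>>>   }
>>>>>>>>>>>>>   if (!current.isEmpty()) {
>>>>>>>>>>>>>     shards.add(current);
>>>>>>>>>>>>>   }
>>>>>>>>>>>>>   return shards;
>>>>>>>>>>>>> }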
>>>>>>>>>>>>>
>>>>>>>>>>>>> There's a slight trade-off in that the protocol needs to define
>>>>>>>>>>>>> how to break the work into shards. I'm interested in hearing
>>>>>>>>>>>>> whether that would work with how you were planning on building
>>>>>>>>>>>>> the service on your end. Another option is to let the service
>>>>>>>>>>>>> send back arbitrary JSON that would get returned for each shard.
>>>>>>>>>>>>> Either way, I like that this would make it so the service doesn't
>>>>>>>>>>>>> need to persist anything. We could also make it so that small
>>>>>>>>>>>>> tables don't require multiple requests. For example, a client
>>>>>>>>>>>>> could call the route to get file tasks with just a filter.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Dec 8, 2023 at 10:41 AM Chertara, Rahil
>>>>>>>>>>>>> <rcher...@amazon.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My name is Rahil Chertara, and I’m a part of the Iceberg team at
>>>>>>>>>>>>>> Amazon EMR and Athena. I’m reaching out to share a proposal for a
>>>>>>>>>>>>>> new Scan API that will be utilized by the RESTCatalog. Table scan
>>>>>>>>>>>>>> planning is currently done within client engines such as Apache
>>>>>>>>>>>>>> Spark. By moving scan functionality to the RESTCatalog, we can
>>>>>>>>>>>>>> integrate Iceberg table scans with external services, which can
>>>>>>>>>>>>>> lead to several benefits.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For example, we can leverage caching and indexes on the server
>>>>>>>>>>>>>> side to improve planning performance. Furthermore, by moving this
>>>>>>>>>>>>>> scan logic to the RESTCatalog, non-JVM engines can integrate more
>>>>>>>>>>>>>> easily. This can all be found in the detailed proposal below.
>>>>>>>>>>>>>> Feel free to comment and add your suggestions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Detailed proposal:
>>>>>>>>>>>>>> https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Github POC: https://github.com/apache/iceberg/pull/9252
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Rahil Chertara
>>>>>>>>>>>>>> Amazon EMR & Athena
>>>>>>>>>>>>>> rcher...@amazon.com
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Tabular
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
