alamb commented on issue #1836:
URL:
https://github.com/apache/arrow-datafusion/issues/1836#issuecomment-1042856074
It seems to me that if your goal is basically to make some object store calls, get the list of files from S3, and then build a catalog from that snapshot, the memory provider / builder is probably the simplest route to take.
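In code, the "snapshot" route might look roughly like the following. This is a self-contained sketch, not DataFusion API: the hypothetical `list_objects` function stands in for a real S3 listing call, and the map of schemas to tables stands in for a populated in-memory catalog / schema provider:

```rust
use std::collections::BTreeMap;

// Stand-in for one S3 listing call; a real implementation would page
// through ListObjectsV2 results for the bucket.
fn list_objects() -> Vec<String> {
    vec![
        "active/schema1/tableA".to_string(),
        "active/schema1/tableB".to_string(),
        "active/schema2/tableD".to_string(),
    ]
}

// Walk the listing once, up front, and build a schema -> tables map.
// In DataFusion this map would instead be an in-memory catalog provider
// populated with schema providers and registered table providers.
fn build_catalog_snapshot() -> BTreeMap<String, Vec<String>> {
    let mut catalog: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for key in list_objects() {
        // key layout assumed here: <prefix>/<schema>/<table>
        let mut parts = key.splitn(3, '/');
        if let (Some(_prefix), Some(schema), Some(table)) =
            (parts.next(), parts.next(), parts.next())
        {
            catalog
                .entry(schema.to_string())
                .or_default()
                .push(table.to_string());
        }
    }
    catalog
}

fn main() {
    let snapshot = build_catalog_snapshot();
    // Queries now resolve tables from the snapshot without touching S3.
    println!("schemas: {:?}", snapshot.keys().collect::<Vec<_>>());
}
```

The key property is that S3 is consulted exactly once, at build time; everything after that is an in-memory lookup.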
However, if you want to do more sophisticated things (like, for example, not traversing the s3 directory / prefix structure up front, but doing it on demand), a specialized implementation of `S3Catalog` might be more helpful.
As an example of the "on demand" approach, suppose you had an object store laid out like this:
```
s3://active/schema1/tableA
s3://active/schema1/tableB
s3://active/schema1/tableC
s3://active/schema2/tableD
s3://hist/...
...
```
You could make an `S3Catalog` for each of the top-level prefixes (`active` and
`hist`).
If you wrote a query like `SELECT * from active.schema1.tableA` then:
1. The `S3Catalog` for `active` would be asked what schemas it has; it could ask the object store and return `S3Schemas` for `schema1` and `schema2`.
2. The `S3Schema` for `schema1` could then be asked what tables it knows about; it would ask the object store and return table providers for `tableA`, `tableB` and `tableC`.
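The two steps above can be sketched as follows. This is a simplified, self-contained mock (the `list_prefix` function, and the `S3Catalog` / `S3Schema` structs and their methods, are hypothetical stand-ins that only mirror the general shape of DataFusion's catalog and schema provider traits):

```rust
// Stand-in for "list the immediate children of a prefix" against S3;
// a real implementation would issue a delimited ListObjectsV2 request.
fn list_prefix(prefix: &str) -> Vec<String> {
    match prefix {
        "active/" => vec!["schema1".into(), "schema2".into()],
        "active/schema1/" => vec!["tableA".into(), "tableB".into(), "tableC".into()],
        "active/schema2/" => vec!["tableD".into()],
        _ => vec![],
    }
}

// Nothing is cached: every call re-lists the object store on demand.
struct S3Catalog {
    prefix: String, // e.g. "active/"
}

impl S3Catalog {
    // Step 1: ask the catalog what schemas it has.
    fn schema_names(&self) -> Vec<String> {
        list_prefix(&self.prefix)
    }
    fn schema(&self, name: &str) -> Option<S3Schema> {
        self.schema_names()
            .into_iter()
            .find(|s| s == name)
            .map(|s| S3Schema { prefix: format!("{}{}/", self.prefix, s) })
    }
}

struct S3Schema {
    prefix: String, // e.g. "active/schema1/"
}

impl S3Schema {
    // Step 2: ask the schema what tables it knows about. A real
    // implementation would return table providers, not names.
    fn table_names(&self) -> Vec<String> {
        list_prefix(&self.prefix)
    }
}

fn main() {
    // Resolving `active.schema1.tableA` walks the hierarchy on demand.
    let catalog = S3Catalog { prefix: "active/".to_string() };
    let schema = catalog.schema("schema1").expect("schema1 exists");
    println!("tables in schema1: {:?}", schema.table_names());
}
```

Because each lookup re-lists the store, newly added objects show up immediately, at the cost of extra requests during planning.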
As you can imagine there are tradeoffs between the two approaches: the first will take longer to set up but be much faster to query each time (though it also won't see any new files that appear in S3). The second will be very fast to set up and will see new files as they appear, but it will make object store requests during planning and so will be slower.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]