Thank you, Daniël, for taking the time to go through the slides!

S3 Select is an interesting beast, but I think the benefit we could draw
from it in this use case is pretty limited:
- for now Buzz focuses on Parquet data, which already allows efficient
projection (it uses HTTP Range requests to download only the relevant
parts of the files), and once this is supported by DataFusion, we might
even push down filters to skip downloading entire row groups.
- S3 Select can only output CSV and JSON, so in the cases where you have to
bring back a lot of data, it would actually amplify the volume of data
fetched from S3 and make the deserialization more expensive.
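
As a rough illustration of the first point (a sketch under my own
assumptions, not Buzz's actual code): the Parquet footer records where each
column chunk lives in the file, so a projection translates into a handful of
byte ranges to request. The `ColumnChunk` type, the `projection_ranges`
helper and the offsets below are hypothetical, made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnChunk:
    """Hypothetical view of one column chunk entry read from a Parquet footer."""
    column: str
    offset: int  # first byte of the chunk within the file
    length: int  # size of the chunk in bytes


def projection_ranges(chunks, wanted, max_gap=0):
    """Return merged, inclusive (start, end) byte ranges for the wanted columns.

    Ranges separated by at most `max_gap` bytes are merged, trading a few
    wasted bytes for fewer HTTP requests.
    """
    spans = sorted(
        (c.offset, c.offset + c.length - 1) for c in chunks if c.column in wanted
    )
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1] + 1 + max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def range_header(start, end):
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={start}-{end}"


# Made-up layout: two row groups, three columns each.
FOOTER = [
    ColumnChunk("a", 4, 1000), ColumnChunk("b", 1004, 5000), ColumnChunk("c", 6004, 2000),
    ColumnChunk("a", 8004, 1000), ColumnChunk("b", 9004, 5000), ColumnChunk("c", 14004, 2000),
]

for start, end in projection_ranges(FOOTER, {"a", "c"}):
    print(range_header(start, end))
```

With this made-up layout, selecting columns a and c needs three requests
covering 6,000 of the 16,000 data bytes; the wide column b is never fetched.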

There are still some situations where S3 Select would definitely be
beneficial, but it would be quite hard to automatically identify those and
let S3 Select kick in accordingly.

Have you used S3 Select at scale? Does it provide good and consistent
latencies?

On Wed, Feb 10, 2021 at 19:35, Daniël Heres <danielhe...@gmail.com> wrote:

> Thanks for sharing the slides Rémi! That looks really cool.
>
> One question I have after this: do you plan to use S3 Select (
> https://aws.amazon.com/blogs/aws/s3-glacier-select/)? It seems it would fit
> your architecture nicely, and I think it shouldn't be too hard to create the
> query from the filters/projection in the datasource scan method to spend
> less time in Lambda.
>
> On Wed, Feb 10, 2021, 18:44 Rémi Dettai <rdet...@gmail.com> wrote:
>
> > Thanks for the notes Andy. Here is the slide deck I presented, for further
> > reference:
> >
> > https://docs.google.com/presentation/d/1uZ5PbazC1zCX24k0Hh-UItddIh9BRvD5GL7NUDgc9eQ/edit?usp=sharing
> >
> > If anyone wants to see how it works in practice and does not have an AWS
> > account to try it out, feel free to reach out to me and I can walk you
> > through it!
> >
> > On Wed, Feb 10, 2021 at 18:37, Andy Grove <andygrov...@gmail.com> wrote:
> >
> > > Attendees
> > >
> > >
> > >    - Andy Grove
> > >    - Benjamin Blodgett
> > >    - Marc Prud’Hommeaux
> > >    - Mike Seddon
> > >    - Jorge Leitao
> > >    - Andrew Lamb
> > >    - Fernando Herrera
> > >    - Neville Dipale
> > >    - Remi Dettai
> > >
> > >
> > > (Please let me know if I have misspelled anyone’s names)
> > >
> > > Topics Discussed
> > >
> > >
> > >    - Discussion of Jorge’s proposal to redesign Arrow crate to resolve
> > >      safety violations (following on from mailing list discussion)
> > >    - Mike has a PR up to implement a large number of Postgres string
> > >      functions that needs reviewing
> > >    - Remi gave a short presentation about his Buzz project which provides
> > >      serverless compute using Arrow and DataFusion
> > >
> > >
> > > Planned for next time:
> > >
> > >
> > >    - Marc Prud’Hommeaux to give a presentation/demo on his use of Arrow
> > >    - Andy Grove to give a presentation/demo on Ballista, which provides
> > >      distributed query execution using DataFusion
> > >
> > >
> > > On Wed, Feb 10, 2021 at 8:56 AM Andy Grove <andygrov...@gmail.com> wrote:
> > >
> > > > A quick reminder that the bi-weekly Arrow Rust sync call starts about an
> > > > hour from now. Everyone is welcome.
> > > >
> > > > https://meet.google.com/ctp-yujs-aee
> > > >
> > > > Thanks,
> > > >
> > > > Andy.
> > > >
> > >
> >
>
