Hi, it’s my first time on this mailing list as well.
Regarding the EC2 instance size, I wonder if you’re hitting the IOPS limits of the T instance, given the large volume of data coming out? I could be way off, but that’s where my mind went.

Cheers,
Bruno Murino

> On 10 May 2024, at 19:11, Bryce Mecum <bryceme...@gmail.com> wrote:
>
> Hi Christian, welcome.
>
> Your code looks reasonable to me at first glance. It does seem
> possible you're resource-constrained with that t2.micro instance. You
> might try using a larger instance or reducing the batch size in your
> call to iter_batches [1] to some very small number.
>
> [1] https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile.iter_batches
>
>> On Fri, May 10, 2024 at 7:30 AM Christian Casazza
>> <christiancasazz...@gmail.com> wrote:
>>
>> Hello everyone,
>>
>> This is my first time emailing this mailing list, so I hope I am explaining
>> things correctly below.
>>
>> I am attempting to get started with Arrow Flight. I am storing Parquet
>> files and Iceberg tables on S3. I would like to use Arrow Flight as the
>> interface data consumers use to access my data, so they always receive
>> Arrow data back and can then continue to iterate locally with DuckDB,
>> Polars, etc.
>>
>> I am first attempting to get it working with a single Parquet file in a
>> private bucket on S3. For this test, I am putting the credentials and
>> paths directly in the server code; once it works I can move them to
>> environment variables before production.
>>
>> The Parquet file is about 0.6 GB. I am running the server on a t2.micro
>> EC2 instance.
>>
>> I was originally running into an ACCESS_DENIED during HeadObject operation
>> AWS error when attempting to get the flight_info metadata about the file.
>> Following this issue <https://github.com/apache/arrow/issues/37888>, I
>> switched to s3fs, which avoided the HeadObject error. So the client is
>> able to successfully see the available datasets and return the schema.
>>
>> When I attempt to actually download the data itself, my EC2 instance
>> breaks down and my SSH connection drops. Is this likely a memory issue,
>> or something with my code?
>>
>> The goal is to provide users with a common interface to access my data.
>> After getting this working, I would add more datasets and data sources,
>> introduce auth and RBAC, etc. For now, I thought this was a good base
>> starting point, and I am just having the user download the entire
>> dataset. In the future, I hope to figure out an easy interface to support
>> more fine-grained data/table scans, or to support a query-first approach
>> that returns only the desired data.
>>
>> To keep things simple, I put my code here:
>> https://github.com/ChristianCasazza/arrowflights3example
>> When I was testing, I connected to the EC2 instance through VS Code for
>> the server, and I was running the client code locally in a different
>> window. I removed my actual Parquet file path and credentials.
>>
>> This is my first time working with Arrow Flight, so I apologize if I am
>> overlooking something simple or if the answer was in the docs.
>>
>> Any suggestions for changes I can make to get the data download working,
>> or clear errors I am making?
>>
>> Thank you!
>>
>> Best,
>> Christian Casazza