Re: Flight Python EC2 Server for parquet on S3

Bruno Murino Fri, 10 May 2024 11:45:36 -0700

Hi,

It’s my first time here on this mailing list as well.


Regarding the EC2 instance size, I wonder if you’re hitting the IOPS limits of 
the T instance, given the large volumes of data coming out?

I could be way off, though; that’s just where my mind went.

Cheers,

Bruno Murino

> On 10 May 2024, at 19:11, Bryce Mecum <bryceme...@gmail.com> wrote:
> 
> Hi Christian, welcome.
> 
> Your code looks reasonable to me at first glance. It does seem
> possible you're resource-constrained with that t2.micro instance. You
> might try using a larger instance or reducing the batch size in your
> call to iter_batches [1] to some very small number.
> 
> [1] https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile.iter_batches
> 
>> On Fri, May 10, 2024 at 7:30 AM Christian Casazza
>> <christiancasazz...@gmail.com> wrote:
>> 
>> Hello everyone,
>> 
>> This is my first time emailing this mailing list, so I hope I am explaining
>> things correctly below.
>> 
>> I am attempting to get started with Arrow Flight. I am storing parquet
>> files and Iceberg tables on S3. I would like to use arrow flight as the
>> interface data consumers use to access my data so they always receive Arrow
>> back, where they can then continue to iterate locally with DuckDB, polars,
>> etc.
>> 
>> I am first attempting to get it working with a single parquet file in a
>> private bucket on S3. For this test, I am just putting the credentials and
>> paths directly in the server code; once it works, I can move them to
>> environment variables before production.
>> 
>> The parquet file is about 0.6GB. I am running the server on an EC2
>> t2.micro instance.
>> 
>> I was originally running into an ACCESS_DENIED AWS error during the
>> HeadObject operation when attempting to get the flight_info metadata about
>> the file. Following this issue
>> <https://github.com/apache/arrow/issues/37888>, I switched to using s3fs
>> and was able to avoid the HeadObject error. So, the client is able to
>> successfully see the available datasets and return the schema.
>> 
>> When I attempt to actually download the data itself, it is causing my EC2
>> instance to break down and my SSH connection to drop. Is this likely a
>> memory issue, or something with my code?
>> 
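
A rough back-of-the-envelope check of the memory question above. A t2.micro
has 1 GiB of RAM; the file size is taken as 0.6 x 10^9 bytes. The expansion
factor (how much larger the data becomes once decoded into Arrow) and the
share of RAM actually available to the server are assumptions chosen for
illustration, not measurements of this file or instance.

```python
GIB = 1024 ** 3


def estimate_fit(file_bytes, ram_bytes, expansion_factor, usable_fraction=0.5):
    """Return (estimated in-memory bytes, whether that fits the RAM budget).

    expansion_factor and usable_fraction are illustrative assumptions.
    """
    estimated = file_bytes * expansion_factor
    budget = ram_bytes * usable_fraction
    return estimated, estimated <= budget


FILE_BYTES = 600_000_000   # "about 0.6GB" from the message above
T2_MICRO_RAM = 1 * GIB     # t2.micro memory

for factor in (1.0, 3.0):  # assumed expansion factors, not measured
    estimated, fits = estimate_fit(FILE_BYTES, T2_MICRO_RAM, factor)
    print(f"factor {factor}: ~{estimated / GIB:.2f} GiB needed, fits={fits}")
```

Under these assumptions, loading the whole file at once does not fit even
with no expansion at all, which would be consistent with the instance
becoming unresponsive.
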
>> 
>> The goal is to provide users with a common interface to access my data.
>> After getting this working, I would add more datasets and data sources,
>> introduce auth and RBAC, etc. For now, I thought this was a good
>> starting point: the user simply downloads the entire dataset. In the
>> future, I hope to figure out an easy interface to support more
>> fine-grained data/table scans, or a query-first approach that returns
>> only the desired data.
>> 
>> To keep things simple, I just added my code here:
>> <https://github.com/ChristianCasazza/arrowflights3example>.
>> When I was actually testing, I connected to the EC2 instance through VS Code
>> for the server, and I was running the client code locally in a different
>> window. I removed my actual parquet file path and credentials.
>> 
>> 
>> This is my first time working with Arrow Flight, so I apologize if I am
>> overlooking something simple or if the answer was in the docs.
>> 
>> Any suggestions for changes I can make to get the data download working, or
>> clear errors I am making?
>> 
>> Thank you!
>> 
>> Best,
>> Christian Casazza
