Hi all,
I'm fairly new to Arrow.
I'm trying to create an Arrow Flight service that reads data from an S3
bucket. On the face of it that appears to be quite simple. Unfortunately I have
a very large bucket with thousands of files spread across partitions.
I'm trying the following in Python:
import pyarrow.dataset as ds

# Dataset discovery -- this is the step that appears to list the whole bucket
dataset = ds.dataset(
    f"{bucket}/{partition_root}/",
    filesystem=s3fs,
    partitioning=my_partitioning_def,
)

# Scan with a filter that would select only a tiny subset of the files
batches = dataset.to_batches(
    filter=my_filter_which_would_select_a_tiny_subset_of_files,
)
From my testing it seems the S3 bucket is listed/scanned during the first step
(the ds.dataset() call), which is extremely inefficient in my use case. Is there
a way to delay the scan until the filter is applied? That could reduce a scan of
many thousands of objects down to a single object in S3.
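To make the question concrete, here is a rough sketch with a hypothetical
hive-style `date` partition (the column name and value are made up). Manually
pointing the dataset at the single matching prefix avoids listing the whole
bucket, but it sidesteps the partitioning and filter machinery, which is what
I'd like to keep using:

import pyarrow.dataset as ds
import pyarrow.compute as pc

# Hypothetical: the filter only hits the partition date=2023-01-01.
# Pointing the dataset at that prefix directly means only this one
# directory gets listed, instead of the entire bucket.
narrow_dataset = ds.dataset(
    f"{bucket}/{partition_root}/date=2023-01-01/",
    filesystem=s3fs,
    format="parquet",
)
batches = narrow_dataset.to_batches(
    filter=pc.field("some_column") == "some_value"  # remaining row-level filter
)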
Hopefully that makes sense.
Thanks
Dan