Hi all,
I'm fairly new to Arrow.
I'm trying to create an Arrow Flight service that reads data from an S3
bucket. On the face of it that appears to be quite simple. Unfortunately I have
a very large bucket with thousands of files spread across partitions.
I'm trying the following in Python:
import pyarrow.dataset as ds

# Dataset discovery -- this is the step that appears to list the whole bucket
dataset = ds.dataset(
    f"{bucket}/{partition_root}/",
    filesystem=s3fs,
    partitioning=my_partitioning_def,
)

# Scan with a filter that would select only a tiny subset of the files
batches = dataset.to_batches(
    filter=my_filter_which_would_select_a_tiny_subset_of_files,
)
From my testing it seems the S3 bucket is listed/scanned during the first step
(the ds.dataset() call), which is extremely inefficient in my use case. Is there
a way to delay the scan until the filter is applied? That could reduce a scan of
many thousands of objects down to a single object in S3.
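To make the question concrete, here is a rough sketch with a hypothetical
hive-style `date` partition (the column name and value are made up). Manually
pointing the dataset at the single matching prefix avoids listing the whole
bucket, but it sidesteps the partitioning and filter machinery, which is what
I'd like to keep using:

import pyarrow.dataset as ds
import pyarrow.compute as pc

# Hypothetical: the filter only hits the partition date=2023-01-01.
# Pointing the dataset at that prefix directly means only this one
# directory gets listed, instead of the entire bucket.
narrow_dataset = ds.dataset(
    f"{bucket}/{partition_root}/date=2023-01-01/",
    filesystem=s3fs,
    format="parquet",
)
batches = narrow_dataset.to_batches(
    filter=pc.field("some_column") == "some_value"  # remaining row-level filter
)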
Hopefully that makes sense.
Thanks
Dan