Looking at
https://arrow.apache.org/docs/r/reference/open_dataset.html#arg-factory-options,
it seems that `exclude_invalid_files` is slow on remote file systems
because of the cost of accessing each file up front to determine if it is
valid. There is also `selector_ignore_prefixes`, but your marker is a suffix,
so unfortunately that doesn't help. I feel like I've heard of this `$folder$`
marker before, but I'm not sure how others handle it. Maybe there is a way to
construct an S3FileSystem object with that filtering baked in and pass it to
open_dataset(), but I'm not sure.
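
One possible workaround, as an untested sketch rather than a confirmed fix: list
the object keys through the filesystem object, keep only the Parquet files by
suffix, and hand that explicit file list to open_dataset(). The helper names
`keep_data_files` and `open_dataset_without_markers`, and the `bucket` and
`prefix` arguments, are hypothetical. I have not verified that this is faster
on S3, nor that the Hive partition columns are still inferred when an explicit
file vector is passed.

```r
# Hypothetical helper: keep only paths that end in the data-file suffix.
# This drops zero-byte "_$folder$" marker objects (and directory entries)
# without opening any file.
keep_data_files <- function(paths, suffix = ".parquet") {
  paths[endsWith(paths, suffix)]
}

# Hypothetical helper: list everything under `prefix`, filter by suffix,
# and open only the remaining files. `bucket` and `prefix` are placeholders.
open_dataset_without_markers <- function(bucket, prefix) {
  fs <- arrow::s3_bucket(bucket)                 # filesystem rooted at the bucket
  all_paths <- fs$ls(prefix, recursive = TRUE)   # paths relative to the bucket
  files <- keep_data_files(all_paths)
  arrow::open_dataset(files, filesystem = fs, format = "parquet")
}
```

The filtering step is plain string matching, so it can be checked locally on a
made-up listing before pointing it at a real bucket:

```r
keep_data_files(c(
  "some_path/entity=ABC/syncDate=01-01-2024/country=US/part-0.snappy.parquet",
  "some_path/entity=ABC/syncDate=01-01-2024_$folder$"
))
```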

I dug into the history of why this is only prefixes and not something more
general like regular expressions, and it looks like it was just an
expedient choice at the time. I filed
https://github.com/apache/arrow/issues/44662 about adding regex filtering
here; it seems like it would be useful.

Neal



On Wed, Nov 6, 2024 at 3:11 AM Huschto, Tony <[email protected]> wrote:

> Dear all,
>
> I'm using the arrow package to access partitioned parquet data on an AWS
> S3 bucket. The structure is the typical
>
>
> s3://some_path/entity=ABC/syncDate=mm-dd-yyyy/country=US/part***.snappy.parquet
>
> Reading the files works very well using
>
> DS <- arrow::open_dataset(sources = "s3://some_path/entity=ABC")
> AT <- DS$NewScan()$Finish()$ToTable()
> DF <- as.data.frame(AT)
>
> But this works only if the structure only contains the parquet files. In
> some instances there are additional artifacts, e.g.
>
> s3://some_path/entity=ABC/syncDate=mm-dd-yyyy_$folder$
>
> which are files of size 0. Is there any way to set up the open_dataset()
> command to ignore these files? I tried the exclude_invalid_files option,
> but this takes forever. Furthermore I tried to eliminate the irrelevant
> files from DS$files, but wasn't able to manipulate this particular
> variable. Setting up something like
>
> DS <- arrow::open_dataset(sources = sourcePath)
> listFiles <- DS$files[!grepl("$folder$",DS$files,fixed=TRUE)]
> DS2 <- arrow::open_dataset(sources = listFiles)
>
> also takes an enormous amount of time.
>
> Any help is greatly appreciated!
>
> Thanks,
> Tony
