I wonder why your workaround is also slow:
```
DS <- arrow::open_dataset(sources = sourcePath)
listFiles <- DS$files[!grepl("$folder$", DS$files, fixed = TRUE)]
DS2 <- arrow::open_dataset(sources = listFiles)
```
That was going to be my suggestion. Do you know which of the three
statements takes a long time? Maybe there is another R library you can use
to quickly list and filter the files in the folder? Creating and scanning
a dataset from a list of files (the DS2 line) should be fast.
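To narrow it down, you could time each statement separately, e.g.:

```
library(arrow)

sourcePath <- "s3://some_path/entity=ABC"  # the path from your example

# Which of the three steps is the expensive one?
t1 <- system.time(DS <- open_dataset(sources = sourcePath))
t2 <- system.time(
  listFiles <- DS$files[!grepl("$folder$", DS$files, fixed = TRUE)]
)
t3 <- system.time(DS2 <- open_dataset(sources = listFiles))
rbind(t1, t2, t3)
```

If the file discovery itself turns out to be the slow part, one thing you could try (an untested sketch; the path is a placeholder and credentials/region come from your environment) is listing the files once through an explicit S3FileSystem and keeping only the parquet files:

```
library(arrow)

# Reuse one filesystem object instead of re-resolving s3:// URIs.
fs <- S3FileSystem$create()
sel <- FileSelector$create("some_path/entity=ABC", recursive = TRUE)
infos <- fs$GetFileInfo(sel)

paths <- vapply(infos, function(i) i$path, character(1))
keep <- paths[grepl("\\.parquet$", paths)]  # drops the $folder$ markers

DS2 <- open_dataset(sources = paste0("s3://", keep))
```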
On Wed, Nov 6, 2024 at 5:43 AM Neal Richardson <[email protected]>
wrote:
> Looking at
> https://arrow.apache.org/docs/r/reference/open_dataset.html#arg-factory-options,
> it seems that `exclude_invalid_files` is slow on remote file systems
> because of the cost of accessing each file up front to determine if it is
> valid. And there is `selector_ignore_prefixes`, but it looks like you have
> a suffix, which is unfortunate. I feel like I've heard of this $folder$
> marker before and am not sure how others handle it. Maybe there is a way to
> construct an S3FileSystem object with that filtering baked in somehow, and
> you would pass that in to open_dataset(), but I'm not sure.
>
> I dug into the history of why this is only prefixes and not something more
> general like regular expressions, and it looks like it was just an
> expedient choice at the time. I filed
> https://github.com/apache/arrow/issues/44662 about adding regex filtering
> here, seems like it would be useful.
>
> Neal
>
>
>
> On Wed, Nov 6, 2024 at 3:11 AM Huschto, Tony <[email protected]>
> wrote:
>
>> Dear all,
>>
>> I'm using the arrow package to access partitioned parquet data on an AWS
>> S3 bucket. The structure is the typical
>>
>>
>> s3://some_path/entity=ABC/syncDate=mm-dd-yyyy/country=US/part***.snappy.parquet
>>
>> Reading the files works very well using
>>
>> DS <- arrow::open_dataset(sources = "s3://some_path/entity=ABC")
>> AT <- DS$NewScan()$Finish()$ToTable()
>> DF <- as.data.frame(AT)
>>
>> But this works only if the structure contains nothing but parquet files. In
>> some instances there are additional artifacts, e.g.
>>
>> s3://some_path/entity=ABC/syncDate=mm-dd-yyyy_$folder$
>>
>> which are files of size 0. Is there any way to set up the open_dataset()
>> command to ignore these files? I tried the exclude_invalid_files option,
>> but this takes forever. Furthermore, I tried to eliminate the irrelevant
>> files from DS$files, but wasn't able to manipulate this particular
>> variable. Setting up something like
>>
>> DS <- arrow::open_dataset(sources = sourcePath)
>> listFiles <- DS$files[!grepl("$folder$", DS$files, fixed = TRUE)]
>> DS2 <- arrow::open_dataset(sources = listFiles)
>>
>> also takes an enormous amount of time.
>>
>> Any help is greatly appreciated!
>>
>> Thanks,
>> Tony
>>
>> *Dr. Tony Huschto*
>> Data Scientist
>>
>> Roche Diabetes Care GmbH
>> DSRIBA
>> Sandhofer Strasse 116
>> 68305 Mannheim/Germany
>>
>> Phone: +4962175969845
>> Mobile: +4915236987520
>> [email protected]
>>