It's the last step that takes a lot of time.

```
DS <- arrow::open_dataset(sources = sourcePath)
listFiles <- DS$files[!grepl("$folder$", DS$files, fixed = TRUE)]
```

runs very fast, but since DS$files does not contain the "s3://" prefix, I have
to add it back to listFiles to make the next call work. Then

```
DS <- arrow::open_dataset(sources = listFiles)
```

takes quite some time. Is there another way to tell open_dataset() directly
that it has to look in an S3 bucket? Moreover, that approach also loses the
partitioning information (from variable1=x/variable2=y).
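
For completeness, this is roughly what the full workaround looks like
(sourcePath is "s3://some_path/entity=ABC" as in my first mail; the
entity/syncDate/country partition columns are then missing from the result):

```
# open once just to list the files, drop the "_$folder$" markers,
# and add the missing "s3://" prefix back before reopening
DS        <- arrow::open_dataset(sources = sourcePath)
listFiles <- paste0("s3://", DS$files[!grepl("$folder$", DS$files, fixed = TRUE)])
DS2       <- arrow::open_dataset(sources = listFiles)
# DS2 reads the parquet files, but without the partition columns from the path
```
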
On Wed, Nov 6, 2024 at 5:56 PM Weston Pace <[email protected]> wrote:
> I wonder why your workaround is also slow:
>
> ```
> DS <- arrow::open_dataset(sources = sourcePath)
> listFiles <- DS$files[!grepl("$folder$",DS$files,fixed=TRUE)]
> DS2 <- arrow::open_dataset(sources = listFiles)
> ```
>
> That was going to be my suggestion. Do you know which of the three
> statements takes a long time? Maybe there is another R library you can use
> to quickly list and filter the files in the folder? Creating and scanning
> a dataset from a list of files (the DS2 line) should be fast.
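>
> For example, something roughly along these lines with the aws.s3 package (an
> untested sketch; "some_bucket" and the prefix are placeholders for your actual
> path):
>
> ```
> library(aws.s3)
> # list the object keys directly (fast), filter out the "_$folder$" markers,
> # and hand only the real parquet files to open_dataset()
> objs <- get_bucket_df(bucket = "some_bucket", prefix = "entity=ABC", max = Inf)
> keys <- objs$Key[!grepl("$folder$", keys <- objs$Key, fixed = TRUE)]
> DS2  <- arrow::open_dataset(sources = paste0("s3://some_bucket/", keys))
> ```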
>
> On Wed, Nov 6, 2024 at 5:43 AM Neal Richardson <
> [email protected]> wrote:
>
>> Looking at
>> https://arrow.apache.org/docs/r/reference/open_dataset.html#arg-factory-options,
>> it seems that `exclude_invalid_files` is slow on remote file systems
>> because of the cost of accessing each file up front to determine if it is
>> valid. And there is `selector_ignore_prefixes`, but it looks like you have
>> a suffix, which is unfortunate. I feel like I've heard of this $folder$
>> marker before and am not sure how others handle it. Maybe there is a way to
>> construct an S3FileSystem object with that filtering baked in somehow, and
>> you would pass that in to open_dataset(), but I'm not sure.
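>>
>> For reference, both of those go in through the factory_options argument,
>> roughly like this (sourcePath as in your mail; note that the prefix variant
>> only matches the beginning of a file name, so it does not help with the
>> $folder$ suffix):
>>
>> ```
>> # checks every file up front, which is what makes it slow on S3
>> DS <- arrow::open_dataset(sources = sourcePath,
>>                           factory_options = list(exclude_invalid_files = TRUE))
>>
>> # cheap, but only skips files whose names start with one of these prefixes
>> DS <- arrow::open_dataset(sources = sourcePath,
>>                           factory_options = list(selector_ignore_prefixes = c(".", "_")))
>> ```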
>>
>> I dug into the history of why this is only prefixes and not something
>> more general like regular expressions, and it looks like it was just an
>> expedient choice at the time. I filed
>> https://github.com/apache/arrow/issues/44662 about adding regex
>> filtering here; it seems like it would be useful.
>>
>> Neal
>>
>>
>>
>> On Wed, Nov 6, 2024 at 3:11 AM Huschto, Tony <[email protected]>
>> wrote:
>>
>>> Dear all,
>>>
>>> I'm using the arrow package to access partitioned parquet data on an AWS
>>> S3 bucket. The structure is the typical
>>>
>>>
>>> s3://some_path/entity=ABC/syncDate=mm-dd-yyyy/country=US/part***.snappy.parquet
>>>
>>> Reading the files works very well using
>>>
>>> DS <- arrow::open_dataset(sources = "s3://some_path/entity=ABC")
>>> AT <- DS$NewScan()$Finish()$ToTable()
>>> DF <- as.data.frame(AT)
>>>
>>> But this only works if the structure contains nothing but the parquet files. In
>>> some instances there are additional artifacts, e.g.
>>>
>>> s3://some_path/entity=ABC/syncDate=mm-dd-yyyy_$folder$
>>>
>>> which are files of size 0. Is there any way to set up the open_dataset()
>>> command to ignore these files? I tried the exclude_invalid_files option,
>>> but this takes forever. Furthermore, I tried to eliminate the irrelevant
>>> files from DS$files, but wasn't able to manipulate this particular
>>> variable. Setting up something like
>>>
>>> DS <- arrow::open_dataset(sources = sourcePath)
>>> listFiles <- DS$files[!grepl("$folder$",DS$files,fixed=TRUE)]
>>> DS2 <- arrow::open_dataset(sources = listFiles)
>>>
>>> also takes an enormous amount of time.
>>>
>>> Any help is greatly appreciated!
>>>
>>> Thanks,
>>> Tony
>>>
>>> Dr. Tony Huschto
>>> Data Scientist
>>>
>>> Roche Diabetes Care GmbH
>>> DSRIBA
>>> Sandhofer Strasse 116
>>> 68305 Mannheim/Germany
>>>