Dear all,
I'm using the arrow R package to read partitioned Parquet data from an AWS
S3 bucket. The layout is the usual Hive-style partitioning:
s3://some_path/entity=ABC/syncDate=mm-dd-yyyy/country=US/part***.snappy.parquet
Reading the files works very well using
DS <- arrow::open_dataset(sources = "s3://some_path/entity=ABC")
AT <- DS$NewScan()$Finish()$ToTable()
DF <- as.data.frame(AT)
But this works only if the directory tree contains nothing except the
Parquet files. In some instances there are additional artifacts, e.g.
s3://some_path/entity=ABC/syncDate=mm-dd-yyyy_$folder$
which are files of size 0. Is there any way to make open_dataset() ignore
these files? I tried the exclude_invalid_files option, but it takes forever.
I also tried to remove the irrelevant files from DS$files, but could not
modify that field. Setting up something like
DS <- arrow::open_dataset(sources = sourcePath)
listFiles <- DS$files[!grepl("$folder$",DS$files,fixed=TRUE)]
DS2 <- arrow::open_dataset(sources = listFiles)
also takes an enormous amount of time.
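For reference, the filtering step itself can be checked in isolation, without touching S3. This is only a minimal sketch: the two object keys below are hypothetical stand-ins (the real part-file names are elided above), and it says nothing about why the second open_dataset() call is slow.

```r
# Hypothetical object keys, for illustration only; the real part-file
# names are not shown in this mail.
files <- c(
  "some_path/entity=ABC/syncDate=mm-dd-yyyy/country=US/part-0.snappy.parquet",
  "some_path/entity=ABC/syncDate=mm-dd-yyyy_$folder$"
)

# fixed = TRUE makes grepl() match "$folder$" as a literal string.
# Without it, "$" would be read as the regex end-of-string anchor.
is_marker <- grepl("$folder$", files, fixed = TRUE)

# Keep only the paths that are not zero-byte folder markers.
parquet_files <- files[!is_marker]
print(parquet_files)
```

So the selection of files looks right; the time seems to be spent elsewhere.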
Any help is greatly appreciated!
Thanks,
Tony
*Dr. Tony Huschto*
Data Scientist
Roche Diabetes Care GmbH