[ https://issues.apache.org/jira/browse/ARROW-17597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600838#comment-17600838 ]
Carl Boettiger commented on ARROW-17597:
----------------------------------------

Just a note: I think the additional latency in the S3 version is not nearly as significant when the CSV file is not compressed. Reading directly over HTTP is, unsurprisingly, a good deal faster compressed than uncompressed, so it is odd that compression adds latency on the S3 path.

> [R][C++] Why is read_csv_arrow so much slower when using S3 path notation?
> --------------------------------------------------------------------------
>
>                 Key: ARROW-17597
>                 URL: https://issues.apache.org/jira/browse/ARROW-17597
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>            Reporter: Carl Boettiger
>            Priority: Minor
>
> Consider these two mechanisms for reading from a public bucket. I was struck
> to see that using S3 path notation was consistently over 20 times slower than
> using the https address directly. I could imagine a small overhead for using
> S3, but compared to other operations it seems something weird is going on
> here:
> {code:java}
> library(arrow)
> targe <- s3_bucket("neon4cast-targets",
>                    endpoint_override="data.ecoforecast.org", anonymous=TRUE)
> bench::bench_time({ # 58.6 seconds
>   ex1 <- read_csv_arrow(targe$path("terrestrial_30min/terrestrial_30min-targets.csv.gz"))
> })
> bench::bench_time({ # 2.7 sec
>   ex2 <- read_csv_arrow("https://data.ecoforecast.org/neon4cast-targets/terrestrial_30min/terrestrial_30min-targets.csv.gz")
> })
> {code}