[ https://issues.apache.org/jira/browse/ARROW-11433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278186#comment-17278186 ]
Neal Richardson commented on ARROW-11433: ----------------------------------------- If mimalloc is faster, perhaps we should use it instead of jemalloc > [R] Unexpectedly slow results reading csv > ----------------------------------------- > > Key: ARROW-11433 > URL: https://issues.apache.org/jira/browse/ARROW-11433 > Project: Apache Arrow > Issue Type: Bug > Components: R > Reporter: Jonathan Keane > Priority: Minor > > This came up working on benchmarking Arrow's CSV reading. As far as I can > tell this only impacts R, and only when reading the csv into arrow (but not > pulling it in to R). It appears that most arrow interactions after the csv is > read will result in this behavior not happening. > What I'm seeing is that on subsequent reads, the time to read gets longer and > longer (frequently in a stair step pattern where every other iteration takes > longer). > {code:r} > > system.time({ > + for (i in 1:10) { > + print(system.time(tab <- > read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE))) > + tab <- NULL > + } > + }) > user system elapsed > 24.788 19.485 7.216 > user system elapsed > 24.952 21.786 9.225 > user system elapsed > 25.150 23.039 10.332 > user system elapsed > 25.382 31.012 17.995 > user system elapsed > 25.309 25.140 12.356 > user system elapsed > 25.302 26.975 13.938 > user system elapsed > 25.509 34.390 21.134 > user system elapsed > 25.674 28.195 15.048 > user system elapsed > 25.031 28.094 16.449 > user system elapsed > 25.825 37.165 23.379 > # total time: > user system elapsed > 256.178 299.671 175.119 > {code} > Interestingly, doing something as unrelated as > {{arrow:::default_memory_pool()}} which is [only getting the default memory > pool|https://github.com/apache/arrow/blob/f291cd7b96463a2efd40a976123c64fad5c01058/r/src/memorypool.cpp#L68-L70]. > Other interactions totally unrelated to the table also similarly alleviate > this behavior (e.g. {{empty_tab <- Table$create(data.frame())}}) or > proactively invalidating with {{tab$invalidate()}} > {code:r} > > system.time({ > + for (i in 1:10) { > + print(system.time(tab <- > read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE))) > + pool <- arrow:::default_memory_pool() > + tab <- NULL > + } > + }) > user system elapsed > 25.257 19.475 6.785 > user system elapsed > 25.271 19.838 6.821 > user system elapsed > 25.288 20.103 6.861 > user system elapsed > 25.188 20.290 7.217 > user system elapsed > 25.283 20.043 6.832 > user system elapsed > 25.194 19.947 6.906 > user system elapsed > 25.278 19.993 6.834 > user system elapsed > 25.355 20.018 6.833 > user system elapsed > 24.986 19.869 6.865 > user system elapsed > 25.130 19.878 6.798 > # total time: > user system elapsed > 255.381 210.598 83.109 > > > {code} > I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0 and all experience the > same behavior. > I checked against pyarrow, and do not see the same: > {code:python} > from pyarrow import csv > import time > for i in range(1, 10): > start = time.time() > table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv") > print(time.time() - start) > del table > {code} > results: > {code} > 7.586184978485107 > 7.542470932006836 > 7.92852783203125 > 7.647372007369995 > 7.742412805557251 > 8.101378917694092 > 7.7359960079193115 > 7.843957901000977 > 7.6457719802856445 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)