KristofferC opened a new issue, #528:
URL: https://github.com/apache/arrow-julia/issues/528
Running the following code, which generates some data and then reads it back with
`Arrow.Table`, shows a severe slowdown when Julia is started with multiple threads:
```julia
using DataFrames, Dates, Arrow, StatsBase, Random, InlineStrings
function generate_data(f)
    number_of_companies = 10000
    dates = collect(Date(2001,1,1):Day(1):Date(2020,12,31))
    companyid = sample(100000:1000000, number_of_companies, replace = false)
    number_of_items = length(companyid)*length(dates)
    df = DataFrame(
        dates = repeat(dates, outer = number_of_companies),
        companyid = repeat(companyid, inner = length(dates)),
        item1 = rand(number_of_items),
        item2 = randn(number_of_items),
        item3 = rand(1:1000,number_of_items),
        item4 = repeat([String7(randstring(['a':'z' 'A':'Z'],5)) for _ in 1:number_of_companies],
                       length(dates))
    )
    @info "Saving to $f"
    open(f, "w") do f
        Arrow.write(f, Tables.partitioner(groupby(df,:dates)))
    end
end
f = "mytestdata.arrow"
if !isfile(f)
    generate_data(f)
end
Arrow.Table(f)
@time Arrow.Table(f)
```
Results:
```
❯ julia arrowthreads.jl
  0.203852 seconds (2.38 M allocations: 126.388 MiB, 34.93% gc time, 1.32% compilation time)
❯ julia --project --threads=3 arrowthreads.jl
  6.603782 seconds (2.39 M allocations: 126.349 MiB, 0.46% gc time)
```
`Arrow.Table` spawns a task here:
https://github.com/apache/arrow-julia/blob/2696105d01cfda7c55d1902951a20908a3c205e5/src/table.jl#L525C18-L528
Profiling shows that almost all of the time is spent waiting on the lock in
https://github.com/JuliaServices/ConcurrentUtilities.jl/blob/5fced8291da84bd081cb2e27d2e16f5bc8081f38/src/synchronizer.jl#L108
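
The report does not say how the profile was collected. One possible way to reproduce that kind of measurement, assuming Julia's `Profile` standard library, is sketched below. The helper name `profile_once` and the stand-in workload are hypothetical and not part of Arrow.jl or the original script.

```julia
using Profile

# Hypothetical helper (not part of Arrow.jl): profile a single call of `work`.
function profile_once(work)
    work()                  # warm-up run, so compilation does not dominate the samples
    Profile.clear()         # discard samples left over from earlier runs
    @profile work()         # collect samples for exactly one call
    return Profile.fetch()  # raw sample buffer; Profile.print() gives a readable report
end

# Stand-in workload so the snippet runs without Arrow or the data file.
samples = profile_once(() -> sum(sqrt, rand(10^6)))
```

With the data file from the script above, the same helper could be called as `profile_once(() -> Arrow.Table("mytestdata.arrow"))` in a session started with `--threads=3`, followed by `Profile.print(format = :flat, sortedby = :count)` to see which frames collect the most samples.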