Moelf opened a new issue, #417:
URL: https://github.com/apache/arrow-julia/issues/417
This is an example demonstrating what "early returning" means, following a
discussion on Slack with Alexander Plavin.
## tl;dr
The idea is that you filter on N columns, but perhaps 20% of those columns
are enough to fail 80% of the rows **early**. We need an ergonomic interface
that delays allocation (or worse, decompression, when the Feather file is
compressed) for as long as possible.
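A minimal sketch of the idea, independent of Arrow.jl. The names `early_return_mask` and `CountingColumn` are hypothetical and exist only for illustration; `CountingColumn` stands in for a column whose `getindex` is expensive (allocation or decompression) and counts how often it is hit.

```julia
# Hypothetical stand-in for a lazily materialized column: counts element accesses.
struct CountingColumn{T} <: AbstractVector{T}
    data::Vector{T}
    hits::Base.RefValue{Int}
end
CountingColumn(data::Vector{T}) where {T} = CountingColumn{T}(data, Ref(0))
Base.size(c::CountingColumn) = size(c.data)
function Base.getindex(c::CountingColumn, i::Int)
    c.hits[] += 1          # every access is assumed to be costly
    return c.data[i]
end

# Hypothetical helper: each check is `predicate => column`. For every row the
# checks run in order, and a column is only indexed if all earlier checks passed.
function early_return_mask(checks::Pair...)
    n = length(first(checks).second)
    mask = falses(n)
    for i in 1:n
        ok = true
        for (pred, col) in checks
            if !pred(col[i])
                ok = false
                break      # early return: later columns are never touched
            end
        end
        mask[i] = ok
    end
    return mask
end

x = CountingColumn([[0.1, 0.9], [0.2], Float64[], [0.8]])
y = CountingColumn([[6.0], [1.0], [7.0], [0.5]])

# Put the cheap, selective check on `y` first, then the check on `x`.
mask = early_return_mask(
    (ys -> maximum(ys; init=0.0) >= 5) => y,
    (xs -> maximum(xs; init=0.0) >= 0.7) => x,
)
```

Here `y` is read for all four rows, but `x` is read only for the two rows that passed the `y` check (`y.hits[] == 4`, `x.hits[] == 2`), which is the saving an ergonomic interface should make easy to get.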
## Setup
```julia
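julia> using Arrow, BenchmarkTools

julia> function gendata()
           x = [rand(rand(0:10)) for _ = 1:10^5]
           y = [randn(rand(0:10)) for _ = 1:10^5]
           (;x, y)
       end

julia> foreach(1:10) do _
           Arrow.append("./out.feather", gendata())
       end

julia> tbl = Arrow.Table("./out.feather");  # load the file written above; `tbl` is used in the benchmarks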
```
## Benchmark
```julia
julia> function kernel1(xs, ys)
           s1 = maximum(ys; init=0.0)
           s1 < 5 && return false
           maximum(xs; init=0.0) < 0.7 && return false
           return true
       end
kernel1 (generic function with 1 method)
julia> @benchmark map(kernel1, tbl.x, tbl.y)
BenchmarkTools.Trial: 19 samples with 1 evaluation.
Range (min … max): 264.955 ms … 271.889 ms ┊ GC (min … max): 3.83% … 4.93%
Time (median): 267.430 ms ┊ GC (median): 3.82%
Time (mean ± σ): 267.704 ms ± 2.096 ms ┊ GC (mean ± σ): 4.17% ± 0.47%
▁ ▁ ██ ▁▁ ▁ ▁ █▁ ▁ ▁ ▁ ▁ ▁ ▁
█▁▁▁█▁██▁▁▁▁██▁▁▁▁▁█▁█▁▁▁▁▁██▁▁█▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁█ ▁
265 ms Histogram: frequency by time 272 ms <
Memory estimate: 192.34 MiB, allocs estimate: 2000004.
julia> function kernel2(xss, yss)
           map(eachindex(xss)) do i
               ys = yss[i]
               s1 = maximum(ys; init=0.0)
               s1 < 5 && return false
               xs = xss[i]  # only materialized for rows that passed the `ys` check
               maximum(xs; init=0.0) < 0.7 && return false
               return true
           end
       end
kernel2 (generic function with 1 method)
julia> @benchmark kernel2(tbl.x, tbl.y)
BenchmarkTools.Trial: 34 samples with 1 evaluation.
Range (min … max): 149.177 ms … 156.043 ms ┊ GC (min … max): 3.57% … 3.49%
Time (median): 150.438 ms ┊ GC (median): 3.57%
Time (mean ± σ): 151.088 ms ± 1.503 ms ┊ GC (mean ± σ): 3.92% ± 0.69%
██▂ ▂
▅▁▅▁▁▁▅▅███▅█▅▁█▅▁▁▁▁▁▅▁▁▁▁▁▁▅▁▅▅▁▅▅▅▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
149 ms Histogram: frequency by time 156 ms <
Memory estimate: 96.62 MiB, allocs estimate: 1000002.
```