Hi all,
Say I've got a simple program like the following that converts a numpy
array to a pyarrow array several times in a row, and times each of those
conversions:

    import pyarrow
    import numpy as np
    import time

    arr = np.random.rand(1)

    t1 = time.time()
    pyarrow.array(arr)
    t2 = time.time()
    pyarrow.array(arr)
    t3 = time.time()
    pyarrow.array(arr)
    t4 = time.time()
    pyarrow.array(arr)
    t5 = time.time()

I'm noticing that the first call to pyarrow.array() takes ~0.3-0.5 s,
while the rest are nearly instantaneous (~1e-05 s).
Does anyone know what might be causing this? My assumption is some
one-time initialization of pyarrow on the first call to the library, in
which case I'd like to see if there's some way to explicitly trigger
that initialization earlier in the program. I'm also curious to hear
whether there is a different explanation.
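
One way to narrow this down is to time the import and the first call separately. This is a rough sketch, not a statement about what pyarrow does internally. The helper names `timed` and `profile_first_use` are made up for illustration. It needs to run in a fresh interpreter, since a module that is already imported will show an import time near zero.

```python
import importlib
import time


def timed(fn):
    """Run fn() once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def profile_first_use(module_name, first_call):
    """Time importing module_name, then two consecutive calls of first_call(module).

    Hypothetical helper: it only separates import cost from first-call cost,
    it does not explain where the time goes.
    """
    module, import_s = timed(lambda: importlib.import_module(module_name))
    _, first_s = timed(lambda: first_call(module))
    _, second_s = timed(lambda: first_call(module))
    return {"import": import_s, "first_call": first_s, "second_call": second_s}


# Example use for the case above (assumes numpy and pyarrow are installed):
#   import numpy as np
#   arr = np.random.rand(1)
#   print(profile_first_use("pyarrow", lambda pa: pa.array(arr)))
```

If "first_call" stays large while "import" is small, the cost is likely paid lazily on first use rather than at import time. The numbers alone won't say what is being initialized.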
Right now I'm working around this by just calling pyarrow.array([]) at
application startup -- I realize this doesn't actually eliminate the
added time, but it does move it out of the critical section for any
benchmarking runs.
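
The same workaround can be folded into the benchmark itself, so the untimed warm-up call sits right next to the timed ones. This is a minimal sketch. The `benchmark` helper and its `warmup` and `repeats` parameters are hypothetical, not part of pyarrow.

```python
import time


def benchmark(fn, warmup=1, repeats=5):
    """Call fn() `warmup` times untimed, then return durations of `repeats` timed calls."""
    # Warm-up calls absorb any one-time setup cost and are not recorded.
    for _ in range(warmup):
        fn()

    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Example use for the case above (assumes numpy and pyarrow are installed):
#   import numpy as np
#   import pyarrow
#   arr = np.random.rand(1)
#   print(benchmark(lambda: pyarrow.array(arr)))
```

time.perf_counter() is used here instead of time.time() because it is a monotonic, high-resolution clock intended for measuring short intervals.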
Thanks,
Max