Hi all,

Say I've got a simple program like the following that converts a numpy array to a pyarrow array several times in a row, and times each of those conversions:

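   import time

   import numpy as np
   import pyarrow

   arr = np.random.rand(1)

   # Take a timestamp before and after each conversion
   t1 = time.time()
   pyarrow.array(arr)
   t2 = time.time()
   pyarrow.array(arr)
   t3 = time.time()
   pyarrow.array(arr)
   t4 = time.time()
   pyarrow.array(arr)
   t5 = time.time()

   # Per-call durations in seconds
   print(t2 - t1, t3 - t2, t4 - t3, t5 - t4)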

I'm noticing that the first call to pyarrow.array() takes ~0.3-0.5 s, while the rest are nearly instantaneous (~1e-05 s).

Does anyone know what might be causing this? My assumption is that pyarrow does some one-time initialization on the first call into the library, in which case I'd like to know whether there's a way to explicitly trigger that initialization earlier in the program. I'm also curious to hear whether there is a different explanation.
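
To make the pattern easier to reproduce, here is a minimal sketch of a timing helper. The names time_calls and lazy_convert are hypothetical, and lazy_convert is only a stand-in that simulates a one-time setup cost with a sleep; it is not a claim about what pyarrow does internally.

```python
import time


def time_calls(fn, repeats=4):
    """Call fn `repeats` times and return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Hypothetical stand-in for a library that does lazy one-time setup.
_initialized = False


def lazy_convert():
    global _initialized
    if not _initialized:
        time.sleep(0.05)  # simulated one-time initialization cost
        _initialized = True


durations = time_calls(lazy_convert)
print(durations)  # first entry is much larger than the rest
```

Swapping lazy_convert for `lambda: pyarrow.array(arr)` should give the same shape of result if the cost really is paid only once.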

Right now I'm working around this by just calling pyarrow.array([]) at application startup -- I realize this doesn't actually eliminate the added time, but it does move it out of the timed section for any benchmarking runs.
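
For reference, the workaround looks roughly like the sketch below. The warm_up helper and the startup_cost variable are hypothetical names, and the ImportError fallback is only there so the snippet still runs on a machine without pyarrow installed.

```python
import time


def warm_up(first_call):
    """Run first_call once, discard its result, and return the elapsed seconds."""
    start = time.perf_counter()
    first_call()
    return time.perf_counter() - start


try:
    import pyarrow

    # Empty conversion at startup, so later timed calls skip the first-call cost.
    startup_cost = warm_up(lambda: pyarrow.array([]))
except ImportError:
    # pyarrow is not installed: fall back to a no-op so the sketch still runs.
    startup_cost = warm_up(lambda: None)

print(f"warm-up took {startup_cost:.6f} s")
```

Measuring the warm-up separately keeps the one-time cost visible instead of hiding it.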

Thanks,

Max
