The Arrow C++ libraries also do some one-time static initialization, so we should find out whether the slowdown is entirely due to importing pandas or to something else.
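
One rough way to separate the two, as a sketch only: time the first and second pyarrow.array() calls and check whether pandas appears in sys.modules as a side effect of the first one. The helper names time_call and profile_first_call are made up for illustration and are not part of pyarrow; the sketch assumes numpy and pyarrow are installed, and whether pandas actually gets imported may depend on the pyarrow version.

```python
import sys
import time


def time_call(fn, *args):
    """Return the wall-clock seconds taken by a single call fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


def profile_first_call():
    """Time two pyarrow.array() calls and record whether pandas got imported.

    Hypothetical helper for illustration; requires numpy and pyarrow.
    """
    import numpy as np
    import pyarrow

    arr = np.random.rand(1)
    report = {}
    # Was pandas already loaded before the first conversion?
    report["pandas_loaded_before"] = "pandas" in sys.modules
    report["first_call_s"] = time_call(pyarrow.array, arr)
    # If this flips to True, the first call triggered the pandas import.
    report["pandas_loaded_after"] = "pandas" in sys.modules
    report["second_call_s"] = time_call(pyarrow.array, arr)
    return report
```

Calling profile_first_call() in a fresh interpreter, and then again in another fresh interpreter with "import pandas" executed beforehand, should show how much of the first-call cost remains once pandas is already loaded. Whatever is left would then be a candidate for the static initialization mentioned above.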

On Fri, Aug 28, 2020 at 1:48 AM Joris Van den Bossche <[email protected]> wrote:
>
> Hi Max,
>
> I assume (part of) the slowdown comes from trying to import pandas. If I add
> an "import pandas" to your script, the difference with the first run is much
> smaller (although still a difference).
>
> Inside the array function, we are lazily importing pandas to check if the
> input is a pandas object. I suppose that in theory, if the input is a numpy
> array, we should also be able to avoid this pandas import (maybe switching
> the order of some checks).
>
> Best,
> Joris
>
> On Fri, 28 Aug 2020 at 01:10, Max Grossman <[email protected]> wrote:
>>
>> Hi all,
>>
>> Say I've got a simple program like the following that converts a numpy array
>> to a pyarrow array several times in a row, and times each of those
>> conversions:
>>
>> import pyarrow
>> import numpy as np
>> import time
>>
>> arr = np.random.rand(1)
>>
>> t1 = time.time()
>> pyarrow.array(arr)
>> t2 = time.time()
>> pyarrow.array(arr)
>> t3 = time.time()
>> pyarrow.array(arr)
>> t4 = time.time()
>> pyarrow.array(arr)
>> t5 = time.time()
>>
>> I'm noticing that the first call to pyarrow.array() is taking ~0.3-0.5 s
>> while the rest are nearly instantaneous (1e-05s).
>>
>> Does anyone know what might be causing this? My assumption is some one-time
>> initialization of pyarrow on the first call to the library, in which case
>> I'd like to see if there's some way to explicitly trigger that
>> initialization earlier in the program. But also curious to hear if there is
>> a different explanation.
>>
>> Right now I'm working around this by just calling pyarrow.array([]) at
>> application start up -- I realize this doesn't actually eliminate the added
>> time, but it does move it out of the critical section for any benchmarking
>> runs.
>>
>> Thanks,
>>
>> Max
