Hi Max,

I assume (part of) the slowdown comes from importing pandas. If I add an "import pandas" to your script, the extra time on the first call is much smaller (although there is still some difference).
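
A rough sketch of that experiment, in case it helps to reproduce it. The helper names warm_up and time_calls are made up for this example and are not part of pyarrow. The sketch assumes pandas may or may not be installed and simply skips the warm-up if it is missing:

```python
import importlib
import time


def warm_up(module_names):
    """Eagerly import optional modules; return the names that were found."""
    loaded = []
    for name in module_names:
        try:
            importlib.import_module(name)
            loaded.append(name)
        except ImportError:
            pass  # optional dependency not installed, nothing to warm up
    return loaded


def time_calls(fn, arg, repeats=4):
    """Call fn(arg) `repeats` times; return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        durations.append(time.perf_counter() - start)
    return durations


# Pay the pandas import cost up front, outside the timed section.
warm_up(["pandas"])

try:
    import numpy as np
    import pyarrow
except ImportError:
    print("numpy/pyarrow not installed; skipping the timing run")
else:
    arr = np.random.rand(1)
    for i, seconds in enumerate(time_calls(pyarrow.array, arr), start=1):
        print(f"call {i}: {seconds:.6f} s")
```

Comparing the output with and without the warm_up line should show how much of the first-call time is the pandas import, at least on a setup similar to mine.
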
Inside the array function, we are lazily importing pandas to check if the input is a pandas object. I suppose that in theory, if the input is a numpy array, we should also be able to avoid this pandas import (maybe switching the order of some checks).

Best,
Joris

On Fri, 28 Aug 2020 at 01:10, Max Grossman <[email protected]> wrote:

> Hi all,
>
> Say I've got a simple program like the following that converts a numpy
> array to a pyarrow array several times in a row, and times each of those
> conversions:
>
> import pyarrow
> import numpy as np
> import time
>
> arr = np.random.rand(1)
>
> t1 = time.time()
> pyarrow.array(arr)
> t2 = time.time()
> pyarrow.array(arr)
> t3 = time.time()
> pyarrow.array(arr)
> t4 = time.time()
> pyarrow.array(arr)
> t5 = time.time()
>
> I'm noticing that the first call to pyarrow.array() is taking ~0.3-0.5 s
> while the rest are nearly instantaneous (1e-05s).
>
> Does anyone know what might be causing this? My assumption is some
> one-time initialization of pyarrow on the first call to the library, in
> which case I'd like to see if there's some way to explicitly trigger that
> initialization earlier in the program. But also curious to hear if there is
> a different explanation.
>
> Right now I'm working around this by just calling pyarrow.array([]) at
> application start up -- I realize this doesn't actually eliminate the added
> time, but it does move it out of the critical section for any benchmarking
> runs.
>
> Thanks,
>
> Max
