The Arrow C++ libraries also do some other one-time static initialization,
so we should find out whether the slowdown is all due to importing pandas
or something else

On Fri, Aug 28, 2020 at 1:48 AM Joris Van den Bossche
<[email protected]> wrote:
>
> Hi Max,
>
> I assume (part of) the slowdown comes from trying to import pandas. If I add 
> an "import pandas" to your script, the slowdown of the first run is much 
> smaller (although a difference remains).
>
> Inside the array function, we lazily import pandas to check whether the 
> input is a pandas object. I suppose that in theory, if the input is a numpy 
> array, we should also be able to avoid this pandas import (perhaps by 
> switching the order of some checks).
>
> Best,
> Joris
>
> On Fri, 28 Aug 2020 at 01:10, Max Grossman <[email protected]> wrote:
>>
>> Hi all,
>>
>> Say I've got a simple program like the following that converts a numpy array 
>> to a pyarrow array several times in a row, and times each of those 
>> conversions:
>>
>> import pyarrow
>> import numpy as np
>> import time
>>
>> arr = np.random.rand(1)
>>
>> t1 = time.time()
>> pyarrow.array(arr)
>> t2 = time.time()
>> pyarrow.array(arr)
>> t3 = time.time()
>> pyarrow.array(arr)
>> t4 = time.time()
>> pyarrow.array(arr)
>> t5 = time.time()
>>
>> print(t2 - t1, t3 - t2, t4 - t3, t5 - t4)
>>
>> I'm noticing that the first call to pyarrow.array() takes ~0.3-0.5 s, 
>> while the rest are nearly instantaneous (~1e-05 s).
>>
>> Does anyone know what might be causing this? My assumption is that some 
>> one-time initialization of pyarrow happens on the first call into the 
>> library, in which case I'd like to find a way to explicitly trigger that 
>> initialization earlier in the program. But I'm also curious to hear if 
>> there is a different explanation.
>>
>> Right now I'm working around this by calling pyarrow.array([]) at 
>> application startup -- I realize this doesn't actually eliminate the added 
>> time, but it does move it off the critical path for any benchmarking runs.
>>
>> Thanks,
>>
>> Max
