[ https://issues.apache.org/jira/browse/ARROW-6529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-6529:
-----------------------------------------
    Description:
From https://stackoverflow.com/questions/57877017/pandas-feather-format-is-slow-when-writing-a-column-of-none

Smaller example with just using pyarrow, it seems that writing an array of nulls takes much longer than an array of for example ints, which seems a bit strange:

{code}
In [93]: arr = pa.array([None]*1000, type='int64')

In [94]: %%timeit
    ...: w = pyarrow.feather.FeatherWriter('__test.feather')
    ...: w.writer.write_array('x', arr)
    ...: w.writer.close()
31.4 µs ± 464 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [95]: arr = pa.array([None]*1000)

In [96]: arr
Out[96]:
<pyarrow.lib.NullArray object at 0x7fa47a23ca40>
1000 nulls

In [97]: %%timeit
    ...: w = pyarrow.feather.FeatherWriter('__test.feather')
    ...: w.writer.write_array('x', arr)
    ...: w.writer.close()
3.75 ms ± 64.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
{code}

So writing the same length NullArray takes ca 100x more time compared to an array of nulls but with Integer type.

  was:
From https://stackoverflow.com/questions/57877017/pandas-feather-format-is-slow-when-writing-a-column-of-none

Smaller example with just using pyarrow, it seems that writing an array of nulls takes much longer than an array of for example ints, which seems a bit strange:

{code}
In [93]: arr = pa.array([1]*1000)

In [94]: %%timeit
    ...: w = pyarrow.feather.FeatherWriter('__test.feather')
    ...: w.writer.write_array('x', arr)
    ...: w.writer.close()
31.4 µs ± 464 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [95]: arr = pa.array([None]*1000)

In [96]: arr
Out[96]:
<pyarrow.lib.NullArray object at 0x7fa47a23ca40>
1000 nulls

In [97]: %%timeit
    ...: w = pyarrow.feather.FeatherWriter('__test.feather')
    ...: w.writer.write_array('x', arr)
    ...: w.writer.close()
3.75 ms ± 64.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
{code}

So writing the same length NullArray takes ca 100x more time.
> [C++] Feather: slow writing of NullArray
> ----------------------------------------
>
>                 Key: ARROW-6529
>                 URL: https://issues.apache.org/jira/browse/ARROW-6529
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: feather
>
> From https://stackoverflow.com/questions/57877017/pandas-feather-format-is-slow-when-writing-a-column-of-none
> Smaller example with just using pyarrow, it seems that writing an array of
> nulls takes much longer than an array of for example ints, which seems a bit
> strange:
> {code}
> In [93]: arr = pa.array([None]*1000, type='int64')
> In [94]: %%timeit
>     ...: w = pyarrow.feather.FeatherWriter('__test.feather')
>     ...: w.writer.write_array('x', arr)
>     ...: w.writer.close()
> 31.4 µs ± 464 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> In [95]: arr = pa.array([None]*1000)
> In [96]: arr
> Out[96]:
> <pyarrow.lib.NullArray object at 0x7fa47a23ca40>
> 1000 nulls
> In [97]: %%timeit
>     ...: w = pyarrow.feather.FeatherWriter('__test.feather')
>     ...: w.writer.write_array('x', arr)
>     ...: w.writer.close()
> 3.75 ms ± 64.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> {code}
> So writing the same length NullArray takes ca 100x more time compared to an
> array of nulls but with Integer type.