[ https://issues.apache.org/jira/browse/ARROW-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17238201#comment-17238201 ]
Gert Hulselmans commented on ARROW-10056:
-----------------------------------------

I am not sure that I am hitting the 2GB FlatBuffer limit, as that would imply that each column (assuming we have 1 million columns) would occupy more than 2 kB of space.

It seems more likely to me that I hit the max_depth = 64 or max_tables = 1000000 limit.

According to https://groups.google.com/g/flatbuffers/c/JtDGnBPx9is, it seems like these parameters are changeable:

{code:c++}
/// To expand the capacity of a single buffer, _max_tables is set to 10000000
flatbuffers::uoffset_t _max_depth = 64;
flatbuffers::uoffset_t _max_tables = 10000000;
flatbuffers::Verifier verifier(buf, bufsize, _max_depth, _max_tables);
OK = OK && grl::flatbuffer::VerifyKUKAiiwaStatesBuffer(verifier);
{code}

> [Python] PyArrow writes invalid Feather v2 file: OSError: Verification of flatbuffer-encoded Footer failed.
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10056
>                 URL: https://issues.apache.org/jira/browse/ARROW-10056
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>         Environment: CentOS7
>                      conda environment with pyarrow 1.0.1, numpy 1.19.1 and pandas 1.1.1
>            Reporter: Gert Hulselmans
>            Priority: Major
>             Fix For: 3.0.0
>
>
> pyarrow writes an invalid Feather v2 file, which it can't read afterwards.
> {code:java}
> OSError: Verification of flatbuffer-encoded Footer failed.
> {code}
> The following code reproduces the problem for me:
> {code:python}
> import pyarrow as pa
> import pyarrow.feather as pf
> import numpy as np
> import pandas as pd
> nbr_regions = 1223024
> nbr_motifs = 4891
> # Create (big) dataframe.
> df = pd.DataFrame(
>     np.arange(nbr_regions * nbr_motifs, dtype=np.float32).reshape((nbr_regions, nbr_motifs)),
>     index=pd.Index(['region' + str(i) for i in range(nbr_regions)], name='regions'),
>     columns=pd.Index(['motif' + str(i) for i in range(nbr_motifs)], name='motifs')
> )
> # Transpose dataframe
> df_transposed = df.transpose()
> # Write transposed dataframe to Feather v2 format.
> pf.write_feather(df_transposed, 'df_transposed.feather')
> # Trying to read the transposed dataframe from Feather v2 format, results in this error:
> df_transposed_read = pf.read_feather('df_transposed.feather')
> {code}
> {code:python}
> ---------------------------------------------------------------------------
> OSError                                   Traceback (most recent call last)
> <ipython-input-64-b41ad5157e77> in <module>
> ----> 1 df_transposed_read = pf.read_feather('df_transposed.feather')
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
>     213     """
>     214     _check_pandas_version()
> --> 215     return (read_table(source, columns=columns, memory_map=memory_map)
>     216             .to_pandas(use_threads=use_threads))
>     217
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map)
>     235     """
>     236     reader = ext.FeatherReader()
> --> 237     reader.open(source, use_memory_map=memory_map)
>     238
>     239     if columns is None:
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.open()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> OSError: Verification of flatbuffer-encoded Footer failed.
> {code}
> Later I discovered that it happens also if the original dataframe is created in the transposed order:
> {code:python}
> # Create (big) dataframe.
> df_without_transpose = pd.DataFrame(
>     np.arange(nbr_motifs * nbr_regions, dtype=np.float32).reshape((nbr_motifs, nbr_regions)),
>     index=pd.Index(['motif' + str(i) for i in range(nbr_motifs)], name='motifs'),
>     columns=pd.Index(['region' + str(i) for i in range(nbr_regions)], name='regions'),
> )
> pf.write_feather(df_without_transpose, 'df_without_transpose.feather')
> df_without_transpose_read = pf.read_feather('df_without_transpose.feather')
> ---------------------------------------------------------------------------
> OSError                                   Traceback (most recent call last)
> <ipython-input-91-3cdad1d58c35> in <module>
> ----> 1 df_without_transpose_read = pf.read_feather('df_without_transpose.feather')
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
>     213     """
>     214     _check_pandas_version()
> --> 215     return (read_table(source, columns=columns, memory_map=memory_map)
>     216             .to_pandas(use_threads=use_threads))
>     217
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map)
>     235     """
>     236     reader = ext.FeatherReader()
> --> 237     reader.open(source, use_memory_map=memory_map)
>     238
>     239     if columns is None:
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.open()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> OSError: Verification of flatbuffer-encoded Footer failed.
> {code}
> Writing to Feather v1 format works:
> {code:python}
> pf.write_feather(df_transposed, 'df_transposed.v1.feather', version=1)
> df_transposed_read_v1 = pf.read_feather('df_transposed.v1.feather')
> # Now do the same, but also save the index in the Feather v1 file.
> df_transposed_reset_index = df_transposed.reset_index()
> pf.write_feather(df_transposed_reset_index, 'df_transposed_reset_index.v1.feather', version=1)
> df_transposed_reset_index_read_v1 = pf.read_feather('df_transposed_reset_index.v1.feather')
> # Returns True
> df_transposed_reset_index_read_v1.equals(df_transposed)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
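
The size argument in the comment at the top of this message can be checked with a few lines of arithmetic. This is only a rough sketch using the figures quoted in the thread: the constant names are made up for illustration, and the idea that every column contributes at least one FlatBuffers table to the footer is an assumption, not something confirmed here.

```python
# Figures quoted in the comment and in the reproduction script.
FLATBUFFER_MAX_BYTES = 2**31      # the "2GB" FlatBuffer limit
DEFAULT_MAX_TABLES = 1_000_000    # verifier max_tables value named in the comment
NBR_REGIONS = 1_223_024           # number of columns after the transpose

# Footer bytes each column would need before one million columns reach 2GB.
bytes_per_column = FLATBUFFER_MAX_BYTES / 1_000_000
print(f"{bytes_per_column:.0f} bytes per column")

# Assumption: each column adds at least one table to the footer, so the
# column count alone already exceeds the default table limit.
exceeds_table_limit = NBR_REGIONS > DEFAULT_MAX_TABLES
print(exceeds_table_limit)
```

Under that assumption the table count is a more plausible culprit than the buffer size, which matches the reasoning in the comment.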