Hi Alex,

It looks like the mallocs are coming from Thrift
(parquet/parquet_types.cpp is generated by Thrift). I'm not sure if we
can do much about this. I'm curious if it's possible to pass a custom
STL allocator to Thrift so we could use a different allocation
strategy than the default STL allocator.
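
To make the idea concrete, here is a rough sketch of the STL side only.
Arena, ArenaAllocator, StringVector and DemoArenaAllocations are
hypothetical names, not part of Thrift or parquet-cpp. Whether the
Thrift code generator can be made to emit containers with a non-default
allocator is still an open question. The sketch only shows that a
std::vector<std::string> like the one in frame #5 of the backtrace can
take its storage from a preallocated arena instead of calling malloc on
each resize:

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>
#include <vector>

// Hypothetical bump-pointer arena: one upfront allocation, then each
// request is served by advancing an offset. Memory is released only when
// the arena itself is destroyed.
class Arena {
 public:
  explicit Arena(std::size_t capacity)
      : buffer_(static_cast<char*>(::operator new(capacity))),
        capacity_(capacity), offset_(0), allocations_(0) {}
  ~Arena() { ::operator delete(buffer_); }
  Arena(const Arena&) = delete;
  Arena& operator=(const Arena&) = delete;

  void* Allocate(std::size_t bytes, std::size_t alignment) {
    assert(alignment != 0);
    // Round the current offset up to the requested alignment.
    std::size_t aligned = (offset_ + alignment - 1) / alignment * alignment;
    if (aligned + bytes > capacity_) throw std::bad_alloc();
    offset_ = aligned + bytes;
    ++allocations_;
    return buffer_ + aligned;
  }

  std::size_t allocations() const { return allocations_; }
  std::size_t bytes_used() const { return offset_; }

 private:
  char* buffer_;
  std::size_t capacity_;
  std::size_t offset_;
  std::size_t allocations_;
};

// Minimal C++11 allocator that forwards to an Arena (assumption: the
// arena outlives every container that uses it).
template <typename T>
class ArenaAllocator {
 public:
  using value_type = T;

  explicit ArenaAllocator(Arena* arena) noexcept : arena_(arena) {}
  template <typename U>
  ArenaAllocator(const ArenaAllocator<U>& other) noexcept
      : arena_(other.arena()) {}

  T* allocate(std::size_t n) {
    return static_cast<T*>(arena_->Allocate(n * sizeof(T), alignof(T)));
  }
  // Individual frees are no-ops; the arena frees everything at once.
  void deallocate(T*, std::size_t) noexcept {}

  Arena* arena() const noexcept { return arena_; }

 private:
  Arena* arena_;
};

template <typename T, typename U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
  return a.arena() == b.arena();
}
template <typename T, typename U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
  return !(a == b);
}

using StringVector = std::vector<std::string, ArenaAllocator<std::string>>;

// Mirrors the resize(3) call from the backtrace and reports how many
// requests the arena served for it.
std::size_t DemoArenaAllocations() {
  Arena arena(4096);
  ArenaAllocator<std::string> alloc(&arena);
  StringVector path_in_schema(alloc);
  path_in_schema.resize(3);  // storage comes from the arena, not malloc
  return arena.allocations();
}
```

Note that the std::string elements here still use the default allocator
for their own character data, so strings too long for the small-string
buffer would still hit malloc unless the string type is parameterized
as well.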

- Wes

On Mon, Jul 30, 2018 at 1:54 PM, ALeX Wang <[email protected]> wrote:
> Hi,
>
> I'm reading a Parquet file (generated by the Java Parquet library).  Our
> schema has 400 columns (including both non-array elements and
> 1-dimensional array elements).
>
> I'm using release 1.3.1, gcc 4.8.5, and the Boost 1.53 static library.
>
> I compile parquet-cpp with the following cmake options:
> ```
> cmake3    -DCMAKE_BUILD_TYPE=Debug     -DPARQUET_BUILD_EXAMPLES=OFF
>  -DPARQUET_BUILD_TESTS=OFF     -DPARQUET_ARROW_LINKAGE="static"
>  -DPARQUET_BUILD_SHARED=OFF     -DPARQUET_BOOST_USE_SHARED=OFF .
> ```
>
> One thing we noticed is that the C++ library performs a lot of small
> mallocs while opening the file and reading the metadata, as shown
> below:
>
> ```
> (gdb) where
> #0  0x00007fdf40594801 in malloc () from /lib64/libc.so.6
> #1  0x00007fdf40e52ecd in operator new(unsigned long) () from
> /lib64/libstdc++.so.6
> #2  0x0000000000ea16c0 in __gnu_cxx::new_allocator<std::string>::allocate
> (this=0x33e6930, __n=3) at /usr/include/c++/4.8.2/ext/new_allocator.h:104
> #3  0x0000000000e9eabb in std::_Vector_base<std::string,
> std::allocator<std::string> >::_M_allocate (this=0x33e6930, __n=3) at
> /usr/include/c++/4.8.2/bits/stl_vector.h:168
> #4  0x0000000000ecf512 in std::vector<std::string,
> std::allocator<std::string> >::_M_default_append (this=0x33e6930, __n=3) at
> /usr/include/c++/4.8.2/bits/vector.tcc:549
> #5  0x0000000000eca887 in std::vector<std::string,
> std::allocator<std::string> >::resize (this=0x33e6930, __new_size=3) at
> /usr/include/c++/4.8.2/bits/stl_vector.h:667
> #6  0x0000000000ebd589 in parquet::format::ColumnMetaData::read
> (this=0x33e6908, iprot=0x3337300) at
> /opt/parquet-cpp/src/parquet/parquet_types.cpp:3845
> #7  0x0000000000ebf9ed in parquet::format::ColumnChunk::read
> (this=0x33e68f0, iprot=0x3337300) at
> /opt/parquet-cpp/src/parquet/parquet_types.cpp:4246
> #8  0x0000000000ec0cd2 in parquet::format::RowGroup::read (this=0x33cf7c0,
> iprot=0x3337300) at /opt/parquet-cpp/src/parquet/parquet_types.cpp:4451
> #9  0x0000000000ec4e22 in parquet::format::FileMetaData::read
> (this=0x3337270, iprot=0x3337300) at
> /opt/parquet-cpp/src/parquet/parquet_types.cpp:5385
> #10 0x0000000000e9364d in
> parquet::DeserializeThriftMsg<parquet::format::FileMetaData>
> (buf=0x7fdf2cace040 "\025\002\031\374\313\004H\bsessions\025\374\005",
> len=0x7ffc8c96ff34, deserialized_msg=0x3337270) at
> /opt/parquet-cpp/src/parquet/thrift.h:119
> #11 0x0000000000e8fda5 in
> parquet::FileMetaData::FileMetaDataImpl::FileMetaDataImpl (this=0x3302fb0,
> metadata=0x7fdf2cace040 "\025\002\031\374\313\004H\bsessions\025\374\005",
> metadata_len=0x7ffc8c96ff34) at /opt/parquet-cpp/src/parquet/metadata.cc:303
> #12 0x0000000000e8bf4f in parquet::FileMetaData::FileMetaData
> (this=0x31a4ca0, metadata=0x7fdf2cace040
> "\025\002\031\374\313\004H\bsessions\025\374\005",
> metadata_len=0x7ffc8c96ff34) at /opt/parquet-cpp/src/parquet/metadata.cc:403
> #13 0x0000000000e8bee3 in parquet::FileMetaData::Make
> (metadata=0x7fdf2cace040 "\025\002\031\374\313\004H\bsessions\025\374\005",
> metadata_len=0x7ffc8c96ff34) at /opt/parquet-cpp/src/parquet/metadata.cc:398
> #14 0x0000000000e87572 in parquet::SerializedFile::ParseMetaData
> (this=0x3241450) at /opt/parquet-cpp/src/parquet/file_reader.cc:213
> #15 0x0000000000e858d4 in parquet::ParquetFileReader::Contents::Open
> (source=std::unique_ptr<parquet::RandomAccessSource> containing 0x0,
> props=..., metadata=std::shared_ptr (empty) 0x0) at
> /opt/parquet-cpp/src/parquet/file_reader.cc:247
> #16 0x0000000000e85a6f in parquet::ParquetFileReader::Open
> (source=std::unique_ptr<parquet::RandomAccessSource> containing 0x0,
> props=..., metadata=std::shared_ptr (empty) 0x0) at
> /opt/parquet-cpp/src/parquet/file_reader.cc:265
> #17 0x0000000000e859ba in parquet::ParquetFileReader::Open
> (source=std::shared_ptr (count 2, weak 0) 0x32e2e80, props=...,
> metadata=std::shared_ptr (empty) 0x0) at
> /opt/parquet-cpp/src/parquet/file_reader.cc:259
> #18 0x0000000000e85df4 in parquet::ParquetFileReader::OpenFile
> (path="/data-slow/data0/test_parquet_file/seg0/0_1530129731023-1530136801030",
> memory_map=false, props=..., metadata=std::shared_ptr (empty) 0x0) at
> /opt/parquet-cpp/src/parquet/file_reader.cc:287
>
> (gdb) info br
> Num     Type           Disp Enb Address            What
> 1       breakpoint     keep y   <MULTIPLE>
>         breakpoint already hit 2679 times
>         ignore next 2321 hits
> ```
>
> I set the breakpoint on `malloc`, as shown above.
>
> This seems to be the case regardless of the mmap option.
>
> I would really appreciate some pointers on how to avoid this.
>
> Thanks,
> Alex Wang,
>
> --
> Alex Wang,
> Open vSwitch developer
