Hi Jayjeet,

I noticed that you're not compiling dataset_bench with optimizations enabled. I'm not sure how much it will help, but it may be worth adding `-O2` to your g++ invocation.
Sasha Krassovsky

On Tue, Mar 1, 2022 at 11:11 AM Jayjeet Chakraborty <[email protected]> wrote:

> Hi Niranda, David,
>
> I ran my benchmarks again with the PyArrow .SO libraries which should be
> optimized. PyArrow version was 6.0.1 installed from pip. Here are my new
> results [1]. Numbers didn't quite seem to improve. You can check my build
> config in the Makefile [2]. I created a README [3] to make it easy for you
> to reproduce on your end. Thanks.
>
> [1]
> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench/optimized
> [2]
> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/Makefile
> [3]
> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/README.md
>
> On Tue, Mar 1, 2022 at 10:04 AM Niranda Perera <[email protected]>
> wrote:
>
>> Hi Jayjeet,
>>
>> Could you try building your cpp project against the arrow.so in pyarrow
>> installation? It should be in the lib directory in your python environment.
>>
>> Best
>>
>> On Tue, Mar 1, 2022 at 12:46 PM Jayjeet Chakraborty <
>> [email protected]> wrote:
>>
>>> Thanks for your reply, David.
>>>
>>> 1) I used PyArrow 6.0.1 for both C++ and Python.
>>> 2) The dataset was deployed using this [1] script.
>>> 3) For C++, Arrow was built from source in release mode. You can see the
>>> CMake config here [2].
>>>
>>> I think I need to test once with Arrow C++ installed from packages
>>> instead of me building it. That might be the issue.
>>>
>>> [1]
>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/common/deploy_data.sh
>>> [2]
>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/cpp
>>>
>>> Best,
>>> Jayjeet
>>>
>>> On Tue, Mar 1, 2022 at 5:04 AM David Li <[email protected]> wrote:
>>>
>>>> Hi Jayjeet,
>>>>
>>>> That's odd since the Python API is just wrapping the C++ API, so they
>>>> should be identical if everything is configured the same. (So is the Java
>>>> API, incidentally.) That's effectively what the SO question is saying.
>>>>
>>>> What versions of PyArrow and Arrow are you using? Just to check the
>>>> obvious things, was Arrow compiled with optimizations? And if we want to
>>>> replicate this, is it possible to get the dataset?
>>>>
>>>> -David
>>>>
>>>> On Tue, Mar 1, 2022, at 01:52, Jayjeet Chakraborty wrote:
>>>>
>>>> Hi Arrow community,
>>>>
>>>> I was working on a class project for benchmarking Apache Arrow Dataset
>>>> API in different programming languages. I found out that for some reason
>>>> the C++ API example is slower than the Python API example. I ran my
>>>> benchmarks on a 5 GB dataset consisting of 300 16MB parquet files. I tried
>>>> my best to cross verify if all the parameters are similar in the Python and
>>>> C++ examples. It would be great to know if someone had similar observations
>>>> in the past and if the reason for this is known. I would really like to
>>>> know more about this phenomenon. You can find the code and the results here
>>>> [1]. I found a similar issue here [2] but I couldn't understand the exact
>>>> reason. Thanks a lot for your help.
>>>>
>>>> [1]
>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench
>>>>
>>>> [2]
>>>> https://stackoverflow.com/questions/67856457/reading-parquet-file-is-slower-in-c-than-in-python
>>>>
>>>> Best Regards,
>>>> *Jayjeet Chakraborty*
>>>> Ph.D. Student
>>>> Department of Computer Science and Engineering
>>>> University of California, Santa Cruz
>>>>
>>>> --
>>>> *Jayjeet Chakraborty*
>>>> B.Tech in Computer Sc. and Engineering
>>>> National Institute Of Technology, Durgapur
>>>> West Bengal, India
>>>> M: (+91) 8436500886
>>>>
>>>
>>> --
>>> *Jayjeet Chakraborty*
>>> B.Tech in Computer Sc. and Engineering
>>> National Institute Of Technology, Durgapur
>>> West Bengal, India
>>> M: (+91) 8436500886
>>>
>>
>> --
>> Niranda Perera
>> https://niranda.dev/
>> @n1r44 <https://twitter.com/N1R44>
>>
>
> --
> *Jayjeet Chakraborty*
> B.Tech in Computer Sc. and Engineering
> National Institute Of Technology, Durgapur
> West Bengal, India
> M: (+91) 8436500886
>
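
Regarding the `-O2` suggestion at the top of this thread: one quick way to confirm whether the benchmark binary itself was compiled with optimizations is to check the `__OPTIMIZE__` macro. This is a minimal sketch, assuming GCC or Clang, which predefine `__OPTIMIZE__` whenever optimization is enabled. The names `kOptimizedBuild` and `BuildModeLabel` are hypothetical and are not part of dataset_bench or Arrow.

```cpp
// Sketch: detect at compile time whether this translation unit was built
// with optimizations. Assumption: GCC or Clang, which predefine
// __OPTIMIZE__ when optimization is enabled (for example with -O2).
// Other compilers may not define this macro at all.
#include <cassert>
#include <string>

constexpr bool kOptimizedBuild =
#ifdef __OPTIMIZE__
    true;
#else
    false;
#endif

// Hypothetical helper: a label that can be printed next to benchmark
// timings so each result records how the binary was compiled.
std::string BuildModeLabel() {
  return kOptimizedBuild ? "optimized" : "unoptimized";
}
```

Printing `BuildModeLabel()` at the start of a run records the build mode alongside the timings. It only reflects how the benchmark's own source file was compiled, not how the linked Arrow libraries were built.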
