Hi Niranda, David,

I reran my benchmarks with the C++ code built against the shared libraries (.so) shipped with PyArrow, which should be built with optimizations. PyArrow 6.0.1 was installed from pip. The new results are in [1]; the numbers did not noticeably improve. My build configuration is in the Makefile [2], and I added a README [3] to make it easy for you to reproduce on your end.

Thanks.
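For anyone reproducing the build, here is a minimal sketch of how the include and library directories of a pip-installed PyArrow could be turned into compiler flags. `pyarrow.get_include()`, `pyarrow.get_library_dirs()` and `pyarrow.get_libraries()` are existing PyArrow helpers; the function names `compiler_flags` and `pyarrow_flags`, and the choice of `arrow_dataset` and `parquet` as extra libraries, are assumptions for illustration and are not taken from the Makefile. Pip wheels may ship only versioned library files, in which case `pyarrow.create_library_symlinks()` may be needed before `-larrow` resolves.

```python
def compiler_flags(include_dir, library_dirs, libraries):
    """Turn include/library locations into g++-style flags."""
    flags = [f"-I{include_dir}"]
    for d in library_dirs:
        # -L lets the linker find the library, rpath lets the binary find it at run time.
        flags.append(f"-L{d}")
        flags.append(f"-Wl,-rpath,{d}")
    flags.extend(f"-l{name}" for name in libraries)
    return flags


def pyarrow_flags(extra_libraries=("arrow_dataset", "parquet")):
    """Build flags from the PyArrow installed in the current environment.

    extra_libraries is a hypothetical default chosen for a Dataset/Parquet
    benchmark; adjust it to whatever the C++ code actually links against.
    """
    import pyarrow as pa

    libraries = list(pa.get_libraries())
    for name in extra_libraries:
        if name not in libraries:
            libraries.append(name)
    return compiler_flags(pa.get_include(), pa.get_library_dirs(), libraries)


if __name__ == "__main__":
    try:
        print(" ".join(pyarrow_flags()))
    except ImportError:
        print("pyarrow is not installed in this environment")
```

The printed flags can be pasted into a compile command or a Makefile variable; comparing them with the flags actually used is a quick way to confirm which libarrow the benchmark binary links against.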
[1] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench/optimized
[2] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/Makefile
[3] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/README.md

On Tue, Mar 1, 2022 at 10:04 AM Niranda Perera <[email protected]> wrote:

> Hi Jayeet,
>
> Could you try building your cpp project against the arrow.so in pyarrow
> installation? It should be in the lib directory in your python environment.
>
> Best
>
> On Tue, Mar 1, 2022 at 12:46 PM Jayjeet Chakraborty <
> [email protected]> wrote:
>
>> Thanks for your reply, David.
>>
>> 1) I used PyArrow 6.0.1 for both C++ and Python.
>> 2) The dataset was deployed using this [1] script.
>> 3) For C++, Arrow was built from source in release mode. You can see the
>> CMake config here [2].
>>
>> I think I need to test once with Arrow C++ installed from packages
>> instead of me building it. That might be the issue.
>>
>> [1]
>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/common/deploy_data.sh
>> [2]
>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/cpp
>>
>> Best,
>> Jayjeet
>>
>> On Tue, Mar 1, 2022 at 5:04 AM David Li <[email protected]> wrote:
>>
>>> Hi Jayjeet,
>>>
>>> That's odd since the Python API is just wrapping the C++ API, so they
>>> should be identical if everything is configured the same. (So is the Java
>>> API, incidentally.) That's effectively what the SO question is saying.
>>>
>>> What versions of PyArrow and Arrow are you using? Just to check the
>>> obvious things, was Arrow compiled with optimizations? And if we want to
>>> replicate this, is it possible to get the dataset?
>>>
>>> -David
>>>
>>> On Tue, Mar 1, 2022, at 01:52, Jayjeet Chakraborty wrote:
>>>
>>> Hi Arrow community,
>>>
>>> I was working on a class project for benchmarking Apache Arrow Dataset
>>> API in different programming languages. I found out that for some reason
>>> the C++ API example is slower than the Python API example. I ran my
>>> benchmarks on a 5 GB dataset consisting of 300 16MB parquet files. I tried
>>> my best to cross verify if all the parameters are similar in the Python and
>>> C++ examples. It would be great to know if someone had similar observations
>>> in the past and if the reason for this is known. I would really like to
>>> know more about this phenomenon. You can find the code and the results here
>>> [1]. I found a similar issue here [2] but I couldn't understand the exact
>>> reason. Thanks a lot for your help.
>>>
>>> [1]
>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench
>>>
>>> [2]
>>> https://stackoverflow.com/questions/67856457/reading-parquet-file-is-slower-in-c-than-in-python
>>>
>>> Best Regards,
>>> *Jayjeet Chakraborty*
>>> Ph.D. Student
>>> Department of Computer Science and Engineering
>>> University of California, Santa Cruz
>>>
>>> --
>>> *Jayjeet Chakraborty*
>>> B.Tech in Computer Sc. and Engineering
>>> National Institute Of Technology, Durgapur
>>> West Bengal, India
>>> M: (+91) 8436500886
>>>
>>
>> --
>> *Jayjeet Chakraborty*
>> B.Tech in Computer Sc. and Engineering
>> National Institute Of Technology, Durgapur
>> West Bengal, India
>> M: (+91) 8436500886
>>
>
> --
> Niranda Perera
> https://niranda.dev/
> @n1r44 <https://twitter.com/N1R44>
>

--
*Jayjeet Chakraborty*
B.Tech in Computer Sc. and Engineering
National Institute Of Technology, Durgapur
West Bengal, India
M: (+91) 8436500886
