Hi Sasha,

Thanks a lot for replying. I tried -O2 earlier, but it didn't help. I tried it again (this time compiling against the PyArrow .so files) and, unfortunately, it still didn't improve the results.
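
For anyone reproducing this, here is a minimal sketch of what an optimized g++ invocation against the PyArrow shared libraries might look like. It is not the Makefile from the benchmark repository: the helper name `gxx_command`, the source file name `dataset_bench.cc`, the placeholder paths, and the list of linked libraries are all assumptions made for illustration. In a real environment, `pyarrow.get_include()` and `pyarrow.get_library_dirs()` report the actual header and library locations.

```python
import shlex


def gxx_command(source, output, include_dir, lib_dir, opt_level="-O2"):
    """Assemble a g++ command line that compiles `source` with optimizations
    and links it against Arrow shared libraries found in `lib_dir`."""
    return [
        "g++",
        opt_level,                # optimization level, as suggested in the thread
        source,
        f"-I{include_dir}",       # Arrow headers
        f"-L{lib_dir}",           # directory holding the shared libraries
        f"-Wl,-rpath,{lib_dir}",  # let the binary locate them at run time
        # The library names below are assumptions about what the benchmark links.
        "-larrow_dataset",
        "-lparquet",
        "-larrow",
        "-o",
        output,
    ]


# Hypothetical placeholder paths; substitute the ones from your own environment.
cmd = gxx_command(
    "dataset_bench.cc",
    "dataset_bench",
    "/path/to/site-packages/pyarrow/include",
    "/path/to/site-packages/pyarrow",
)
print(shlex.join(cmd))
```

If the pip wheel ships only versioned library file names, the linker may not resolve `-larrow` directly; `pyarrow.create_library_symlinks()` is meant to help with that, though whether it is needed depends on the installation.
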

On Tue, Mar 1, 2022 at 11:14 AM Sasha Krassovsky <[email protected]> wrote:

> Hi Jayjeet,
> I noticed that you're not compiling dataset_bench with optimizations
> enabled. I'm not sure how much it will help, but it may be worth adding
> `-O2` to your g++ invocation.
>
> Sasha Krassovsky
>
> On Tue, Mar 1, 2022 at 11:11 AM Jayjeet Chakraborty <[email protected]> wrote:
>
>> Hi Niranda, David,
>>
>> I ran my benchmarks again with the PyArrow .SO libraries which should be
>> optimized. PyArrow version was 6.0.1 installed from pip. Here are my new
>> results [1]. Numbers didn't quite seem to improve. You can check my build
>> config in the Makefile [2]. I created a README [3] to make it easy for you
>> to reproduce on your end. Thanks.
>>
>> [1] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench/optimized
>> [2] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/Makefile
>> [3] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/README.md
>>
>> On Tue, Mar 1, 2022 at 10:04 AM Niranda Perera <[email protected]> wrote:
>>
>>> Hi Jayeet,
>>>
>>> Could you try building your cpp project against the arrow.so in pyarrow
>>> installation? It should be in the lib directory in your python environment.
>>>
>>> Best
>>>
>>> On Tue, Mar 1, 2022 at 12:46 PM Jayjeet Chakraborty <[email protected]> wrote:
>>>
>>>> Thanks for your reply, David.
>>>>
>>>> 1) I used PyArrow 6.0.1 for both C++ and Python.
>>>> 2) The dataset was deployed using this [1] script.
>>>> 3) For C++, Arrow was built from source in release mode. You can see
>>>> the CMake config here [2].
>>>>
>>>> I think I need to test once with Arrow C++ installed from packages
>>>> instead of me building it. That might be the issue.
>>>>
>>>> [1] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/common/deploy_data.sh
>>>> [2] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/cpp
>>>>
>>>> Best,
>>>> Jayjeet
>>>>
>>>> On Tue, Mar 1, 2022 at 5:04 AM David Li <[email protected]> wrote:
>>>>
>>>>> Hi Jayjeet,
>>>>>
>>>>> That's odd since the Python API is just wrapping the C++ API, so they
>>>>> should be identical if everything is configured the same. (So is the Java
>>>>> API, incidentally.) That's effectively what the SO question is saying.
>>>>>
>>>>> What versions of PyArrow and Arrow are you using? Just to check the
>>>>> obvious things, was Arrow compiled with optimizations? And if we want to
>>>>> replicate this, is it possible to get the dataset?
>>>>>
>>>>> -David
>>>>>
>>>>> On Tue, Mar 1, 2022, at 01:52, Jayjeet Chakraborty wrote:
>>>>>
>>>>> Hi Arrow community,
>>>>>
>>>>> I was working on a class project for benchmarking Apache Arrow Dataset
>>>>> API in different programming languages. I found out that for some reason
>>>>> the C++ API example is slower than the Python API example. I ran my
>>>>> benchmarks on a 5 GB dataset consisting of 300 16MB parquet files. I tried
>>>>> my best to cross verify if all the parameters are similar in the Python and
>>>>> C++ examples. It would be great to know if someone had similar observations
>>>>> in the past and if the reason for this is known. I would really like to
>>>>> know more about this phenomenon. You can find the code and the results here
>>>>> [1]. I found a similar issue here [2] but I couldn't understand the exact
>>>>> reason. Thanks a lot for your help.
>>>>>
>>>>> [1] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench
>>>>> [2] https://stackoverflow.com/questions/67856457/reading-parquet-file-is-slower-in-c-than-in-python
>>>>>
>>>>> Best Regards,
>>>>> *Jayjeet Chakraborty*
>>>>> Ph.D. Student
>>>>> Department of Computer Science and Engineering
>>>>> University of California, Santa Cruz
>>>>>
>>>>> --
>>>>> *Jayjeet Chakraborty*
>>>>> B.Tech in Computer Sc. and Engineering
>>>>> National Institute Of Technology, Durgapur
>>>>> West Bengal, India
>>>>> M: (+91) 8436500886
>>>>>
>>>>
>>>> --
>>>> *Jayjeet Chakraborty*
>>>> B.Tech in Computer Sc. and Engineering
>>>> National Institute Of Technology, Durgapur
>>>> West Bengal, India
>>>> M: (+91) 8436500886
>>>>
>>>
>>> --
>>> Niranda Perera
>>> https://niranda.dev/
>>> @n1r44 <https://twitter.com/N1R44>
>>>
>>
>> --
>> *Jayjeet Chakraborty*
>> B.Tech in Computer Sc. and Engineering
>> National Institute Of Technology, Durgapur
>> West Bengal, India
>> M: (+91) 8436500886
>>

--
*Jayjeet Chakraborty*
CS PhD student
UC Santa Cruz
California, USA
