I think you should try release build mode!

On Wed, Mar 2, 2022 at 12:21 PM Jayjeet Chakraborty <[email protected]> wrote:
> Thanks for all the help everyone. I was able to follow Niranda's steps and
> get the same perf in both C++ and Python. But I still don't know which are
> essential optimizations for compiling Arrow in C++. Can anyone please share
> some pointers on that? I think documenting the essential C++ optimizations
> in some way will help people in the future. Thanks again.
>
> On Tue, Mar 1, 2022 at 3:04 PM Weston Pace <[email protected]> wrote:
>
>> Does setting UseAsync on the C++ end make a difference? It's possible we
>> switched the default to async in python in 6.0.0 but not in C++.
>>
>> On Tue, Mar 1, 2022, 11:35 Niranda Perera <[email protected]> wrote:
>>
>>> Oh, I forgot to mention, had to fix LD_LIBRARY_PATH when running the c++
>>> executable.
>>> LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH ./dataset_bench
>>>
>>> On Tue, Mar 1, 2022 at 4:34 PM Niranda Perera <[email protected]> wrote:
>>>
>>>> @Jayeet,
>>>>
>>>> I ran your example in my desktop, and I don't see any timing issues
>>>> there. I used conda to install pyarrow==6.0.0
>>>> I used the following command
>>>> g++ -O3 -std=c++11 dataset_bench.cc -I"$CONDA_PREFIX"/include -L"$CONDA_PREFIX"/lib -larrow -larrow_dataset -lparquet -o dataset_bench
>>>>
>>>> And I had to del the objects in the python file, because it was getting
>>>> killed due to OOM.
>>>> ```
>>>> ...
>>>> for i in range(10):
>>>>     s = time.time()
>>>>     dataset_ = ds.dataset("/home/niranda/flight_dataset", format="parquet")
>>>>     table = dataset_.to_table(use_threads=False)
>>>>     e = time.time()
>>>>     print(e - s)
>>>>
>>>>     del table
>>>>     del dataset_
>>>>     gc.collect()
>>>> ```
>>>>
>>>> For me c++ takes around ~21s and python ~22s which is expected.
>>>>
>>>> On Tue, Mar 1, 2022 at 2:19 PM Jayjeet Chakraborty <[email protected]> wrote:
>>>>
>>>>> Hi Sasha,
>>>>>
>>>>> Thanks a lot for replying. I tried -O2 earlier but it didn't work. I
>>>>> tried it again (when compiling with PyArrow SO files) and unfortunately,
>>>>> it didn't improve the results.
>>>>>
>>>>> On Tue, Mar 1, 2022 at 11:14 AM Sasha Krassovsky <[email protected]> wrote:
>>>>>
>>>>>> Hi Jayjeet,
>>>>>> I noticed that you're not compiling dataset_bench with optimizations
>>>>>> enabled. I'm not sure how much it will help, but it may be worth adding
>>>>>> `-O2` to your g++ invocation.
>>>>>>
>>>>>> Sasha Krassovsky
>>>>>>
>>>>>> On Tue, Mar 1, 2022 at 11:11 AM Jayjeet Chakraborty <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Niranda, David,
>>>>>>>
>>>>>>> I ran my benchmarks again with the PyArrow .SO libraries which
>>>>>>> should be optimized. PyArrow version was 6.0.1 installed from pip.
>>>>>>> Here are my new results [1]. Numbers didn't quite seem to improve.
>>>>>>> You can check my build config in the Makefile [2]. I created a
>>>>>>> README [3] to make it easy for you to reproduce on your end. Thanks.
>>>>>>>
>>>>>>> [1] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench/optimized
>>>>>>> [2] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/Makefile
>>>>>>> [3] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/README.md
>>>>>>>
>>>>>>> On Tue, Mar 1, 2022 at 10:04 AM Niranda Perera <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Jayeet,
>>>>>>>>
>>>>>>>> Could you try building your cpp project against the arrow.so in
>>>>>>>> pyarrow installation? It should be in the lib directory in your
>>>>>>>> python environment.
>>>>>>>>
>>>>>>>> Best
>>>>>>>>
>>>>>>>> On Tue, Mar 1, 2022 at 12:46 PM Jayjeet Chakraborty <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks for your reply, David.
>>>>>>>>>
>>>>>>>>> 1) I used PyArrow 6.0.1 for both C++ and Python.
>>>>>>>>> 2) The dataset was deployed using this [1] script.
>>>>>>>>> 3) For C++, Arrow was built from source in release mode. You can
>>>>>>>>> see the CMake config here [2].
>>>>>>>>>
>>>>>>>>> I think I need to test once with Arrow C++ installed from packages
>>>>>>>>> instead of me building it. That might be the issue.
>>>>>>>>>
>>>>>>>>> [1] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/common/deploy_data.sh
>>>>>>>>> [2] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/cpp
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Jayjeet
>>>>>>>>>
>>>>>>>>> On Tue, Mar 1, 2022 at 5:04 AM David Li <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Jayjeet,
>>>>>>>>>>
>>>>>>>>>> That's odd since the Python API is just wrapping the C++ API, so
>>>>>>>>>> they should be identical if everything is configured the same.
>>>>>>>>>> (So is the Java API, incidentally.) That's effectively what the
>>>>>>>>>> SO question is saying.
>>>>>>>>>>
>>>>>>>>>> What versions of PyArrow and Arrow are you using? Just to check
>>>>>>>>>> the obvious things, was Arrow compiled with optimizations? And if
>>>>>>>>>> we want to replicate this, is it possible to get the dataset?
>>>>>>>>>>
>>>>>>>>>> -David
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 1, 2022, at 01:52, Jayjeet Chakraborty wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Arrow community,
>>>>>>>>>>
>>>>>>>>>> I was working on a class project for benchmarking Apache Arrow
>>>>>>>>>> Dataset API in different programming languages. I found out that
>>>>>>>>>> for some reason the C++ API example is slower than the Python API
>>>>>>>>>> example. I ran my benchmarks on a 5 GB dataset consisting of 300
>>>>>>>>>> 16MB parquet files. I tried my best to cross verify if all the
>>>>>>>>>> parameters are similar in the Python and C++ examples. It would
>>>>>>>>>> be great to know if someone had similar observations in the past
>>>>>>>>>> and if the reason for this is known. I would really like to know
>>>>>>>>>> more about this phenomenon. You can find the code and the results
>>>>>>>>>> here [1]. I found a similar issue here [2] but I couldn't
>>>>>>>>>> understand the exact reason. Thanks a lot for your help.
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench
>>>>>>>>>> [2] https://stackoverflow.com/questions/67856457/reading-parquet-file-is-slower-in-c-than-in-python
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>> *Jayjeet Chakraborty*
>>>>>>>>>> Ph.D. Student
>>>>>>>>>> Department of Computer Science and Engineering
>>>>>>>>>> University of California, Santa Cruz
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> *Jayjeet Chakraborty*
>>>>>>>>>> B.Tech in Computer Sc. and Engineering
>>>>>>>>>> National Institute Of Technology, Durgapur
>>>>>>>>>> West Bengal, India
>>>>>>>>>> M: (+91) 8436500886
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *Jayjeet Chakraborty*
>>>>>>>>> B.Tech in Computer Sc. and Engineering
>>>>>>>>> National Institute Of Technology, Durgapur
>>>>>>>>> West Bengal, India
>>>>>>>>> M: (+91) 8436500886
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Niranda Perera
>>>>>>>> https://niranda.dev/
>>>>>>>> @n1r44 <https://twitter.com/N1R44>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> *Jayjeet Chakraborty*
>>>>>>> B.Tech in Computer Sc. and Engineering
>>>>>>> National Institute Of Technology, Durgapur
>>>>>>> West Bengal, India
>>>>>>> M: (+91) 8436500886
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> *Jayjeet Chakraborty*
>>>>> CS PhD student
>>>>> UC Santa Cruz
>>>>> California, USA
>>>>>
>>>>
>>>> --
>>>> Niranda Perera
>>>> https://niranda.dev/
>>>> @n1r44 <https://twitter.com/N1R44>
>>>>
>>>
>>> --
>>> Niranda Perera
>>> https://niranda.dev/
>>> @n1r44 <https://twitter.com/N1R44>
>>>
>
> --
> *Jayjeet Chakraborty*
> CS PhD student
> UC Santa Cruz
> California, USA
>

--
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>
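
For anyone reproducing the comparison later, the timing loop quoted above (time each read, then del the result and call gc.collect() so memory does not pile up) can be sketched as a small standard-library harness. This is only an illustration: `time_runs` and `placeholder_read` are hypothetical names that do not appear in the thread, and the placeholder stands in for the real `ds.dataset(...).to_table(use_threads=False)` call so the sketch runs without pyarrow or the dataset.

```python
import gc
import time


def time_runs(read_fn, runs=10):
    """Call read_fn `runs` times and return the elapsed seconds of each call."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        result = read_fn()
        timings.append(time.perf_counter() - start)
        # Drop the result and collect garbage before the next run, so that
        # large tables from earlier runs do not accumulate in memory.
        del result
        gc.collect()
    return timings


def placeholder_read():
    # Hypothetical stand-in for the dataset read used in the thread:
    #   ds.dataset(path, format="parquet").to_table(use_threads=False)
    return [i * i for i in range(100_000)]


if __name__ == "__main__":
    for elapsed in time_runs(placeholder_read, runs=3):
        print(f"{elapsed:.4f}")
```

To time the real read, pass a function that wraps the pyarrow call in place of `placeholder_read`. Keeping the del and gc.collect() outside the timed region means the cleanup cost is not counted against the read itself.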
