Does setting UseAsync on the C++ side make a difference? It's possible we switched the default to async in Python in 6.0.0 but not in C++.
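To compare the two settings on equal footing, a small timing helper along these lines might be useful. This is only a sketch using the standard library: `time_runs` is a hypothetical name, and the workload is passed in as a callable, so nothing here assumes a particular PyArrow signature. It follows the quoted loop in freeing each result and running the garbage collector between iterations.

```python
import gc
import time


def time_runs(fn, repeats=10):
    """Call fn() `repeats` times and return the wall-clock seconds of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        durations.append(time.perf_counter() - start)
        # Drop the result and collect garbage so memory does not pile up
        # across iterations (the quoted Python loop was killed by OOM otherwise).
        del result
        gc.collect()
    return durations


if __name__ == "__main__":
    # Stand-in workload (an assumption for illustration); replace the lambda
    # with the real dataset scan to be measured.
    timings = time_runs(lambda: sum(range(100_000)), repeats=3)
    for seconds in timings:
        print(f"{seconds:.4f}")
```

For the Python side, `fn` could wrap the quoted `ds.dataset(..., format="parquet").to_table(use_threads=False)` call. Timing each run individually, rather than only the total, also shows whether the first (cold cache) run differs from the rest.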

On Tue, Mar 1, 2022, 11:35 Niranda Perera <niranda.per...@gmail.com> wrote:

> Oh, I forgot to mention, had to fix LD_LIBRARY_PATH when running the c++
> executable.
> LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH ./dataset_bench
>
> On Tue, Mar 1, 2022 at 4:34 PM Niranda Perera <niranda.per...@gmail.com>
> wrote:
>
>> @Jayeet,
>>
>> I ran your example in my desktop, and I don't see any timing issues
>> there. I used conda to install pyarrow==6.0.0
>> I used the following command
>> g++ -O3 -std=c++11 dataset_bench.cc -I"$CONDA_PREFIX"/include
>> -L"$CONDA_PREFIX"/lib -larrow -larrow_dataset -lparquet -o dataset_bench
>>
>> And I had to del the objects in the python file, because it was getting
>> killed due to OOM.
>> ```
>> ...
>> for i in range(10):
>>     s = time.time()
>>     dataset_ = ds.dataset("/home/niranda/flight_dataset",
>>                           format="parquet")
>>     table = dataset_.to_table(use_threads=False)
>>     e = time.time()
>>     print(e - s)
>>
>>     del table
>>     del dataset_
>>     gc.collect()
>> ```
>>
>> For me c++ takes around ~21s and python ~22s which is expected.
>>
>>
>> On Tue, Mar 1, 2022 at 2:19 PM Jayjeet Chakraborty <
>> jayjeetchakrabort...@gmail.com> wrote:
>>
>>> Hi Sasha,
>>>
>>> Thanks a lot for replying. I tried -O2 earlier but it didn't work. I
>>> tried it again (when compiling with PyArrow SO files) and
>>> unfortunately, it didn't improve the results.
>>>
>>> On Tue, Mar 1, 2022 at 11:14 AM Sasha Krassovsky <
>>> krassovskysa...@gmail.com> wrote:
>>>
>>>> Hi Jayjeet,
>>>> I noticed that you're not compiling dataset_bench with optimizations
>>>> enabled. I'm not sure how much it will help, but it may be worth
>>>> adding `-O2` to your g++ invocation.
>>>>
>>>> Sasha Krassovsky
>>>>
>>>> On Tue, Mar 1, 2022 at 11:11 AM Jayjeet Chakraborty <
>>>> jayjeetchakrabort...@gmail.com> wrote:
>>>>
>>>>> Hi Niranda, David,
>>>>>
>>>>> I ran my benchmarks again with the PyArrow .SO libraries which should
>>>>> be optimized. PyArrow version was 6.0.1 installed from pip. Here are
>>>>> my new results [1]. Numbers didn't quite seem to improve. You can
>>>>> check my build config in the Makefile [2]. I created a README [3] to
>>>>> make it easy for you to reproduce on your end. Thanks.
>>>>>
>>>>> [1]
>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench/optimized
>>>>> [2]
>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/Makefile
>>>>> [3]
>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/README.md
>>>>>
>>>>> On Tue, Mar 1, 2022 at 10:04 AM Niranda Perera <
>>>>> niranda.per...@gmail.com> wrote:
>>>>>
>>>>>> Hi Jayeet,
>>>>>>
>>>>>> Could you try building your cpp project against the arrow.so in
>>>>>> pyarrow installation? It should be in the lib directory in your
>>>>>> python environment.
>>>>>>
>>>>>> Best
>>>>>>
>>>>>> On Tue, Mar 1, 2022 at 12:46 PM Jayjeet Chakraborty <
>>>>>> jayjeetchakrabort...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for your reply, David.
>>>>>>>
>>>>>>> 1) I used PyArrow 6.0.1 for both C++ and Python.
>>>>>>> 2) The dataset was deployed using this [1] script.
>>>>>>> 3) For C++, Arrow was built from source in release mode. You can see
>>>>>>> the CMake config here [2].
>>>>>>>
>>>>>>> I think I need to test once with Arrow C++ installed from packages
>>>>>>> instead of me building it. That might be the issue.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/common/deploy_data.sh
>>>>>>> [2]
>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/cpp
>>>>>>>
>>>>>>> Best,
>>>>>>> Jayjeet
>>>>>>>
>>>>>>> On Tue, Mar 1, 2022 at 5:04 AM David Li <lidav...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hi Jayjeet,
>>>>>>>>
>>>>>>>> That's odd since the Python API is just wrapping the C++ API, so
>>>>>>>> they should be identical if everything is configured the same. (So
>>>>>>>> is the Java API, incidentally.) That's effectively what the SO
>>>>>>>> question is saying.
>>>>>>>>
>>>>>>>> What versions of PyArrow and Arrow are you using? Just to check the
>>>>>>>> obvious things, was Arrow compiled with optimizations? And if we
>>>>>>>> want to replicate this, is it possible to get the dataset?
>>>>>>>>
>>>>>>>> -David
>>>>>>>>
>>>>>>>> On Tue, Mar 1, 2022, at 01:52, Jayjeet Chakraborty wrote:
>>>>>>>>
>>>>>>>> Hi Arrow community,
>>>>>>>>
>>>>>>>> I was working on a class project for benchmarking Apache Arrow
>>>>>>>> Dataset API in different programming languages. I found out that
>>>>>>>> for some reason the C++ API example is slower than the Python API
>>>>>>>> example. I ran my benchmarks on a 5 GB dataset consisting of 300
>>>>>>>> 16MB parquet files. I tried my best to cross verify if all the
>>>>>>>> parameters are similar in the Python and C++ examples. It would be
>>>>>>>> great to know if someone had similar observations in the past and
>>>>>>>> if the reason for this is known. I would really like to know more
>>>>>>>> about this phenomenon. You can find the code and the results here
>>>>>>>> [1]. I found a similar issue here [2] but I couldn't understand the
>>>>>>>> exact reason. Thanks a lot for your help.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench
>>>>>>>>
>>>>>>>> [2]
>>>>>>>> https://stackoverflow.com/questions/67856457/reading-parquet-file-is-slower-in-c-than-in-python
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> *Jayjeet Chakraborty*
>>>>>>>> Ph.D. Student
>>>>>>>> Department of Computer Science and Engineering
>>>>>>>> University of California, Santa Cruz
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Jayjeet Chakraborty*
>>>>>>>> B.Tech in Computer Sc. and Engineering
>>>>>>>> National Institute Of Technology, Durgapur
>>>>>>>> West Bengal, India
>>>>>>>> M: (+91) 8436500886
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Niranda Perera
>>>>>> https://niranda.dev/
>>>>>> @n1r44 <https://twitter.com/N1R44>
>>>>>>
>>>
>>> --
>>> *Jayjeet Chakraborty*
>>> CS PhD student
>>> UC Santa Cruz
>>> California, USA
>>>