On Wed, 2 Mar 2022 09:20:50 -0800
Jayjeet Chakraborty <[email protected]> wrote:
> Thanks for all the help, everyone. I was able to follow Niranda's steps and
> get the same perf in both C++ and Python. But I still don't know which
> optimizations are essential when compiling Arrow in C++. Can anyone please
> share some pointers on that? I think documenting the essential C++
> optimizations in some way will help people in the future. Thanks again.

You should simply compile in release mode and appropriate compilation
options will be selected for you ("cmake -DCMAKE_BUILD_TYPE=Release
...").
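
As a rough illustration of the above, a Release configure-and-build can
be scripted as below. This is only a minimal sketch, not an official
Arrow recipe: the helper names and the example paths are hypothetical,
and it assumes CMake 3.13 or newer for the -S/-B options. By default it
only prints the commands instead of running them.

```python
import shlex
import subprocess


def release_build_commands(source_dir, build_dir):
    """Return the CMake configure and build commands for a Release build."""
    configure = [
        "cmake",
        "-S", source_dir,  # directory containing the top-level CMakeLists.txt
        "-B", build_dir,   # out-of-source build directory
        "-DCMAKE_BUILD_TYPE=Release",  # let CMake pick the optimization flags
    ]
    build = ["cmake", "--build", build_dir]
    return [configure, build]


def run_release_build(source_dir, build_dir, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in release_build_commands(source_dir, build_dir):
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

For example, run_release_build("arrow/cpp", "arrow/cpp/build") (paths
assumed for illustration) prints the two commands; pass dry_run=False to
actually run them.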

Regards

Antoine.


> 
> On Tue, Mar 1, 2022 at 3:04 PM Weston Pace <[email protected]> wrote:
> 
> > Does setting UseAsync on the C++ end make a difference?  It's possible we
> > switched the default to async in python in 6.0.0 but not in C++.
> >
> > On Tue, Mar 1, 2022, 11:35 Niranda Perera <[email protected]>
> > wrote:
> >  
> >> Oh, I forgot to mention, I had to fix LD_LIBRARY_PATH when running the
> >> C++ executable:
> >> LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH ./dataset_bench
> >>
> >> On Tue, Mar 1, 2022 at 4:34 PM Niranda Perera <[email protected]>
> >> wrote:
> >>  
> >>> @Jayjeet,
> >>>
> >>> I ran your example on my desktop, and I don't see any timing issues
> >>> there. I used conda to install pyarrow==6.0.0.
> >>> I used the following command:
> >>> g++ -O3 -std=c++11 dataset_bench.cc -I"$CONDA_PREFIX"/include \
> >>>   -L"$CONDA_PREFIX"/lib -larrow -larrow_dataset -lparquet -o dataset_bench
> >>>
> >>> And I had to del the objects in the python file, because it was getting
> >>> killed due to OOM.
> >>> ```
> >>> ...
> >>>     for i in range(10):
> >>>         s = time.time()
> >>>         dataset_ = ds.dataset("/home/niranda/flight_dataset", format="parquet")
> >>>         table = dataset_.to_table(use_threads=False)
> >>>         e = time.time()
> >>>         print(e - s)
> >>>
> >>>         del table
> >>>         del dataset_
> >>>         gc.collect()
> >>> ```
> >>>
> >>> For me, C++ takes ~21s and Python ~22s, which is expected.
> >>>
> >>>
> >>> On Tue, Mar 1, 2022 at 2:19 PM Jayjeet Chakraborty <  
> >>> [email protected]> wrote:  
> >>>  
> >>>> Hi Sasha,
> >>>>
> >>>> Thanks a lot for replying. I tried -O2 earlier but it didn't help. I
> >>>> tried it again (when compiling with PyArrow SO files) and, unfortunately,
> >>>> it didn't improve the results.
> >>>>
> >>>> On Tue, Mar 1, 2022 at 11:14 AM Sasha Krassovsky <  
> >>>> [email protected]> wrote:  
> >>>>  
> >>>>> Hi Jayjeet,
> >>>>> I noticed that you're not compiling dataset_bench with optimizations
> >>>>> enabled. I'm not sure how much it will help, but it may be worth adding
> >>>>> `-O2` to your g++ invocation.
> >>>>>
> >>>>> Sasha Krassovsky
> >>>>>
> >>>>> On Tue, Mar 1, 2022 at 11:11 AM Jayjeet Chakraborty <  
> >>>>> [email protected]> wrote:  
> >>>>>  
> >>>>>> Hi Niranda, David,
> >>>>>>
> >>>>>> I ran my benchmarks again with the PyArrow .so libraries, which should
> >>>>>> be optimized. The PyArrow version was 6.0.1, installed from pip. Here are
> >>>>>> my new results [1]. The numbers didn't seem to improve. You can check my
> >>>>>> build config in the Makefile [2]. I created a README [3] to make it easy
> >>>>>> for you to reproduce on your end. Thanks.
> >>>>>>
> >>>>>> [1]
> >>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench/optimized
> >>>>>> [2]
> >>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/Makefile
> >>>>>> [3]
> >>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/README.md
> >>>>>>
> >>>>>> On Tue, Mar 1, 2022 at 10:04 AM Niranda Perera <  
> >>>>>> [email protected]> wrote:  
> >>>>>>  
> >>>>>>> Hi Jayjeet,
> >>>>>>>
> >>>>>>> Could you try building your cpp project against the arrow.so in
> >>>>>>> pyarrow installation? It should be in the lib directory in your python
> >>>>>>> environment.
> >>>>>>>
> >>>>>>> Best
> >>>>>>>
> >>>>>>> On Tue, Mar 1, 2022 at 12:46 PM Jayjeet Chakraborty <  
> >>>>>>> [email protected]> wrote:  
> >>>>>>>  
> >>>>>>>> Thanks for your reply, David.
> >>>>>>>>
> >>>>>>>> 1) I used PyArrow 6.0.1 for both C++ and Python.
> >>>>>>>> 2) The dataset was deployed using this [1] script.
> >>>>>>>> 3) For C++, Arrow was built from source in release mode. You can
> >>>>>>>> see the CMake config here [2].
> >>>>>>>>
> >>>>>>>> I think I need to test once with Arrow C++ installed from packages
> >>>>>>>> instead of me building it. That might be the issue.
> >>>>>>>>
> >>>>>>>> [1]
> >>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/common/deploy_data.sh
> >>>>>>>> [2]
> >>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/cpp
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Jayjeet
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Mar 1, 2022 at 5:04 AM David Li <[email protected]>
> >>>>>>>> wrote:
> >>>>>>>>  
> >>>>>>>>> Hi Jayjeet,
> >>>>>>>>>
> >>>>>>>>> That's odd since the Python API is just wrapping the C++ API, so
> >>>>>>>>> they should be identical if everything is configured the same. (So
> >>>>>>>>> is the Java API, incidentally.) That's effectively what the SO
> >>>>>>>>> question is saying.
> >>>>>>>>>
> >>>>>>>>> What versions of PyArrow and Arrow are you using? Just to check
> >>>>>>>>> the obvious things, was Arrow compiled with optimizations? And if
> >>>>>>>>> we want to replicate this, is it possible to get the dataset?
> >>>>>>>>>
> >>>>>>>>> -David
> >>>>>>>>>
> >>>>>>>>> On Tue, Mar 1, 2022, at 01:52, Jayjeet Chakraborty wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Arrow community,
> >>>>>>>>>
> >>>>>>>>> I was working on a class project for benchmarking the Apache Arrow
> >>>>>>>>> Dataset API in different programming languages. I found that, for
> >>>>>>>>> some reason, the C++ API example is slower than the Python API
> >>>>>>>>> example. I ran my benchmarks on a 5 GB dataset consisting of 300
> >>>>>>>>> 16 MB Parquet files. I tried my best to cross-verify that all the
> >>>>>>>>> parameters are the same in the Python and C++ examples. It would be
> >>>>>>>>> great to know if someone has made similar observations in the past
> >>>>>>>>> and whether the reason for this is known. I would really like to
> >>>>>>>>> know more about this phenomenon. You can find the code and the
> >>>>>>>>> results here [1]. I found a similar issue here [2] but I couldn't
> >>>>>>>>> understand the exact reason. Thanks a lot for your help.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> [1]
> >>>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench
> >>>>>>>>>
> >>>>>>>>> [2]
> >>>>>>>>> https://stackoverflow.com/questions/67856457/reading-parquet-file-is-slower-in-c-than-in-python
> >>>>>>>>>
> >>>>>>>>> Best Regards,
> >>>>>>>>> *Jayjeet Chakraborty*
> >>>>>>>>> Ph.D. Student
> >>>>>>>>> Department of Computer Science and Engineering
> >>>>>>>>> University of California, Santa Cruz
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> *Jayjeet Chakraborty*
> >>>>>>>>> B.Tech in Computer Sc. and Engineering
> >>>>>>>>> National Institute Of Technology, Durgapur
> >>>>>>>>> West Bengal, India
> >>>>>>>>> M: (+91) 8436500886
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>  
> >>>>>>>>
> >>>>>>>>  
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Niranda Perera
> >>>>>>> https://niranda.dev/
> >>>>>>> @n1r44 <https://twitter.com/N1R44>
> >>>>>>>
> >>>>>>>  
> >>>>>>
> >>>>>>  
> >>>>>  
> >>>>
> >>>> --
> >>>> *Jayjeet Chakraborty*
> >>>> CS PhD student
> >>>> UC Santa Cruz
> >>>> California, USA
> >>>>
> >>>>  
> >>>
> >>>
> >>>  
> >>
> >>
> >>  
> 


