I think you should try release build mode!
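
To compare timings before and after a change such as switching build modes, the timing loop quoted further down can be wrapped in a small helper. This is only a minimal sketch: `time_runs` and `load_dataset` are hypothetical names, not part of PyArrow, and the dataset path is whatever you used in your own runs. It keeps the `del` plus `gc.collect()` step from the quoted loop so that memory from one run does not carry into the next.

```python
import gc
import time


def time_runs(load, repeats=10):
    """Call load() `repeats` times and return wall-clock seconds per call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        result = load()
        durations.append(time.perf_counter() - start)
        # Release the result and collect before the next run, so memory
        # from one iteration does not pile up across iterations.
        del result
        gc.collect()
    return durations


def load_dataset(path):
    """Read a parquet dataset into a table, single-threaded (hypothetical helper)."""
    # Imported lazily so that time_runs stays usable without pyarrow installed.
    import pyarrow.dataset as ds

    return ds.dataset(path, format="parquet").to_table(use_threads=False)
```

For example, `print(time_runs(lambda: load_dataset("/path/to/flight_dataset")))` prints one duration per run. As in the quoted loop, the time to open the dataset is included in each measurement.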

On Wed, Mar 2, 2022 at 12:21 PM Jayjeet Chakraborty <
[email protected]> wrote:

> Thanks for all the help, everyone. I was able to follow Niranda's steps and
> get the same perf in both C++ and Python. But I still don't know which
> optimizations are essential when compiling Arrow in C++. Can anyone please
> share some pointers on that? I think documenting the essential C++
> optimizations in some way will help people in the future. Thanks again.
>
> On Tue, Mar 1, 2022 at 3:04 PM Weston Pace <[email protected]> wrote:
>
>> Does setting UseAsync on the C++ end make a difference?  It's possible we
>> switched the default to async in python in 6.0.0 but not in C++.
>>
>> On Tue, Mar 1, 2022, 11:35 Niranda Perera <[email protected]>
>> wrote:
>>
>>> Oh, I forgot to mention, had to fix LD_LIBRARY_PATH when running the c++
>>> executable.
>>> LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH ./dataset_bench
>>>
>>> On Tue, Mar 1, 2022 at 4:34 PM Niranda Perera <[email protected]>
>>> wrote:
>>>
>>>> @Jayjeet,
>>>>
>>>> I ran your example on my desktop, and I don't see any timing issues
>>>> there. I used conda to install pyarrow==6.0.0.
>>>> I used the following command:
>>>> g++ -O3 -std=c++11 dataset_bench.cc -I"$CONDA_PREFIX"/include
>>>> -L"$CONDA_PREFIX"/lib -larrow -larrow_dataset -lparquet -o dataset_bench
>>>>
>>>> And I had to del the objects in the python file, because it was getting
>>>> killed due to OOM.
>>>> ```
>>>> ...
>>>>     for i in range(10):
>>>>         s = time.time()
>>>>         dataset_ = ds.dataset("/home/niranda/flight_dataset",
>>>>                               format="parquet")
>>>>         table = dataset_.to_table(use_threads=False)
>>>>         e = time.time()
>>>>         print(e - s)
>>>>
>>>>         del table
>>>>         del dataset_
>>>>         gc.collect()
>>>> ```
>>>>
>>>> For me, C++ takes ~21s and Python ~22s, which is expected.
>>>>
>>>>
>>>> On Tue, Mar 1, 2022 at 2:19 PM Jayjeet Chakraborty <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Sasha,
>>>>>
>>>>> Thanks a lot for replying. I tried -O2 earlier but it didn't work. I
>>>>> tried it again (when compiling with PyArrow SO files) and unfortunately,
>>>>> it didn't improve the results.
>>>>>
>>>>> On Tue, Mar 1, 2022 at 11:14 AM Sasha Krassovsky <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Jayjeet,
>>>>>> I noticed that you're not compiling dataset_bench with optimizations
>>>>>> enabled. I'm not sure how much it will help, but it may be worth adding
>>>>>> `-O2` to your g++ invocation.
>>>>>>
>>>>>> Sasha Krassovsky
>>>>>>
>>>>>> On Tue, Mar 1, 2022 at 11:11 AM Jayjeet Chakraborty <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi Niranda, David,
>>>>>>>
>>>>>>> I ran my benchmarks again with the PyArrow .SO libraries which
>>>>>>> should be optimized. PyArrow version was 6.0.1 installed from pip. Here
>>>>>>> are my new results [1]. Numbers didn't quite seem to improve. You can
>>>>>>> check my build config in the Makefile [2]. I created a README [3] to
>>>>>>> make it easy for you to reproduce on your end. Thanks.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench/optimized
>>>>>>> [2]
>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/Makefile
>>>>>>> [3]
>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/README.md
>>>>>>>
>>>>>>> On Tue, Mar 1, 2022 at 10:04 AM Niranda Perera <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Jayjeet,
>>>>>>>>
>>>>>>>> Could you try building your cpp project against the arrow.so in
>>>>>>>> pyarrow installation? It should be in the lib directory in your python
>>>>>>>> environment.
>>>>>>>>
>>>>>>>> Best
>>>>>>>>
>>>>>>>> On Tue, Mar 1, 2022 at 12:46 PM Jayjeet Chakraborty <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks for your reply, David.
>>>>>>>>>
>>>>>>>>> 1) I used PyArrow 6.0.1 for both C++ and Python.
>>>>>>>>> 2) The dataset was deployed using this [1] script.
>>>>>>>>> 3) For C++, Arrow was built from source in release mode. You can
>>>>>>>>> see the CMake config here [2].
>>>>>>>>>
>>>>>>>>> I think I need to test once with Arrow C++ installed from packages
>>>>>>>>> instead of me building it. That might be the issue.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/common/deploy_data.sh
>>>>>>>>> [2]
>>>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/cpp
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Jayjeet
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Mar 1, 2022 at 5:04 AM David Li <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Jayjeet,
>>>>>>>>>>
>>>>>>>>>> That's odd since the Python API is just wrapping the C++ API, so
>>>>>>>>>> they should be identical if everything is configured the same. (So
>>>>>>>>>> is the Java API, incidentally.) That's effectively what the SO
>>>>>>>>>> question is saying.
>>>>>>>>>>
>>>>>>>>>> What versions of PyArrow and Arrow are you using? Just to check
>>>>>>>>>> the obvious things, was Arrow compiled with optimizations? And if we
>>>>>>>>>> want to replicate this, is it possible to get the dataset?
>>>>>>>>>>
>>>>>>>>>> -David
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 1, 2022, at 01:52, Jayjeet Chakraborty wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Arrow community,
>>>>>>>>>>
>>>>>>>>>> I was working on a class project for benchmarking the Apache Arrow
>>>>>>>>>> Dataset API in different programming languages. I found out that for
>>>>>>>>>> some reason the C++ API example is slower than the Python API example.
>>>>>>>>>> I ran my benchmarks on a 5 GB dataset consisting of 300 16 MB parquet
>>>>>>>>>> files. I tried my best to cross-verify that all the parameters are
>>>>>>>>>> the same in the Python and C++ examples. It would be great to know if
>>>>>>>>>> someone had similar observations in the past and if the reason for
>>>>>>>>>> this is known. I would really like to know more about this phenomenon.
>>>>>>>>>> You can find the code and the results here [1]. I found a similar
>>>>>>>>>> issue here [2] but I couldn't understand the exact reason. Thanks a
>>>>>>>>>> lot for your help.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench
>>>>>>>>>>
>>>>>>>>>> [2]
>>>>>>>>>> https://stackoverflow.com/questions/67856457/reading-parquet-file-is-slower-in-c-than-in-python
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>> *Jayjeet Chakraborty*
>>>>>>>>>> Ph.D. Student
>>>>>>>>>> Department of Computer Science and Engineering
>>>>>>>>>> University of California, Santa Cruz
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> *Jayjeet Chakraborty*
>>>>>>>>>> B.Tech in Computer Sc. and Engineering
>>>>>>>>>> National Institute Of Technology, Durgapur
>>>>>>>>>> West Bengal, India
>>>>>>>>>> M: (+91) 8436500886
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Niranda Perera
>>>>>>>> https://niranda.dev/
>>>>>>>> @n1r44 <https://twitter.com/N1R44>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> *Jayjeet Chakraborty*
>>>>> CS PhD student
>>>>> UC Santa Cruz
>>>>> California, USA
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>
>
>

-- 
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>
