Re: C++ version of Arrow slower than Python version

Weston Pace Tue, 01 Mar 2022 15:04:36 -0800

Does setting UseAsync on the C++ end make a difference?  It's possible we
switched the default to async in python in 6.0.0 but not in C++.
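
One way to check this is to time the same read under each setting and compare the averages. Below is a minimal, standard-library-only sketch of such a harness, modeled on the timing loop quoted further down. The names `time_runs` and `compare` are hypothetical helpers, not part of Arrow, and the workloads in the demo are placeholders. In a real run each callable would wrap the dataset read from the quoted script, once per scanner setting.

```python
import gc
import time
from statistics import mean


def time_runs(read_fn, repeats=10):
    """Call read_fn repeatedly and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        result = read_fn()
        timings.append(time.perf_counter() - start)
        # Release the result before the next run so memory does not pile up
        # (the quoted Python script needed this to avoid being OOM-killed).
        del result
        gc.collect()
    return timings


def compare(configs, repeats=10):
    """Map each label to the mean run time of its read function."""
    return {label: mean(time_runs(fn, repeats)) for label, fn in configs.items()}


if __name__ == "__main__":
    # Placeholder workloads (assumption): swap in callables that perform the
    # actual dataset read with the settings being compared.
    configs = {
        "setting_a": lambda: sum(range(100_000)),
        "setting_b": lambda: sum(range(200_000)),
    }
    for label, seconds in compare(configs, repeats=3).items():
        print(f"{label}: {seconds:.4f}s")
```

Keeping the measurement loop identical for every configuration means any gap in the averages comes from the setting under test rather than from the harness.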


On Tue, Mar 1, 2022, 11:35 Niranda Perera <niranda.per...@gmail.com> wrote:

> Oh, I forgot to mention: I had to fix LD_LIBRARY_PATH when running the C++
> executable.
> LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH ./dataset_bench
>
> On Tue, Mar 1, 2022 at 4:34 PM Niranda Perera <niranda.per...@gmail.com>
> wrote:
>
>> @Jayjeet,
>>
>> I ran your example on my desktop, and I don't see any timing issues
>> there. I used conda to install pyarrow==6.0.0.
>> I used the following command:
>> g++ -O3 -std=c++11 dataset_bench.cc -I"$CONDA_PREFIX"/include
>> -L"$CONDA_PREFIX"/lib -larrow -larrow_dataset -lparquet -o dataset_bench
>>
>> I also had to `del` the objects in the Python file, because the process
>> was getting killed due to OOM.
>> ```
>> ...
>>     for i in range(10):
>>         s = time.time()
>>         dataset_ = ds.dataset("/home/niranda/flight_dataset",
>>                               format="parquet")
>>         table = dataset_.to_table(use_threads=False)
>>         e = time.time()
>>         print(e - s)
>>
>>         del table
>>         del dataset_
>>         gc.collect()
>> ```
>>
>> For me, C++ takes around 21s and Python around 22s, which is expected.
>>
>>
>> On Tue, Mar 1, 2022 at 2:19 PM Jayjeet Chakraborty <
>> jayjeetchakrabort...@gmail.com> wrote:
>>
>>> Hi Sasha,
>>>
>>> Thanks a lot for replying. I tried -O2 earlier but it didn't work. I
>>> tried it again (when compiling with PyArrow SO files) and unfortunately, it
>>> didn't improve the results.
>>>
>>> On Tue, Mar 1, 2022 at 11:14 AM Sasha Krassovsky <
>>> krassovskysa...@gmail.com> wrote:
>>>
>>>> Hi Jayjeet,
>>>> I noticed that you're not compiling dataset_bench with optimizations
>>>> enabled. I'm not sure how much it will help, but it may be worth adding
>>>> `-O2` to your g++ invocation.
>>>>
>>>> Sasha Krassovsky
>>>>
>>>> On Tue, Mar 1, 2022 at 11:11 AM Jayjeet Chakraborty <
>>>> jayjeetchakrabort...@gmail.com> wrote:
>>>>
>>>>> Hi Niranda, David,
>>>>>
>>>>> I ran my benchmarks again with the PyArrow .SO libraries which should
>>>>> be optimized. PyArrow version was 6.0.1 installed from pip. Here are my
>>>>> new results [1]. Numbers didn't quite seem to improve. You can check my
>>>>> build config in the Makefile [2]. I created a README [3] to make it easy
>>>>> for you to reproduce on your end. Thanks.
>>>>>
>>>>> [1]
>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench/optimized
>>>>> [2]
>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/Makefile
>>>>> [3]
>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/README.md
>>>>>
>>>>> On Tue, Mar 1, 2022 at 10:04 AM Niranda Perera <
>>>>> niranda.per...@gmail.com> wrote:
>>>>>
>>>>>> Hi Jayjeet,
>>>>>>
>>>>>> Could you try building your cpp project against the arrow.so in
>>>>>> pyarrow installation? It should be in the lib directory in your python
>>>>>> environment.
>>>>>>
>>>>>> Best
>>>>>>
>>>>>> On Tue, Mar 1, 2022 at 12:46 PM Jayjeet Chakraborty <
>>>>>> jayjeetchakrabort...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for your reply, David.
>>>>>>>
>>>>>>> 1) I used PyArrow 6.0.1 for both C++ and Python.
>>>>>>> 2) The dataset was deployed using this [1] script.
>>>>>>> 3) For C++, Arrow was built from source in release mode. You can see
>>>>>>> the CMake config here [2].
>>>>>>>
>>>>>>> I think I need to test once with Arrow C++ installed from packages
>>>>>>> instead of me building it. That might be the issue.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/common/deploy_data.sh
>>>>>>> [2]
>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/cpp
>>>>>>>
>>>>>>> Best,
>>>>>>> Jayjeet
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 1, 2022 at 5:04 AM David Li <lidav...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hi Jayjeet,
>>>>>>>>
>>>>>>>> That's odd since the Python API is just wrapping the C++ API, so
>>>>>>>> they should be identical if everything is configured the same. (So is
>>>>>>>> the Java API, incidentally.) That's effectively what the SO question
>>>>>>>> is saying.
>>>>>>>>
>>>>>>>> What versions of PyArrow and Arrow are you using? Just to check the
>>>>>>>> obvious things, was Arrow compiled with optimizations? And if we want
>>>>>>>> to replicate this, is it possible to get the dataset?
>>>>>>>>
>>>>>>>> -David
>>>>>>>>
>>>>>>>> On Tue, Mar 1, 2022, at 01:52, Jayjeet Chakraborty wrote:
>>>>>>>>
>>>>>>>> Hi Arrow community,
>>>>>>>>
>>>>>>>> I was working on a class project for benchmarking Apache Arrow
>>>>>>>> Dataset API in different programming languages. I found out that for
>>>>>>>> some reason the C++ API example is slower than the Python API example.
>>>>>>>> I ran my benchmarks on a 5 GB dataset consisting of 300 16MB parquet
>>>>>>>> files. I tried my best to cross verify if all the parameters are
>>>>>>>> similar in the Python and C++ examples. It would be great to know if
>>>>>>>> someone had similar observations in the past and if the reason for
>>>>>>>> this is known. I would really like to know more about this phenomenon.
>>>>>>>> You can find the code and the results here [1]. I found a similar
>>>>>>>> issue here [2] but I couldn't understand the exact reason. Thanks a
>>>>>>>> lot for your help.
>>>>>>>>
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench
>>>>>>>>
>>>>>>>> [2]
>>>>>>>> https://stackoverflow.com/questions/67856457/reading-parquet-file-is-slower-in-c-than-in-python
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> *Jayjeet Chakraborty*
>>>>>>>> Ph.D. Student
>>>>>>>> Department of Computer Science and Engineering
>>>>>>>> University of California, Santa Cruz
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Jayjeet Chakraborty*
>>>>>>>> B.Tech in Computer Sc. and Engineering
>>>>>>>> National Institute Of Technology, Durgapur
>>>>>>>> West Bengal, India
>>>>>>>> M: (+91) 8436500886
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Niranda Perera
>>>>>> https://niranda.dev/
>>>>>> @n1r44 <https://twitter.com/N1R44>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> *Jayjeet Chakraborty*
>>> CS PhD student
>>> UC Santa Cruz
>>> California, USA
>>>
>>>
>>
>>
>>
>
>
>
