Oh, I forgot to mention: I had to set LD_LIBRARY_PATH when running the C++
executable:
LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH ./dataset_bench
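
For anyone scripting the runs, here is a minimal sketch of the same environment fix in Python. It assumes a conda environment is active (so CONDA_PREFIX is set) and that the binary is ./dataset_bench in the current directory. The helper names `env_with_conda_lib` and `run_bench` are hypothetical, not part of the benchmark repository. Unlike the one-liner above, it avoids leaving a trailing ":" when LD_LIBRARY_PATH was previously unset.

```python
import os
import subprocess


def env_with_conda_lib(base_env):
    """Return a copy of base_env with $CONDA_PREFIX/lib prepended to LD_LIBRARY_PATH."""
    env = dict(base_env)
    conda_prefix = env.get("CONDA_PREFIX")
    if not conda_prefix:
        # No active conda environment: leave the environment unchanged.
        return env
    lib_dir = os.path.join(conda_prefix, "lib")
    existing = env.get("LD_LIBRARY_PATH", "")
    # Only add the ":" separator when there is an existing value to keep.
    env["LD_LIBRARY_PATH"] = f"{lib_dir}:{existing}" if existing else lib_dir
    return env


def run_bench(executable="./dataset_bench"):
    """Run the benchmark binary with the adjusted environment (hypothetical wrapper)."""
    return subprocess.run([executable], env=env_with_conda_lib(os.environ), check=True)
```

Calling `run_bench()` from the directory that contains the compiled binary should then behave like the shell command above.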

On Tue, Mar 1, 2022 at 4:34 PM Niranda Perera <[email protected]>
wrote:

> @Jayjeet,
>
> I ran your example on my desktop, and I don't see any timing issues there.
> I used conda to install pyarrow==6.0.0.
> I used the following command:
> g++ -O3 -std=c++11 dataset_bench.cc -I"$CONDA_PREFIX"/include
> -L"$CONDA_PREFIX"/lib -larrow -larrow_dataset -lparquet -o dataset_bench
>
> I also had to `del` the objects in the Python file, because the process
> was getting killed due to OOM.
> ```
> ...
>     for i in range(10):
>         s = time.time()
>         dataset_ = ds.dataset("/home/niranda/flight_dataset",
> format="parquet")
>         table = dataset_.to_table(use_threads=False)
>         e = time.time()
>         print(e - s)
>
>         del table
>         del dataset_
>         gc.collect()
> ```
>
> For me, C++ takes ~21s and Python ~22s, which is expected.
>
>
> On Tue, Mar 1, 2022 at 2:19 PM Jayjeet Chakraborty <
> [email protected]> wrote:
>
>> Hi Sasha,
>>
>> Thanks a lot for replying. I tried -O2 earlier but it didn't work. I
>> tried it again (when compiling with PyArrow SO files) and unfortunately, it
>> didn't improve the results.
>>
>> On Tue, Mar 1, 2022 at 11:14 AM Sasha Krassovsky <
>> [email protected]> wrote:
>>
>>> Hi Jayjeet,
>>> I noticed that you're not compiling dataset_bench with optimizations
>>> enabled. I'm not sure how much it will help, but it may be worth adding
>>> `-O2` to your g++ invocation.
>>>
>>> Sasha Krassovsky
>>>
>>> On Tue, Mar 1, 2022 at 11:11 AM Jayjeet Chakraborty <
>>> [email protected]> wrote:
>>>
>>>> Hi Niranda, David,
>>>>
>>>> I ran my benchmarks again with the PyArrow .SO libraries which should
>>>> be optimized. PyArrow version was 6.0.1 installed from pip. Here are my new
>>>> results [1]. Numbers didn't quite seem to improve. You can check my build
>>>> config in the Makefile [2]. I created a README [3] to make it easy for you
>>>> to reproduce on your end. Thanks.
>>>>
>>>> [1]
>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench/optimized
>>>> [2]
>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/Makefile
>>>> [3]
>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/README.md
>>>>
>>>> On Tue, Mar 1, 2022 at 10:04 AM Niranda Perera <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Jayjeet,
>>>>>
>>>>> Could you try building your cpp project against the arrow.so in
>>>>> pyarrow installation? It should be in the lib directory in your python
>>>>> environment.
>>>>>
>>>>> Best
>>>>>
>>>>> On Tue, Mar 1, 2022 at 12:46 PM Jayjeet Chakraborty <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thanks for your reply, David.
>>>>>>
>>>>>> 1) I used PyArrow 6.0.1 for both C++ and Python.
>>>>>> 2) The dataset was deployed using this [1] script.
>>>>>> 3) For C++, Arrow was built from source in release mode. You can see
>>>>>> the CMake config here [2].
>>>>>>
>>>>>> I think I need to test once with Arrow C++ installed from packages
>>>>>> instead of me building it. That might be the issue.
>>>>>>
>>>>>> [1]
>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/common/deploy_data.sh
>>>>>> [2]
>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/cpp
>>>>>>
>>>>>> Best,
>>>>>> Jayjeet
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 1, 2022 at 5:04 AM David Li <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Jayjeet,
>>>>>>>
>>>>>>> That's odd since the Python API is just wrapping the C++ API, so
>>>>>>> they should be identical if everything is configured the same. (So is
>>>>>>> the Java API, incidentally.) That's effectively what the SO question
>>>>>>> is saying.
>>>>>>>
>>>>>>> What versions of PyArrow and Arrow are you using? Just to check the
>>>>>>> obvious things, was Arrow compiled with optimizations? And if we want to
>>>>>>> replicate this, is it possible to get the dataset?
>>>>>>>
>>>>>>> -David
>>>>>>>
>>>>>>> On Tue, Mar 1, 2022, at 01:52, Jayjeet Chakraborty wrote:
>>>>>>>
>>>>>>> Hi Arrow community,
>>>>>>>
>>>>>>> I was working on a class project for benchmarking Apache Arrow
>>>>>>> Dataset API in different programming languages. I found out that for
>>>>>>> some reason the C++ API example is slower than the Python API example.
>>>>>>> I ran my benchmarks on a 5 GB dataset consisting of 300 16MB parquet
>>>>>>> files. I tried my best to cross verify if all the parameters are
>>>>>>> similar in the Python and C++ examples. It would be great to know if
>>>>>>> someone had similar observations in the past and if the reason for
>>>>>>> this is known. I would really like to know more about this phenomenon.
>>>>>>> You can find the code and the results here [1]. I found a similar
>>>>>>> issue here [2] but I couldn't understand the exact reason. Thanks a
>>>>>>> lot for your help.
>>>>>>>
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench
>>>>>>>
>>>>>>> [2]
>>>>>>> https://stackoverflow.com/questions/67856457/reading-parquet-file-is-slower-in-c-than-in-python
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> *Jayjeet Chakraborty*
>>>>>>> Ph.D. Student
>>>>>>> Department of Computer Science and Engineering
>>>>>>> University of California, Santa Cruz
>>>>>>>
>>>>>>> --
>>>>>>> *Jayjeet Chakraborty*
>>>>>>> B.Tech in Computer Sc. and Engineering
>>>>>>> National Institute Of Technology, Durgapur
>>>>>>> West Bengal, India
>>>>>>> M: (+91) 8436500886
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Niranda Perera
>>>>> https://niranda.dev/
>>>>> @n1r44 <https://twitter.com/N1R44>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>> --
>> *Jayjeet Chakraborty*
>> CS PhD student
>> UC Santa Cruz
>> California, USA
>>
>>
>
> --
> Niranda Perera
> https://niranda.dev/
> @n1r44 <https://twitter.com/N1R44>
>
>

-- 
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>
