Hi Sasha,

Thanks a lot for replying. I tried -O2 earlier, but it didn't help. I tried
it again (when compiling against the PyArrow .so files) and, unfortunately,
it didn't improve the results.

On Tue, Mar 1, 2022 at 11:14 AM Sasha Krassovsky <[email protected]>
wrote:

> Hi Jayjeet,
> I noticed that you're not compiling dataset_bench with optimizations
> enabled. I'm not sure how much it will help, but it may be worth adding
> `-O2` to your g++ invocation.
>
> Sasha Krassovsky
>
> On Tue, Mar 1, 2022 at 11:11 AM Jayjeet Chakraborty <
> [email protected]> wrote:
>
>> Hi Niranda, David,
>>
>> I ran my benchmarks again with the PyArrow .so libraries, which should be
>> optimized. The PyArrow version was 6.0.1, installed from pip. Here are my
>> new results [1]. The numbers didn't really improve. You can check my build
>> config in the Makefile [2]. I created a README [3] to make it easy for you
>> to reproduce on your end. Thanks.
>>
>> [1]
>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench/optimized
>> [2]
>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/Makefile
>> [3]
>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/README.md
>>
>> On Tue, Mar 1, 2022 at 10:04 AM Niranda Perera <[email protected]>
>> wrote:
>>
>>> Hi Jayjeet,
>>>
>>> Could you try building your C++ project against the arrow.so in your
>>> pyarrow installation? It should be in the lib directory of your Python
>>> environment.
>>>
>>> Best
>>>
>>> On Tue, Mar 1, 2022 at 12:46 PM Jayjeet Chakraborty <
>>> [email protected]> wrote:
>>>
>>>> Thanks for your reply, David.
>>>>
>>>> 1) I used PyArrow 6.0.1 for both C++ and Python.
>>>> 2) The dataset was deployed using this [1] script.
>>>> 3) For C++, Arrow was built from source in release mode. You can see
>>>> the CMake config here [2].
>>>>
>>>> I think I need to test once with Arrow C++ installed from packages
>>>> instead of me building it. That might be the issue.
>>>>
>>>> [1]
>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/common/deploy_data.sh
>>>> [2]
>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/cpp
>>>>
>>>> Best,
>>>> Jayjeet
>>>>
>>>> On Tue, Mar 1, 2022 at 5:04 AM David Li <[email protected]> wrote:
>>>>
>>>>> Hi Jayjeet,
>>>>>
>>>>> That's odd since the Python API is just wrapping the C++ API, so they
>>>>> should be identical if everything is configured the same. (So is the Java
>>>>> API, incidentally.) That's effectively what the SO question is saying.
>>>>>
>>>>> What versions of PyArrow and Arrow are you using? Just to check the
>>>>> obvious things, was Arrow compiled with optimizations? And if we want to
>>>>> replicate this, is it possible to get the dataset?
>>>>>
>>>>> -David
>>>>>
>>>>> On Tue, Mar 1, 2022, at 01:52, Jayjeet Chakraborty wrote:
>>>>>
>>>>> Hi Arrow community,
>>>>>
>>>>> I was working on a class project benchmarking the Apache Arrow Dataset
>>>>> API in different programming languages. I found that, for some reason,
>>>>> the C++ API example is slower than the Python API example. I ran my
>>>>> benchmarks on a 5 GB dataset consisting of 300 16 MB Parquet files. I
>>>>> tried my best to cross-verify that all the parameters are the same in
>>>>> the Python and C++ examples. It would be great to know if someone has
>>>>> made similar observations in the past and if the reason for this is
>>>>> known. I would really like to know more about this phenomenon. You can
>>>>> find the code and the results here [1]. I found a similar issue here
>>>>> [2], but I couldn't understand the exact reason. Thanks a lot for your
>>>>> help.
>>>>>
>>>>>
>>>>> [1]
>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench
>>>>>
>>>>> [2]
>>>>> https://stackoverflow.com/questions/67856457/reading-parquet-file-is-slower-in-c-than-in-python
>>>>>
>>>>> Best Regards,
>>>>> *Jayjeet Chakraborty*
>>>>> Ph.D. Student
>>>>> Department of Computer Science and Engineering
>>>>> University of California, Santa Cruz
>>>>
>>>
>>>
>>> --
>>> Niranda Perera
>>> https://niranda.dev/
>>> @n1r44 <https://twitter.com/N1R44>
>>>
>>>
>>
>> --
>> *Jayjeet Chakraborty*
>> B.Tech in Computer Sc. and Engineering
>> National Institute Of Technology, Durgapur
>> West Bengal, India
>> M: (+91) 8436500886
>>
>

-- 
*Jayjeet Chakraborty*
CS PhD student
UC Santa Cruz
California, USA
