Hi Niranda, David,

I reran my benchmarks, this time linking against the .so libraries that
ship with PyArrow, which should be optimized builds. PyArrow was version
6.0.1, installed from pip. Here are my new results [1]. The numbers did not
really improve. You can check my build config in the Makefile [2]. I added
a README [3] to make it easy for you to reproduce on your end. Thanks.

[1]
https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench/optimized
[2]
https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/Makefile
[3]
https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/README.md
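
For anyone reproducing this, here is a minimal sketch of one way to derive
compiler and linker flags for building against the libraries bundled with a
pip-installed PyArrow. It is not taken from the Makefile in [2]. The helper
names build_flags and pyarrow_flags are hypothetical, and the extra library
names (arrow_dataset, parquet) are an assumption about what the wheel ships.
pyarrow.get_include(), pyarrow.get_library_dirs() and pyarrow.get_libraries()
are existing PyArrow functions.

```python
def build_flags(include_dir, library_dirs, libraries):
    """Turn an include dir, library dirs, and library names into g++-style flags."""
    flags = [f"-I{include_dir}"]
    for lib_dir in library_dirs:
        # -L locates the library at link time; -rpath locates it at run time.
        flags += [f"-L{lib_dir}", f"-Wl,-rpath,{lib_dir}"]
    flags += [f"-l{name}" for name in libraries]
    return flags


def pyarrow_flags(extra_libraries=("arrow_dataset", "parquet")):
    """Flags for the PyArrow installation in the current Python environment.

    extra_libraries is an assumption: get_libraries() may not list the
    Dataset and Parquet libraries that a Dataset API benchmark needs.
    """
    import pyarrow as pa  # imported lazily so build_flags works without PyArrow

    libraries = list(pa.get_libraries())
    for name in extra_libraries:
        if name not in libraries:
            libraries.append(name)
    return build_flags(pa.get_include(), pa.get_library_dirs(), libraries)
```

Printing " ".join(pyarrow_flags()) gives a string that can be pasted into a
compile command. Wheels typically ship versioned library files, so
pyarrow.create_library_symlinks() may need to be run once before the -l flags
resolve, and the compiler toolchain should be compatible with the one used to
build the wheel.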

On Tue, Mar 1, 2022 at 10:04 AM Niranda Perera <[email protected]>
wrote:

> Hi Jayjeet,
>
> Could you try building your C++ project against the arrow.so in your
> pyarrow installation? It should be in the lib directory of your Python
> environment.
>
> Best
>
> On Tue, Mar 1, 2022 at 12:46 PM Jayjeet Chakraborty <
> [email protected]> wrote:
>
>> Thanks for your reply, David.
>>
>> 1) I used PyArrow 6.0.1 for both C++ and Python.
>> 2) The dataset was deployed using this [1] script.
>> 3) For C++, Arrow was built from source in release mode. You can see the
>> CMake config here [2].
>>
>> I think I need to test once with Arrow C++ installed from packages
>> instead of building it myself. That might be the issue.
>>
>> [1]
>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/common/deploy_data.sh
>> [2]
>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/cpp
>>
>> Best,
>> Jayjeet
>>
>>
>>
>>
>> On Tue, Mar 1, 2022 at 5:04 AM David Li <[email protected]> wrote:
>>
>>> Hi Jayjeet,
>>>
>>> That's odd since the Python API is just wrapping the C++ API, so they
>>> should be identical if everything is configured the same. (So is the Java
>>> API, incidentally.) That's effectively what the SO question is saying.
>>>
>>> What versions of PyArrow and Arrow are you using? Just to check the
>>> obvious things, was Arrow compiled with optimizations? And if we want to
>>> replicate this, is it possible to get the dataset?
>>>
>>> -David
>>>
>>> On Tue, Mar 1, 2022, at 01:52, Jayjeet Chakraborty wrote:
>>>
>>> Hi Arrow community,
>>>
>>> I was working on a class project benchmarking the Apache Arrow Dataset
>>> API in different programming languages. I found that, for some reason,
>>> the C++ API example is slower than the Python API example. I ran my
>>> benchmarks on a 5 GB dataset consisting of 300 Parquet files of 16 MB
>>> each. I tried my best to verify that all the parameters are the same in
>>> the Python and C++ examples. It would be great to know whether anyone has
>>> made similar observations in the past and whether the reason is known. I
>>> would really like to understand this behavior. You can find the code and
>>> the results here [1]. I found a similar issue here [2], but I couldn't
>>> understand the exact reason. Thanks a lot for your help.
>>>
>>>
>>> [1]
>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench
>>>
>>> [2]
>>> https://stackoverflow.com/questions/67856457/reading-parquet-file-is-slower-in-c-than-in-python
>>>
>>> Best Regards,
>>> *Jayjeet Chakraborty*
>>> Ph.D. Student
>>> Department of Computer Science and Engineering
>>> University of California, Santa Cruz
>>>
>>> --
>>> *Jayjeet Chakraborty*
>>> B.Tech in Computer Sc. and Engineering
>>> National Institute Of Technology, Durgapur
>>> West Bengal, India
>>> M: (+91) 8436500886
>>>
>>>
>>>
>>
>> --
>> *Jayjeet Chakraborty*
>> B.Tech in Computer Sc. and Engineering
>> National Institute Of Technology, Durgapur
>> West Bengal, India
>> M: (+91) 8436500886
>>
>
>
> --
> Niranda Perera
> https://niranda.dev/
> @n1r44 <https://twitter.com/N1R44>
>
>

-- 
*Jayjeet Chakraborty*
B.Tech in Computer Sc. and Engineering
National Institute Of Technology, Durgapur
West Bengal, India
M: (+91) 8436500886
