@Jayjeet,
I ran your example on my desktop, and I don't see any timing issues there.
I installed pyarrow==6.0.0 with conda and compiled with the following command:
g++ -O3 -std=c++11 dataset_bench.cc -I"$CONDA_PREFIX"/include \
  -L"$CONDA_PREFIX"/lib -larrow -larrow_dataset -lparquet -o dataset_bench
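In case it helps when switching between Arrow installs (conda, pip, or a source build), here is a minimal sketch that assembles the same g++ invocation from an include directory and a library directory. The helper name build_compile_command and the "/path/to/env" fallback are my own placeholders, not anything from your repo.

```python
import os
import shlex


def build_compile_command(include_dir, lib_dir,
                          source="dataset_bench.cc", output="dataset_bench"):
    """Return the g++ argument list used above, for the given Arrow directories."""
    return [
        "g++", "-O3", "-std=c++11", source,
        "-I" + include_dir,
        "-L" + lib_dir,
        "-larrow", "-larrow_dataset", "-lparquet",
        "-o", output,
    ]


if __name__ == "__main__":
    # CONDA_PREFIX is set by `conda activate`; the fallback is only a placeholder.
    prefix = os.environ.get("CONDA_PREFIX", "/path/to/env")
    cmd = build_compile_command(os.path.join(prefix, "include"),
                                os.path.join(prefix, "lib"))
    # Print a shell-safe version of the command instead of running it.
    print(shlex.join(cmd))
```

For a pip-installed pyarrow, pyarrow.get_include() and pyarrow.get_library_dirs() should give the two directories to pass in, though I have not tried that route here.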
I also had to del the objects in the Python script between iterations, because
the process was getting killed due to OOM.
```
...
for i in range(10):
    s = time.time()
    dataset_ = ds.dataset("/home/niranda/flight_dataset", format="parquet")
    table = dataset_.to_table(use_threads=False)
    e = time.time()
    print(e - s)
    # Release the table and dataset before the next iteration to avoid OOM.
    del table
    del dataset_
    gc.collect()
```
For me, C++ takes about 21 s and Python about 22 s, which is expected.
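If you want to compare the two more carefully than eyeballing ten printed numbers, something along these lines may help. It is only a rough sketch: time_runs and summarize are hypothetical helper names, and the workload in the main block is a stand-in that you would replace with the dataset read from the loop above.

```python
import gc
import statistics
import time


def time_runs(load, runs=10):
    """Call `load()` `runs` times and return the wall-clock seconds of each call."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        result = load()
        timings.append(time.perf_counter() - start)
        # Drop the result before the next run so only one copy is alive at a time.
        del result
        gc.collect()
    return timings


def summarize(timings):
    """Return (mean, sample standard deviation) of a list of timings."""
    mean = statistics.mean(timings)
    stdev = statistics.stdev(timings) if len(timings) > 1 else 0.0
    return mean, stdev


if __name__ == "__main__":
    # Stand-in workload; swap in the ds.dataset(...).to_table(...) call to benchmark Arrow.
    timings = time_runs(lambda: list(range(1_000_000)), runs=5)
    mean, stdev = summarize(timings)
    print(f"mean={mean:.3f}s stdev={stdev:.3f}s")
```

Reporting a mean and spread over the runs makes it easier to tell whether a one-second gap between C++ and Python is real or just noise.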
On Tue, Mar 1, 2022 at 2:19 PM Jayjeet Chakraborty <
[email protected]> wrote:
> Hi Sasha,
>
> Thanks a lot for replying. I tried -O2 earlier but it didn't work. I tried
> it again (when compiling with PyArrow SO files) and unfortunately, it
> didn't improve the results.
>
> On Tue, Mar 1, 2022 at 11:14 AM Sasha Krassovsky <
> [email protected]> wrote:
>
>> Hi Jayjeet,
>> I noticed that you're not compiling dataset_bench with optimizations
>> enabled. I'm not sure how much it will help, but it may be worth adding
>> `-O2` to your g++ invocation.
>>
>> Sasha Krassovsky
>>
>> On Tue, Mar 1, 2022 at 11:11 AM Jayjeet Chakraborty <
>> [email protected]> wrote:
>>
>>> Hi Niranda, David,
>>>
>>> I ran my benchmarks again with the PyArrow .SO libraries which should be
>>> optimized. PyArrow version was 6.0.1 installed from pip. Here are my new
>>> results [1]. Numbers didn't quite seem to improve. You can check my build
>>> config in the Makefile [2]. I created a README [3] to make it easy for you
>>> to reproduce on your end. Thanks.
>>>
>>> [1]
>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench/optimized
>>> [2]
>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/Makefile
>>> [3]
>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/README.md
>>>
>>> On Tue, Mar 1, 2022 at 10:04 AM Niranda Perera <[email protected]>
>>> wrote:
>>>
>>>> Hi Jayjeet,
>>>>
>>>> Could you try building your C++ project against the arrow.so in your pyarrow
>>>> installation? It should be in the lib directory of your Python environment.
>>>>
>>>> Best
>>>>
>>>> On Tue, Mar 1, 2022 at 12:46 PM Jayjeet Chakraborty <
>>>> [email protected]> wrote:
>>>>
>>>>> Thanks for your reply, David.
>>>>>
>>>>> 1) I used PyArrow 6.0.1 for both C++ and Python.
>>>>> 2) The dataset was deployed using this [1] script.
>>>>> 3) For C++, Arrow was built from source in release mode. You can see
>>>>> the CMake config here [2].
>>>>>
>>>>> I think I need to test once with Arrow C++ installed from packages
>>>>> instead of me building it. That might be the issue.
>>>>>
>>>>> [1]
>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/common/deploy_data.sh
>>>>> [2]
>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/cpp
>>>>>
>>>>> Best,
>>>>> Jayjeet
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 1, 2022 at 5:04 AM David Li <[email protected]> wrote:
>>>>>
>>>>>> Hi Jayjeet,
>>>>>>
>>>>>> That's odd since the Python API is just wrapping the C++ API, so they
>>>>>> should be identical if everything is configured the same. (So is the Java
>>>>>> API, incidentally.) That's effectively what the SO question is saying.
>>>>>>
>>>>>> What versions of PyArrow and Arrow are you using? Just to check the
>>>>>> obvious things, was Arrow compiled with optimizations? And if we want to
>>>>>> replicate this, is it possible to get the dataset?
>>>>>>
>>>>>> -David
>>>>>>
>>>>>> On Tue, Mar 1, 2022, at 01:52, Jayjeet Chakraborty wrote:
>>>>>>
>>>>>> Hi Arrow community,
>>>>>>
>>>>>> I was working on a class project for benchmarking Apache Arrow
>>>>>> Dataset API in different programming languages. I found out that for some
>>>>>> reason the C++ API example is slower than the Python API example. I ran my
>>>>>> benchmarks on a 5 GB dataset consisting of 300 16MB parquet files. I tried
>>>>>> my best to cross verify if all the parameters are similar in the Python and
>>>>>> C++ examples. It would be great to know if someone had similar observations
>>>>>> in the past and if the reason for this is known. I would really like to
>>>>>> know more about this phenomenon. You can find the code and the results here
>>>>>> [1]. I found a similar issue here [2] but I couldn't understand the exact
>>>>>> reason. Thanks a lot for your help.
>>>>>>
>>>>>>
>>>>>> [1]
>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench
>>>>>>
>>>>>> [2]
>>>>>> https://stackoverflow.com/questions/67856457/reading-parquet-file-is-slower-in-c-than-in-python
>>>>>>
>>>>>> Best Regards,
>>>>>> *Jayjeet Chakraborty*
>>>>>> Ph.D. Student
>>>>>> Department of Computer Science and Engineering
>>>>>> University of California, Santa Cruz
>>>>>>
>>>>>> --
>>>>>> *Jayjeet Chakraborty*
>>>>>> B.Tech in Computer Sc. and Engineering
>>>>>> National Institute Of Technology, Durgapur
>>>>>> West Bengal, India
>>>>>> M: (+91) 8436500886
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Niranda Perera
>>>> https://niranda.dev/
>>>> @n1r44 <https://twitter.com/N1R44>
>>>>
>>>>
>>>
>>
>
> --
> *Jayjeet Chakraborty*
> CS PhD student
> UC Santa Cruz
> California, USA
>
>
--
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>