Oh, I forgot to mention: I had to set LD_LIBRARY_PATH when running the C++ executable, so that it picks up the Arrow libraries from the conda environment:

    LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH ./dataset_bench

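If it is handy to launch the C++ binary from the same Python script that runs the Python benchmark, the same environment fix can be sketched roughly as below. This is only an illustration: `env_with_conda_lib` and `time_command` are hypothetical helper names, not part of Arrow or of the benchmark repository, and the timing covers the whole process including startup.

```python
import os
import subprocess
import time


def env_with_conda_lib(base_env):
    """Return a copy of base_env with $CONDA_PREFIX/lib prepended to LD_LIBRARY_PATH."""
    env = dict(base_env)
    prefix = env.get("CONDA_PREFIX")
    if not prefix:
        # Not inside an activated conda environment: leave the search path untouched.
        return env
    lib_dir = os.path.join(prefix, "lib")
    existing = env.get("LD_LIBRARY_PATH", "")
    # Only add the separator when there is an existing value, so no empty
    # path entry (which the loader treats as the current directory) is created.
    env["LD_LIBRARY_PATH"] = lib_dir + (":" + existing if existing else "")
    return env


def time_command(cmd, env):
    """Run cmd once with the given environment and return wall-clock seconds."""
    start = time.time()
    subprocess.run(cmd, env=env, check=True)
    return time.time() - start
```

With the binary built as in the quoted g++ command, `time_command(["./dataset_bench"], env_with_conda_lib(os.environ))` should behave like the shell one-liner above.
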
On Tue, Mar 1, 2022 at 4:34 PM Niranda Perera <[email protected]> wrote:

> @Jayjeet,
>
> I ran your example on my desktop, and I don't see any timing issues there.
> I used conda to install pyarrow==6.0.0
> I used the following command
> g++ -O3 -std=c++11 dataset_bench.cc -I"$CONDA_PREFIX"/include -L"$CONDA_PREFIX"/lib -larrow -larrow_dataset -lparquet -o dataset_bench
>
> And I had to del the objects in the Python file, because it was getting
> killed due to OOM.
> ```
> ...
> for i in range(10):
>     s = time.time()
>     dataset_ = ds.dataset("/home/niranda/flight_dataset", format="parquet")
>     table = dataset_.to_table(use_threads=False)
>     e = time.time()
>     print(e - s)
>
>     del table
>     del dataset_
>     gc.collect()
> ```
>
> For me C++ takes around ~21s and Python ~22s, which is expected.
>
>
> On Tue, Mar 1, 2022 at 2:19 PM Jayjeet Chakraborty <[email protected]> wrote:
>
>> Hi Sasha,
>>
>> Thanks a lot for replying. I tried -O2 earlier but it didn't work. I
>> tried it again (when compiling with PyArrow SO files) and unfortunately, it
>> didn't improve the results.
>>
>> On Tue, Mar 1, 2022 at 11:14 AM Sasha Krassovsky <[email protected]> wrote:
>>
>>> Hi Jayjeet,
>>> I noticed that you're not compiling dataset_bench with optimizations
>>> enabled. I'm not sure how much it will help, but it may be worth adding
>>> `-O2` to your g++ invocation.
>>>
>>> Sasha Krassovsky
>>>
>>> On Tue, Mar 1, 2022 at 11:11 AM Jayjeet Chakraborty <[email protected]> wrote:
>>>
>>>> Hi Niranda, David,
>>>>
>>>> I ran my benchmarks again with the PyArrow .SO libraries which should
>>>> be optimized. PyArrow version was 6.0.1 installed from pip. Here are my new
>>>> results [1]. Numbers didn't quite seem to improve. You can check my build
>>>> config in the Makefile [2]. I created a README [3] to make it easy for you
>>>> to reproduce on your end. Thanks.
>>>>
>>>> [1] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench/optimized
>>>> [2] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/Makefile
>>>> [3] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/README.md
>>>>
>>>> On Tue, Mar 1, 2022 at 10:04 AM Niranda Perera <[email protected]> wrote:
>>>>
>>>>> Hi Jayjeet,
>>>>>
>>>>> Could you try building your cpp project against the arrow.so in
>>>>> pyarrow installation? It should be in the lib directory in your python
>>>>> environment.
>>>>>
>>>>> Best
>>>>>
>>>>> On Tue, Mar 1, 2022 at 12:46 PM Jayjeet Chakraborty <[email protected]> wrote:
>>>>>
>>>>>> Thanks for your reply, David.
>>>>>>
>>>>>> 1) I used PyArrow 6.0.1 for both C++ and Python.
>>>>>> 2) The dataset was deployed using this [1] script.
>>>>>> 3) For C++, Arrow was built from source in release mode. You can see
>>>>>> the CMake config here [2].
>>>>>>
>>>>>> I think I need to test once with Arrow C++ installed from packages
>>>>>> instead of me building it. That might be the issue.
>>>>>>
>>>>>> [1] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/common/deploy_data.sh
>>>>>> [2] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/cpp
>>>>>>
>>>>>> Best,
>>>>>> Jayjeet
>>>>>>
>>>>>> On Tue, Mar 1, 2022 at 5:04 AM David Li <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Jayjeet,
>>>>>>>
>>>>>>> That's odd since the Python API is just wrapping the C++ API, so
>>>>>>> they should be identical if everything is configured the same. (So is
>>>>>>> the Java API, incidentally.) That's effectively what the SO question
>>>>>>> is saying.
>>>>>>>
>>>>>>> What versions of PyArrow and Arrow are you using? Just to check the
>>>>>>> obvious things, was Arrow compiled with optimizations? And if we want to
>>>>>>> replicate this, is it possible to get the dataset?
>>>>>>>
>>>>>>> -David
>>>>>>>
>>>>>>> On Tue, Mar 1, 2022, at 01:52, Jayjeet Chakraborty wrote:
>>>>>>>
>>>>>>> Hi Arrow community,
>>>>>>>
>>>>>>> I was working on a class project for benchmarking Apache Arrow
>>>>>>> Dataset API in different programming languages. I found out that for
>>>>>>> some reason the C++ API example is slower than the Python API example.
>>>>>>> I ran my benchmarks on a 5 GB dataset consisting of 300 16MB parquet
>>>>>>> files. I tried my best to cross verify if all the parameters are
>>>>>>> similar in the Python and C++ examples. It would be great to know if
>>>>>>> someone had similar observations in the past and if the reason for
>>>>>>> this is known. I would really like to know more about this phenomenon.
>>>>>>> You can find the code and the results here [1]. I found a similar
>>>>>>> issue here [2] but I couldn't understand the exact reason. Thanks a
>>>>>>> lot for your help.
>>>>>>>
>>>>>>> [1] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench
>>>>>>> [2] https://stackoverflow.com/questions/67856457/reading-parquet-file-is-slower-in-c-than-in-python
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> *Jayjeet Chakraborty*
>>>>>>> Ph.D. Student
>>>>>>> Department of Computer Science and Engineering
>>>>>>> University of California, Santa Cruz
>>>>>>>
>>>>>>> --
>>>>>>> *Jayjeet Chakraborty*
>>>>>>> B.Tech in Computer Sc. and Engineering
>>>>>>> National Institute Of Technology, Durgapur
>>>>>>> West Bengal, India
>>>>>>> M: (+91) 8436500886
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Jayjeet Chakraborty*
>>>>>> B.Tech in Computer Sc. and Engineering
>>>>>> National Institute Of Technology, Durgapur
>>>>>> West Bengal, India
>>>>>> M: (+91) 8436500886
>>>>>>
>>>>>
>>>>> --
>>>>> Niranda Perera
>>>>> https://niranda.dev/
>>>>> @n1r44 <https://twitter.com/N1R44>
>>>>>
>>>>
>>>> --
>>>> *Jayjeet Chakraborty*
>>>> B.Tech in Computer Sc. and Engineering
>>>> National Institute Of Technology, Durgapur
>>>> West Bengal, India
>>>> M: (+91) 8436500886
>>>>
>>
>> --
>> *Jayjeet Chakraborty*
>> CS PhD student
>> UC Santa Cruz
>> California, USA
>>
>
> --
> Niranda Perera
> https://niranda.dev/
> @n1r44 <https://twitter.com/N1R44>
>

--
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>
