Hi, I would also suggest attaching a code profiler to the process during those 2 hours and gathering some results. It might answer some questions about what is taking so long.
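For example, even without a full profiler, periodic thread-dump sampling of the TaskManager JVM would show where the time goes. A minimal sketch using the JDK's `jstack` tool (the pid, sample count, and helper names are hypothetical, not from this thread):

```python
import subprocess
import time

def jstack_cmd(pid):
    """Build the jstack command line for a given JVM pid (jstack ships with the JDK)."""
    return ["jstack", "-l", str(pid)]

def sample_thread_dumps(pid, samples=12, interval_s=10):
    """Collect periodic thread dumps from a running JVM.

    Stack frames that keep reappearing across dumps indicate
    where the job is actually spending its time.
    """
    dumps = []
    for _ in range(samples):
        result = subprocess.run(jstack_cmd(pid), capture_output=True, text=True)
        dumps.append(result.stdout)
        time.sleep(interval_s)
    return dumps
```

Pointing this (or a real sampling profiler) at the TaskManager process during the slow run should make it obvious whether the time is spent in user code, serialisation, or I/O.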
Piotrek

> On 30 Oct 2019, at 15:11, Chris Miller <chris...@gmail.com> wrote:
>
> I haven't run any benchmarks with Flink or even used it enough to directly
> help with your question, however I suspect that the following article might
> be relevant:
>
> http://dsrg.pdos.csail.mit.edu/2016/06/26/scalability-cost/
>
> Given that the computation you're performing is trivial, it's possible that the
> additional overhead of serialisation, interprocess communication, state
> management etc. that distributed systems like Flink require is dominating the
> runtime here. 2 hours (or even 25 minutes) still seems too long to me,
> however, so hopefully it really is just a configuration issue of some sort.
> Either way, if you do figure this out, or if anyone with good knowledge of the
> article above in relation to Flink is able to give their thoughts, I'd be
> very interested in hearing more.
>
> Regards,
> Chris
>
> ------ Original Message ------
> From: "Habib Mostafaei" <ha...@inet.tu-berlin.de>
> To: "Zhenghua Gao" <doc...@gmail.com>
> Cc: "user" <user@flink.apache.org>; "Georgios Smaragdakis" <georg...@inet.tu-berlin.de>; "Niklas Semmler" <nik...@inet.tu-berlin.de>
> Sent: 30/10/2019 12:25:28
> Subject: Re: low performance in running queries
>
>> Thanks Gao for the reply. I used the parallelism parameter with different
>> values like 6 and 8, but the execution time is still not comparable with a
>> single-threaded Python script. What would be a reasonable value for the
>> parallelism?
>>
>> Best,
>>
>> Habib
>>
>> On 10/30/2019 1:17 PM, Zhenghua Gao wrote:
>>> The reason might be that the parallelism of your task is only 1, which is too low.
>>> See [1] to specify a proper parallelism for your job, and the execution time
>>> should be reduced significantly.
>>>
>>> [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/parallel.html
>>>
>>> Best Regards,
>>> Zhenghua Gao
>>>
>>> On Tue, Oct 29, 2019 at 9:27 PM Habib Mostafaei <ha...@inet.tu-berlin.de> wrote:
>>> Hi all,
>>>
>>> I am running Flink on a standalone cluster and getting very long
>>> execution times for streaming queries like WordCount on a fixed text
>>> file. My VM runs Debian 10 with 16 CPU cores and 32 GB of RAM. I
>>> have a text file of size 2 GB. When I run Flink on a standalone
>>> cluster, i.e., one JobManager and one TaskManager with 25 GB of heap,
>>> it takes around two hours to finish counting this file, while a simple
>>> Python script can do it in around 7 minutes. I am just wondering what is
>>> wrong with my setup. I also ran the experiments on a cluster with six
>>> TaskManagers, but I still get a very long execution time, 25 minutes
>>> or so. I tried increasing the JVM heap size to get a lower execution
>>> time, but it did not help. I attached the log file and the Flink
>>> configuration file to this email.
>>>
>>> Best,
>>>
>>> Habib
>>>
>> --
>> Habib Mostafaei, Ph.D.
>> Postdoctoral researcher
>> TU Berlin,
>> FG INET, MAR 4.003
>> Marchstraße 23, 10587 Berlin
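For reference, the single-threaded Python baseline described in the thread is presumably something along these lines; this is a sketch, not the actual script, and the file path in the usage comment is hypothetical:

```python
from collections import Counter

def word_count(lines):
    """Single-threaded word count over an iterable of text lines.

    Streams through the input line by line, so memory use stays
    proportional to the number of distinct words, not the file size.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Usage against a large file (path is hypothetical):
# with open("/data/input-2gb.txt") as f:
#     counts = word_count(f)
```

That such a simple baseline finishes in ~7 minutes while the cluster takes far longer is consistent with the scalability-cost argument raised earlier in the thread: for a trivial computation, distributed overheads can dominate unless the setup is tuned.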