Re: low performance in running queries

Zhenghua Gao Thu, 31 Oct 2019 22:04:16 -0700

2019-10-30 15:59:52,122 INFO
org.apache.flink.runtime.taskmanager.Task                     - Split
Reader: Custom File Source -> Flat Map (1/1)
(6a17c410c3e36f524bb774d2dffed4a4) switched from DEPLOYING to RUNNING.


2019-10-30 17:45:10,943 INFO
org.apache.flink.runtime.taskmanager.Task                     - Split
Reader: Custom File Source -> Flat Map (1/1)
(6a17c410c3e36f524bb774d2dffed4a4) switched from RUNNING to FINISHED.


It's surprise that the source task uses 95 mins to read a 2G file.

Could you give me your code snippets and some sample lines of the 2G file?

I will try to reproduce your scenario and dig the root causes.


*Best Regards,*
*Zhenghua Gao*


On Thu, Oct 31, 2019 at 9:05 PM Habib Mostafaei <ha...@inet.tu-berlin.de>
wrote:

> I enclosed all logs from the run and for this run I used parallelism one.
> However, for other runs I checked and found that all parallel workers were
> working properly. Is there a simple way to get profiling information in
> Flink?
>
> Best,
>
> Habib
> On 10/31/2019 2:54 AM, Zhenghua Gao wrote:
>
> I think more runtime information would help figure out where the problem
>  is.
> 1) how many parallelisms actually working
> 2) the metrics for each operator
> 3) the jvm profiling information, etc
>
> *Best Regards,*
> *Zhenghua Gao*
>
>
> On Wed, Oct 30, 2019 at 8:25 PM Habib Mostafaei <ha...@inet.tu-berlin.de>
> wrote:
>
>> Thanks Gao for the reply. I used the parallelism parameter with different
>> values like 6 and 8 but still the execution time is not comparable with a
>> single threaded python script. What would be the reasonable value for the
>> parallelism?
>>
>> Best,
>>
>> Habib
>> On 10/30/2019 1:17 PM, Zhenghua Gao wrote:
>>
>> The reason might be the parallelism of your task is only 1, that's too
>> low.
>> See [1] to specify proper parallelism  for your job, and the execution
>> time should be reduced significantly.
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/parallel.html
>>
>> *Best Regards,*
>> *Zhenghua Gao*
>>
>>
>> On Tue, Oct 29, 2019 at 9:27 PM Habib Mostafaei <ha...@inet.tu-berlin.de>
>> wrote:
>>
>>> Hi all,
>>>
>>> I am running Flink on a standalone cluster and getting very long
>>> execution time for the streaming queries like WordCount for a fixed text
>>> file. My VM runs on a Debian 10 with 16 cpu cores and 32GB of RAM. I
>>> have a text file with size of 2GB. When I run the Flink on a standalone
>>> cluster, i.e., one JobManager and one taskManager with 25GB of heapsize,
>>> it took around two hours to finish counting this file while a simple
>>> python script can do it in around 7 minutes. Just wondering what is
>>> wrong with my setup. I ran the experiments on a cluster with six
>>> taskManagers, but I still get very long execution time like 25 minutes
>>> or so. I tried to increase the JVM heap size to have lower execution
>>> time but it did not help. I attached the log file and the Flink
>>> configuration file to this email.
>>>
>>> Best,
>>>
>>> Habib
>>>
>>>
>

Re: low performance in running queries

Reply via email to