[ https://issues.apache.org/jira/browse/FLINK-9597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

swy updated FLINK-9597:
-----------------------
    Attachment: sample.png

> Flink fail to scale!
> --------------------
>
>                 Key: FLINK-9597
>                 URL: https://issues.apache.org/jira/browse/FLINK-9597
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.5.0
>            Reporter: swy
>            Priority: Major
>         Attachments: JM.png, TM.png, flink_app_parser_git.zip, sample.png, 
> scaleNotWork.png
>
>
> Hi, we found that our Flink application, which has simple logic and uses a
> process function, does not scale beyond a parallelism of 8, even with
> sufficient resources. Below are the results, which are capped at ~250k TPS.
> No matter how we tune the parallelism of the operators it does not scale,
> and the same holds when we increase the source parallelism.
> Please refer to "scaleNotWork.png":
> 1. fixed source parallelism 4, other operators parallelism 8
> 2. fixed source parallelism 4, other operators parallelism 16
> 3. fixed source parallelism 4, other operators parallelism 32
> 4. fixed source parallelism 6, other operators parallelism 8
> 5. fixed source parallelism 6, other operators parallelism 16
> 6. fixed source parallelism 6, other operators parallelism 32
> 7. fixed source parallelism 6, other operators parallelism 64; performance is
> worse than with parallelism 32.
> Sample source code is attached (flink_app_parser_git.zip). It is a simple
> program that parses JSON records into objects and passes them to a Flink
> process function with empty logic. RocksDB is in use, and the source data is
> generated by the program itself. This can be reproduced easily; a minimal
> sketch of the job shape follows.
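>
> The sketch below only illustrates the job shape described above, assuming the
> attached program is structured roughly like this. The operator names
> JsonRecTranslator and AggregationDuration and the parallelism values are taken
> from the sample parameters further down; the class bodies, the Rec type and
> the generated JSON are hypothetical stand-ins, not the actual attached code.
>
> import org.apache.flink.api.common.functions.MapFunction;
> import org.apache.flink.api.java.functions.KeySelector;
> import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
> import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
> import org.apache.flink.streaming.api.functions.sink.DiscardingSink;
> import org.apache.flink.streaming.api.functions.source.SourceFunction;
> import org.apache.flink.util.Collector;
>
> public class ScaleReproSketch {
>
>     // Simplified stand-in for the parsed JSON record.
>     public static class Rec {
>         public String key;
>         public long value;
>     }
>
>     // Self-generating source: the job feeds itself, as described above.
>     public static class JsonSource implements SourceFunction<String> {
>         private final long rows;
>         private volatile boolean running = true;
>         public JsonSource(long rows) { this.rows = rows; }
>         public void run(SourceContext<String> ctx) {
>             for (long i = 0; i < rows && running; i++) {
>                 ctx.collect("{\"key\":\"k" + (i % 1000) + "\",\"value\":" + i + "}");
>             }
>         }
>         public void cancel() { running = false; }
>     }
>
>     // Map operator: parse the JSON record into an object (real parsing elided).
>     public static class JsonRecTranslator implements MapFunction<String, Rec> {
>         public Rec map(String json) {
>             Rec r = new Rec();
>             r.key = json; // placeholder; the attached code parses the JSON here
>             return r;
>         }
>     }
>
>     // Keyed process function that registers a timer and is otherwise empty.
>     public static class AggregationDuration extends KeyedProcessFunction<String, Rec, Rec> {
>         private final long aggrIntervalMs;
>         public AggregationDuration(long aggrIntervalMs) { this.aggrIntervalMs = aggrIntervalMs; }
>         public void processElement(Rec value, Context ctx, Collector<Rec> out) {
>             ctx.timerService().registerProcessingTimeTimer(
>                     ctx.timerService().currentProcessingTime() + aggrIntervalMs);
>         }
>         public void onTimer(long timestamp, OnTimerContext ctx, Collector<Rec> out) {
>             // empty logic, as described above
>         }
>     }
>
>     public static void main(String[] args) throws Exception {
>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>         env.setStateBackend(new RocksDBStateBackend("file:///tmp/rocksdb")); // RocksDB in use
>
>         env.addSource(new JsonSource(7500000L)).setParallelism(4)           // psrc
>            .map(new JsonRecTranslator()).setParallelism(32)                 // pJ2R
>            .keyBy(new KeySelector<Rec, String>() {
>                public String getKey(Rec r) { return r.key; }
>            })
>            .process(new AggregationDuration(6000000L)).setParallelism(72)   // pAggr
>            .addSink(new DiscardingSink<Rec>());
>
>         env.execute("scale repro sketch");
>     }
> }
>
> The per-operator setParallelism() calls correspond to the psrc/pJ2R/pAggr
> values from the sample parameters listed further down.
>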
> We chose Flink because of its scalability, but that is not what we are seeing
> now. We would appreciate any help, as this is impacting our projects. Thank you.
> To run the program, sample parameters are (a hypothetical parsing sketch
> follows the parameter list below):
> "aggrinterval=6000000 loop=7500000 statsd=1 psrc=4 pJ2R=32 pAggr=72
> URL=do36.comptel.com:8127"
> * aggrinterval: time in ms for the timer to trigger
> * loop: how many rows of data to feed
> * statsd: send results to statsd
> * psrc: source parallelism
> * pJ2R: parallelism of the map operator (JsonRecTranslator)
> * pAggr: parallelism of the process+timer operator (AggregationDuration)
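>
> As a rough, hypothetical illustration of how key=value parameters like the
> sample line above might be read and mapped onto the operators (the real
> argument handling is in the attached flink_app_parser_git.zip):
>
> import java.util.HashMap;
> import java.util.Map;
>
> public class ParamSketch {
>
>     // Split "name=value" arguments into a map, e.g. "psrc=4" -> {psrc: "4"}.
>     static Map<String, String> parse(String[] args) {
>         Map<String, String> p = new HashMap<>();
>         for (String arg : args) {
>             String[] kv = arg.split("=", 2);
>             if (kv.length == 2) {
>                 p.put(kv[0], kv[1]);
>             }
>         }
>         return p;
>     }
>
>     public static void main(String[] args) {
>         Map<String, String> p = parse(args);
>         long aggrIntervalMs = Long.parseLong(p.get("aggrinterval")); // timer interval in ms
>         long rows = Long.parseLong(p.get("loop"));                   // rows of data to feed
>         int psrc = Integer.parseInt(p.get("psrc"));                  // source parallelism
>         int pJ2R = Integer.parseInt(p.get("pJ2R"));                  // JsonRecTranslator parallelism
>         int pAggr = Integer.parseInt(p.get("pAggr"));                // AggregationDuration parallelism
>         System.out.printf("src=%d map=%d process=%d interval=%dms rows=%d%n",
>                 psrc, pJ2R, pAggr, aggrIntervalMs, rows);
>     }
> }
>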
> We are running on VMware, with 5 Task Managers, each with 32 slots.
> lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                32
> On-line CPU(s) list:   0-31
> Thread(s) per core:    1
> Core(s) per socket:    1
> Socket(s):             32
> NUMA node(s):          1
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 63
> Model name:            Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
> Stepping:              2
> CPU MHz:               2593.993
> BogoMIPS:              5187.98
> Hypervisor vendor:     VMware
> Virtualization type:   full
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              20480K
> NUMA node0 CPU(s):     0-31
> Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
> mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp 
> lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc 
> aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe 
> popcnt aes xsave avx f16c rdrand hypervisor lahf_lm epb fsgsbase smep dtherm 
> ida arat pln pts
> free -g:
>               total        used        free      shared  buff/cache   available
> Mem:             98          24          72           0           1          72
> Swap:             3           0           3
> Please refer to TM.png and JM.png for further details.
> The test was run without any checkpointing enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
