[jira] [Comment Edited] (FLINK-29617) Cost too much time to start SourceCoordinator of hdfsFileSource when start JobMaster

luoyuxia (Jira) Thu, 13 Oct 2022 02:07:08 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-29617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17616855#comment-17616855
 ]


luoyuxia edited comment on FLINK-29617 at 10/13/22 9:06 AM:
------------------------------------------------------------

[~dangshazi] Thanks for raising this and for the detailed explanation. I'd 
appreciate it if you could take the ticket. If you don't have time, I can 
take it.

I'm fine with both suggestions, but I prefer suggestion 2, since suggestion 1 
would introduce a new option that users may hardly know about.

One question: have you tried these suggestions? If so, how much improvement 
did each of them bring?

Btw, the uploaded images cannot be displayed. Could you please upload them again?


was (Author: luoyuxia):
[~dangshazi] Thanks for raising this and for the detailed explanation. I'd 
appreciate it if you could take the ticket. 

I'm fine with both suggestions, but I prefer suggestion 2, since suggestion 1 
would introduce a new option that users may hardly know about.

One question: have you tried these suggestions? If so, how much improvement 
did each of them bring?

Btw, the uploaded images cannot be displayed. Could you please upload them again?

> Cost too much time to start SourceCoordinator of hdfsFileSource when start 
> JobMaster
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-29617
>                 URL: https://issues.apache.org/jira/browse/FLINK-29617
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem, Runtime / Coordination
>    Affects Versions: 1.15.2
>            Reporter: LI Mingkun
>            Priority: Major
>              Labels: coordination, file-system
>
> h1. Scenario:
> Our user uses Flink batch to compact the small files of one day. Flink 
> version: 1.15
> He split the pipeline into 24 parts, one per hour, so there are 24 sources.
>  
> I find that it takes too long to start the SourceCoordinator of the 
> hdfsFileSource when the JobMaster starts,
>  
>  as follows:
>  
> !https://mail.google.com/mail/u/0?ui=2&ik=488d9ac3dd&attid=0.1&permmsgid=msg-a:r-3013789195315215531&th=183cb292e567fd9f&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9SVAoAslMUGQdVQJ_ccmEf4LxhaONYKJvS_V8nvijvT3JXw_VlyRBAEE9EQhTtWdYPa4TLCO5rxjXGrTDK2_PGHX4RZDPTQTJ0LwKXAUr4BYlMhYZsjcrY9eo&disp=emb&realattid=ii_l95bh7qy0|width=542,height=260!
>  
> h1. Root Cause:
> I found the root cause after checking: 
>  # AbstractFileSource calls enumerateSplits in createEnumerator
>  # NonSplittingRecursiveEnumerator needs to fetch the block locations of 
> every file block, which is a heavy I/O operation
> !https://mail.google.com/mail/u/0?ui=2&ik=488d9ac3dd&attid=0.3&permmsgid=msg-a:r-3013789195315215531&th=183cb292e567fd9f&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ8AoT071eCNMb_q3uJtcbrUmZnYbg3ucnDelMlRRPn7WLlXOBGj650srQk9vhqKyJEANvpOWoxHuH6jNHt7g6go8JkeRUZKc81yqT0yzzz7tbBciTe-YnRVQ7w&disp=emb&realattid=ii_l95bp1832|width=542,height=456!
>  
> !https://mail.google.com/mail/u/0?ui=2&ik=488d9ac3dd&attid=0.2&permmsgid=msg-a:r-3013789195315215531&th=183cb292e567fd9f&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9phsX1nauTsx3xWje_YJM4uUaOLXKHcXKsm7WJquPQQGC7bQTni3OhQB5HtGYVOvrD-3Kbp9LURfUj6OiIUgsZU1AImSL0vj27cnDcf7HpVpLpaqdADtpoABU&disp=emb&realattid=ii_l95bjh1g1|width=526,height=542!
>  
> h1. Suggestion
>  # Add a FileSource option to disable the location fetcher
>  # Move the location fetcher into the IOExecutor
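
The root cause and the two suggestions quoted above can be illustrated with a
minimal, self-contained sketch. It does not use Flink or HDFS. Every name in
it (SplitEnumerationSketch, Split, enumerateEagerly, enumerateAsync,
enumerateWithoutLocations, slowLookup) is hypothetical, and the 20 ms lookup
delay is an arbitrary stand-in for a remote block-location call. It only shows
the general idea: an eager enumeration blocks the caller for every lookup,
while handing the same work to an I/O executor lets the caller return
immediately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical sketch; none of these names are Flink APIs.
public class SplitEnumerationSketch {

    /** Stand-in for a file split: a path plus the hosts that store its blocks. */
    public static final class Split {
        public final String path;
        public final List<String> hosts;

        public Split(String path, List<String> hosts) {
            this.path = path;
            this.hosts = hosts;
        }
    }

    /** Eager variant: every location lookup runs on the calling thread. */
    public static List<Split> enumerateEagerly(
            List<String> paths, Function<String, List<String>> locationLookup) {
        List<Split> splits = new ArrayList<>();
        for (String path : paths) {
            splits.add(new Split(path, locationLookup.apply(path)));
        }
        return splits;
    }

    /** In the spirit of suggestion 2: run the same enumeration on an I/O executor. */
    public static CompletableFuture<List<Split>> enumerateAsync(
            List<String> paths,
            Function<String, List<String>> locationLookup,
            ExecutorService ioExecutor) {
        return CompletableFuture.supplyAsync(
                () -> enumerateEagerly(paths, locationLookup), ioExecutor);
    }

    /** In the spirit of suggestion 1: skip location lookups entirely. */
    public static List<Split> enumerateWithoutLocations(List<String> paths) {
        return enumerateEagerly(paths, path -> List.of());
    }

    /** Simulated slow lookup; the 20 ms delay is an arbitrary assumption. */
    static List<String> slowLookup(String path) {
        try {
            Thread.sleep(20);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return List.of("datanode-1");
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        for (int hour = 0; hour < 24; hour++) {
            paths.add(String.format("/data/hour=%02d/part-0", hour));
        }
        ExecutorService ioExecutor = Executors.newFixedThreadPool(4);
        try {
            long start = System.nanoTime();
            enumerateEagerly(paths, SplitEnumerationSketch::slowLookup);
            long eagerMs = (System.nanoTime() - start) / 1_000_000;

            start = System.nanoTime();
            CompletableFuture<List<Split>> pending =
                    enumerateAsync(paths, SplitEnumerationSketch::slowLookup, ioExecutor);
            long callerBlockedMs = (System.nanoTime() - start) / 1_000_000;

            System.out.println("eager: caller blocked for " + eagerMs + " ms");
            System.out.println("async: caller blocked for " + callerBlockedMs
                    + " ms, splits=" + pending.join().size());
        } finally {
            ioExecutor.shutdown();
        }
    }
}
```

The async variant does not make the lookups themselves cheaper; it only moves
them off the thread that starts the coordinator. Whether that matches what an
actual fix in Flink would look like is an open question for the ticket.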



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
