The InetAddress caches the result of getCanonicalHostName(), so it is not a problem to call it twice.

On 10/15/2020 1:57 PM, Till Rohrmann wrote:
Hi Weike,

thanks for getting back to us with your findings. Looking at the `TaskManagerLocation`, we are actually calling `InetAddress.getCanonicalHostName` twice for every creation of a `TaskManagerLocation` instance. This does not look right.

I think it should be fine to make the lookup configurable. Moreover, one could think about only doing a lazy lookup if the canonical hostname is really needed (as far as I can see, it is only really needed for input split assignments and for the LocationPreferenceSlotSelectionStrategy to calculate how many TMs run on the same machine).
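
A minimal sketch of what such a lazy, cached lookup could look like (illustrative only, not the actual TaskManagerLocation code; the class and field names are made up):

    // Illustrative sketch only -- not the actual TaskManagerLocation implementation,
    // and not thread-safe; it just shows the lazy, memoized lookup idea.
    import java.net.InetAddress;

    final class LazyHostName {
        private final InetAddress inetAddress;
        private String canonicalHostName; // resolved on first access, then reused

        LazyHostName(InetAddress inetAddress) {
            this.inetAddress = inetAddress;
        }

        String get() {
            if (canonicalHostName == null) {
                // The reverse DNS lookup happens at most once per instance,
                // and only if the canonical hostname is actually requested.
                canonicalHostName = inetAddress.getCanonicalHostName();
            }
            return canonicalHostName;
        }
    }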

Do you want to fix this issue?

Cheers,
Till

On Thu, Oct 15, 2020 at 11:38 AM DONG, Weike <kyled...@connect.hku.hk> wrote:

    Hi Till and community,

    By the way, initially I resolved the IPs several times, and the
    results came back rather quickly (in less than 1 ms, possibly due
    to the DNS cache on the server), so I thought it might not be a
    DNS issue.

    However, after debugging and logging, we found that the lookup
    time exhibits high variance, i.e. it normally completes quickly,
    but occasionally a slow lookup blocks the thread. So an unstable
    DNS server might have a great impact on the performance of Flink
    job startup.
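
    For reference, a small standalone check like the sketch below
    (illustration only; the class name is made up) can be used to time
    individual reverse lookups and observe that variance:

        // Illustration only: times a single reverse DNS lookup for a given IP.
        import java.net.InetAddress;

        public class ReverseLookupTiming {
            public static void main(String[] args) throws Exception {
                // e.g. java ReverseLookupTiming 9.166.0.118
                InetAddress address = InetAddress.getByName(args[0]);
                long start = System.nanoTime();
                String hostName = address.getCanonicalHostName();
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("Resolved " + args[0] + " -> " + hostName
                        + " in " + elapsedMs + " ms");
            }
        }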

    Best,
    Weike

    On Thu, Oct 15, 2020 at 5:19 PM DONG, Weike
    <kyled...@connect.hku.hk> wrote:

        Hi Till and community,

        Increasing `kubernetes.jobmanager.cpu` in the configuration
        alleviates this issue but does not make it disappear.

        After adding DEBUG logs to the internals of flink-runtime,
        we have found that the culprit is

            inetAddress.getCanonicalHostName()

        in org.apache.flink.runtime.taskmanager.TaskManagerLocation#getHostName
        and org.apache.flink.runtime.taskmanager.TaskManagerLocation#getFqdnHostName,
        which can take ~6 seconds to complete, so the Akka
        dispatcher(s) are severely blocked by it.

        After commenting out these two methods, the issue seems to be
        solved immediately, so I wonder whether Flink could provide a
        configuration parameter to turn off the DNS reverse lookup,
        as Flink jobs seem to run happily without it.
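
        To make the idea concrete, a minimal sketch of such an opt-out
        could look like the following (the flag and class names are
        purely hypothetical, not an existing Flink configuration option):

            // Hypothetical sketch; the flag does not correspond to a real Flink configuration key.
            import java.net.InetAddress;

            final class HostNameResolver {
                private final boolean reverseLookupEnabled;

                HostNameResolver(boolean reverseLookupEnabled) {
                    this.reverseLookupEnabled = reverseLookupEnabled;
                }

                String hostNameFor(InetAddress address) {
                    if (!reverseLookupEnabled) {
                        // Skip the potentially slow reverse DNS lookup and
                        // fall back to the plain IP address string.
                        return address.getHostAddress();
                    }
                    return address.getCanonicalHostName();
                }
            }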

        Sincerely,
        Weike


        On Tue, Oct 13, 2020 at 6:52 PM Till Rohrmann
        <trohrm...@apache.org> wrote:

            Hi Weike,

            could you try setting kubernetes.jobmanager.cpu: 4 in your
            flink-conf.yaml? I fear that a single CPU is too low for
            the JobManager component.

            Cheers,
            Till

            On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann
            <trohrm...@apache.org> wrote:

                Hi Weike,

                thanks for posting the logs. I will take a look at
                them. My suspicion would be that there is some
                operation blocking the JobMaster's main thread which
                causes the registrations from the TMs to time out.
                Maybe the logs allow me to validate/falsify this
                suspicion.

                Cheers,
                Till

                On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike
                <kyled...@connect.hku.hk> wrote:

                    Hi community,

                    I have uploaded the log files of the JobManager
                    and TaskManager-1-1 (one of the 50 TaskManagers)
                    with DEBUG log level and the default Flink
                    configuration, and they clearly show that the
                    TaskManager failed to register with the
                    JobManager after 10 attempts.

                    Here is the link:

                    JobManager:
                    https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce

                    TaskManager-1-1:
                    https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe


                    Thanks : )

                    Best regards,
                    Weike


                    On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike
                    <kyled...@connect.hku.hk> wrote:

                        Hi community,

                        Recently we have noticed a strange behavior
                        for Flink jobs in Kubernetes per-job mode:
                        as the parallelism increases, the time it
                        takes for the TaskManagers to register with
                        the JobManager becomes abnormally long (for
                        a task with a parallelism of 50, registration
                        could take 60 ~ 120 seconds or even longer),
                        and usually more than 10 attempts are needed
                        to complete the registration.

                        Because of this, we could not submit a job
                        requiring more than 20 slots with the default
                        configuration, as the TaskManager would say:

                            Registration at JobManager (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2) attempt 9 timed out after 25600 ms

                            Free slot with allocation id 60d5277e138a94fb73fc6691557001e0 because: The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
                            Free slot TaskSlot(index:0, state:ALLOCATED, resource profile: ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)}, allocationId: 60d5277e138a94fb73fc6691557001e0, jobId: 493cd86e389ccc8f2887e1222903b5ce).
                            java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has timed out.

                        In order to cope with this issue, we have to
                        change the below configuration parameters:


                            # Prevent "Could not allocate the required
                            slot within slot request timeout. Please
                            make sure that the cluster has enough
                            resources. Stopping the JobMaster for job"
slot.request.timeout: 500000
                            # Increase max timeout in a single attempt
                            cluster.registration.max-timeout: 300000
                            # Prevent "free slot (TaskSlot)"
                            akka.ask.timeout: 10 min
                            # Prevent "Heartbeat of TaskManager timed
                            out."
                            heartbeat.timeout: 500000


                        However, we acknowledge that this is only a
                        temporary, dirty fix, which is not what we
                        want. During the TaskManagers' registration
                        with the JobManager, lots of warning messages
                        appear in the logs:

                            No hostname could be resolved for the IP address 9.166.0.118, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.


                        Initially we thought this was probably the
                        cause (the DNS reverse lookup might take a
                        long time); however, we later found that the
                        reverse lookup took less than 1 ms, so this
                        is probably not the reason.

                        Also, we have checked the GC logs of both the
                        TaskManagers and the JobManager, and they
                        seem perfectly normal, without any signs of
                        pauses. The heartbeats are also processed
                        normally according to the logs.

                        Moreover, the TaskManagers register quickly
                        with the ResourceManager but extremely slowly
                        with the JobManager, so this is not due to a
                        slow network connection.

                        So we wonder: what could be the cause of the
                        slow registration between the JobManager and
                        the TaskManager(s)? There are no other warning
                        or error messages in the logs (DEBUG level)
                        besides the "No hostname could be resolved"
                        messages, which is quite weird.

                        Thanks for reading, and we hope to get some
                        insights into this issue : )

                        Sincerely,
                        Weike

