The link doesn't work, i.e. I'm redirected to a login page. It would also be
good to include the Flink logs and to make them accessible to everyone.
That way, others could share their perspective as well...

On Thu, Aug 26, 2021 at 5:40 PM Shah, Siddharth [Engineering] <
siddharth.x.s...@gs.com> wrote:

> Hi Matthias,
>
>
>
> Thank you for responding and taking time to look at the issue.
>
>
>
> Uploaded the YARN logs here:
> https://lockbox.gs.com/lockbox/folders/963b0f29-85ad-4580-b420-8c66d9c07a84/
> and have also requested read permissions for you. Please let us know if
> you’re not able to see the files.
>
>
>
>
>
> From: Matthias Pohl <matth...@ververica.com>
> Sent: Thursday, August 26, 2021 9:47 AM
> To: Shah, Siddharth [Engineering] <siddharth.x.s...@ny.email.gs.com>
> Cc: user@flink.apache.org; Hailu, Andreas [Engineering] <andreas.ha...@ny.email.gs.com>
> Subject: Re: hdfs lease issues on flink retry
>
>
>
> Hi Siddharth,
>
> thanks for reaching out to the community. This might be a bug. Could you
> share your Flink and YARN logs? This way we could get a better
> understanding of what's going on.
>
>
>
> Best,
> Matthias
>
>
>
> On Tue, Aug 24, 2021 at 10:19 PM Shah, Siddharth [Engineering] <
> siddharth.x.s...@gs.com> wrote:
>
> Hi Team,
>
>
>
> We are seeing transient failures in jobs that mostly require higher
> resources and use Flink RestartStrategies [1]. Upon checking the YARN logs,
> we have observed HDFS lease issues when the Flink retry happens. The job
> originally fails on the first try with PartitionNotFoundException or
> NoResourceAvailableException, but on retry the YARN logs suggest that the
> lease for the temp sink directory has not yet been released by the node
> from the previous try.
>
>
>
> Initial failure log message:
>
>
>
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate enough slots to run the job. Please make sure that the
> cluster has enough resources.
>         at org.apache.flink.runtime.executiongraph.Execution.lambda$scheduleForExecution$0(Execution.java:461)
>         at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>         at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>         at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>         at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>         at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:190)
>         at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>         at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>         at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>         at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>
>
>
>
>
> Retry failure log message:
>
>
>
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.FileAlreadyExistsException):
> /user/p2epda/lake/delp_prod/PROD/APPROVED/data/TECHRISK_SENTINEL/INFORMATION_REPORT/4377/temp/data/_temporary/0/_temporary/attempt__0000_r_000003_0/partMapper-r-00003.snappy.parquet
> for client 10.51.63.226 already exists
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2815)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2702)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2586)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:736)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:409)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>
>
>
>
>
>
>
> I could verify that it's the same nodes from the previous try that own the
> lease, and I checked this for multiple jobs by matching IP addresses.
> Ideally, we want the internal retry to succeed, since there will be
> thousands of jobs running at a time and retrying them manually is hard.
>
>
>
> This is our current restart config:
>
> executionEnv.setRestartStrategy(
>         RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));
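>
> One thing we have considered trying (just a sketch on our side; the 90s
> value is a guess, and we are not certain lease timing is the root cause) is
> pushing the restart delay past HDFS's default soft lease limit of 60
> seconds, so that the previous attempt's lease can expire before the next
> attempt recreates the temp files:
>
> // Sketch only: lengthen the fixed delay beyond the default 60s HDFS soft
> // lease limit. The 90s delay is an assumption, not a tuned number.
> executionEnv.setRestartStrategy(
>         RestartStrategies.fixedDelayRestart(3, Time.of(90, TimeUnit.SECONDS)));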
>
>
>
> Is it possible to resolve the leases before a retry? Or is it possible to
> use a different sink directory for every retry (incrementing an attempt id
> somewhere), so that we avoid the lease issues entirely? Or do you have any
> other suggestions for resolving this? A rough sketch of the first option
> follows below.
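>
> For concreteness, here is our own rough, untested sketch of the first
> option (nothing Flink provides, as far as we know): before a retry, walk
> the temp sink directory and ask the NameNode to recover any outstanding
> leases via DistributedFileSystem#recoverLease. The class name and the
> non-recursive listing are our assumptions:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hdfs.DistributedFileSystem;
>
> public class TempSinkLeaseRecovery {
>
>     /** Best-effort lease recovery for every file directly under dir. */
>     public static void recoverLeases(Path dir, Configuration conf) throws Exception {
>         FileSystem fs = dir.getFileSystem(conf);
>         if (!(fs instanceof DistributedFileSystem)) {
>             return; // lease recovery is HDFS-specific
>         }
>         DistributedFileSystem dfs = (DistributedFileSystem) fs;
>         for (FileStatus status : dfs.listStatus(dir)) {
>             if (status.isFile()) {
>                 // Returns true if the file is already closed; otherwise the
>                 // NameNode schedules lease recovery asynchronously.
>                 dfs.recoverLease(status.getPath());
>             }
>         }
>     }
> }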
>
>
>
>
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/execution/task_failure_recovery/#fixed-delay-restart-strategy
>
>
>
>
>
> Thanks,
>
> Siddharth
>
>
>
>
