Just for documentation purposes: I created FLINK-24147 [1] to cover this
issue.

[1] https://issues.apache.org/jira/browse/FLINK-24147

On Thu, Aug 26, 2021 at 6:14 PM Matthias Pohl <matth...@ververica.com>
wrote:

> I see - I should have checked my mailbox before answering. I received the
> email and was able to login.
>
> On Thu, Aug 26, 2021 at 6:12 PM Matthias Pohl <matth...@ververica.com>
> wrote:
>
>> The link doesn't work, i.e. I'm redirected to a login page. It would also
>> be good to include the Flink logs and make them accessible to everyone.
>> That way, others could share their perspective as well...
>>
>> On Thu, Aug 26, 2021 at 5:40 PM Shah, Siddharth [Engineering] <
>> siddharth.x.s...@gs.com> wrote:
>>
>>> Hi Matthias,
>>>
>>>
>>>
>>> Thank you for responding and taking time to look at the issue.
>>>
>>>
>>>
>>> I've uploaded the YARN logs here:
>>> https://lockbox.gs.com/lockbox/folders/963b0f29-85ad-4580-b420-8c66d9c07a84/
>>> and have also requested read permissions for you. Please let us know if
>>> you’re not able to see the files.
>>>
>>>
>>>
>>>
>>>
>>> From: Matthias Pohl <matth...@ververica.com>
>>> Sent: Thursday, August 26, 2021 9:47 AM
>>> To: Shah, Siddharth [Engineering] <siddharth.x.s...@ny.email.gs.com>
>>> Cc: user@flink.apache.org; Hailu, Andreas [Engineering] <andreas.ha...@ny.email.gs.com>
>>> Subject: Re: hdfs lease issues on flink retry
>>>
>>>
>>>
>>> Hi Siddharth,
>>>
>>> thanks for reaching out to the community. This might be a bug. Could you
>>> share your Flink and YARN logs? This way we could get a better
>>> understanding of what's going on.
>>>
>>>
>>>
>>> Best,
>>> Matthias
>>>
>>>
>>>
>>> On Tue, Aug 24, 2021 at 10:19 PM Shah, Siddharth [Engineering] <
>>> siddharth.x.s...@gs.com> wrote:
>>>
>>> Hi Team,
>>>
>>>
>>>
>>> We are seeing transient failures, mostly in jobs that require higher
>>> resources and use Flink RestartStrategies [1]. Upon checking the YARN logs,
>>> we have observed HDFS lease issues when a Flink retry happens. The job
>>> originally fails on the first attempt with a PartitionNotFoundException or
>>> NoResourceAvailableException, but on the retry the YARN logs indicate that
>>> the lease on the temp sink directory has not yet been released by the node
>>> from the previous attempt.
>>>
>>>
>>>
>>> Initial Failure log message:
>>>
>>>
>>>
>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>> Could not allocate enough slots to run the job. Please make sure that the
>>> cluster has enough resources.
>>>     at org.apache.flink.runtime.executiongraph.Execution.lambda$scheduleForExecution$0(Execution.java:461)
>>>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>>     at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:190)
>>>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>>
>>>
>>>
>>>
>>>
>>> Retry failure log message:
>>>
>>>
>>>
>>> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.FileAlreadyExistsException):
>>> /user/p2epda/lake/delp_prod/PROD/APPROVED/data/TECHRISK_SENTINEL/INFORMATION_REPORT/4377/temp/data/_temporary/0/_temporary/attempt__0000_r_000003_0/partMapper-r-00003.snappy.parquet
>>> for client 10.51.63.226 already exists
>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2815)
>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2702)
>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2586)
>>>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:736)
>>>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:409)
>>>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> I could verify that it is the same nodes from the previous attempt that
>>> still own the lease, and I checked this for multiple jobs by matching IP
>>> addresses. Ideally, we want the internal retry to succeed, since there will
>>> be thousands of jobs running at a time and it is hard to retry them manually.
>>>
>>>
>>>
>>> This is our current restart config:
>>>
>>> executionEnv.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));
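>>>
>>> For completeness, the equivalent settings in flink-conf.yaml form (assuming
>>> nothing else overrides the restart strategy) would be:
>>>
>>> restart-strategy: fixed-delay
>>> restart-strategy.fixed-delay.attempts: 3
>>> restart-strategy.fixed-delay.delay: 10 s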
>>>
>>>
>>>
>>> Is it possible to resolve the leases before a retry? Or is it possible to
>>> use a different sink directory (e.g. by incrementing an attempt id somewhere)
>>> for every retry, so that we avoid the lease issue altogether? Or do you have
>>> any other suggestion for resolving this?
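>>>
>>> To illustrate the second idea, a rough sketch of what we have in mind on the
>>> submitting side (the attempt property, placeholder path, and layout below are
>>> made up by us, not an existing Flink mechanism):
>>>
>>> // hypothetical: our own retry wrapper passes the attempt number to the job
>>> int attempt = Integer.parseInt(System.getProperty("job.attempt", "0"));
>>> String baseOutputDir = "hdfs:///user/<team>/<dataset>";  // placeholder base path
>>> // each retry writes under its own temp directory, so it never collides with
>>> // a lease still held by a node from the previous attempt; the successful
>>> // attempt's output would be promoted to the final location afterwards
>>> String sinkDir = baseOutputDir + "/attempt-" + attempt + "/temp/data";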
>>>
>>>
>>>
>>>
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/execution/task_failure_recovery/#fixed-delay-restart-strategy
>>>
>>>
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Siddharth
>>>
>>>
>>>
>>>
>>> ------------------------------
>>>
>>>
>>> Your Personal Data: We may collect and process information about you
>>> that may be subject to data protection laws. For more information about how
>>> we use and disclose your personal data, how we protect your information,
>>> our legal basis to use your information, your rights and who you can
>>> contact, please refer to: www.gs.com/privacy-notices
>>>
>>>
>>>
>>
>>
>
> --
>
> Matthias Pohl | Engineer
>
> Follow us @VervericaData Ververica <https://www.ververica.com/>
>
> --
>
> Join Flink Forward <https://flink-forward.org/> - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton
> Wehner
>
