Ticket FLINK-17714 has been created to track this requirement.

Thanks,
Zhu Zhu

Till Rohrmann wrote on Wed, May 13, 2020 at 8:30 PM:

> Yes, you are right Zhu Zhu. Extending
> the RestartBackoffTimeStrategyFactoryLoader to also load custom
> RestartBackoffTimeStrategies sounds like a good improvement for the future.
>
> @Ken Krugler, the old RestartStrategy
> interface did not provide the cause of the failure, unfortunately.
>
> Cheers,
> Till
>
> On Wed, May 13, 2020 at 7:55 AM Zhu Zhu wrote:
>
>> Hi Ken,
>>
>> Custom restart-strategy was an experimental feature and has been deprecated
>> since 1.10. [1]
>> That's why you cannot find any documentation for it.
>>
>> The old RestartStrategy was deprecated and replaced by
>> RestartBackoffTimeStrategy in 1.10
>> (unless you are using the legacy scheduler, which was also deprecated).
>> The new restart strategy, RestartBackoffTimeStrategy, will be able to
>> know the exact failure cause.
>> However, the new restart strategy does not support customization at the
>> moment.
>> Your requirement sounds reasonable to me and I think a custom (new) restart
>> strategy can be something to support later.
>>
>> @Till Rohrmann @Gary Yao what
>> do you think?
>>
>> [1]
>> https://lists.apache.org/thread.html/6ed95eb6a91168dba09901e158bc1b6f4b08f1e176db4641f79de765%40%3Cdev.flink.apache.org%3E
>>
>> Thanks,
>> Zhu Zhu
>>
>> Ken Krugler wrote on Wed, May 13, 2020 at 7:34 AM:
>>
>>> Hi Till,
>>>
>>> Sorry, missed the key question…in the RestartStrategy.restart() method,
>>> I don’t see any good way to get at the underlying exception.
>>>
>>> I can cast the RestartCallback to an ExecutionGraphRestartCallback, but
>>> I still need access to the private execGraph to be able to get at the
>>> failure info. Is there some other way in the restart handler to get at this?
>>>
>>> And yes, I meant to note you’d mentioned the required static method in
>>> your email, I was asking about documentation for it.
>>>
>>> Thanks,
>>>
>>> -- Ken
>>>
>>> ===============================================================
>>> Sorry to resurface an ancient question, but is there a working example
>>> anywhere of setting a custom restart strategy?
>>>
>>> Asking because I’ve been wandering through the Flink 1.9 code base for a
>>> while, and the restart strategy implementation is…pretty tangled.
>>>
>>> From what I’ve been able to figure out, you have to provide a factory
>>> class, something like this:
>>>
>>> Configuration config = new Configuration();
>>> config.setString(ConfigConstants.RESTART_STRATEGY,
>>>     MyRestartStrategyFactory.class.getCanonicalName());
>>> StreamExecutionEnvironment env =
>>>     StreamExecutionEnvironment.createLocalEnvironment(4, config);
>>>
>>> That factory class should extend RestartStrategyFactory, but it also
>>> needs to implement a static method that looks like:
>>>
>>> public static MyRestartStrategyFactory
>>>     createFactory(Configuration config) {
>>>   return new MyRestartStrategyFactory();
>>> }
>>>
>>> I wasn’t able to find any documentation that mentioned this particular
>>> method being a requirement.
>>>
>>> And also the documentation at
>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#fault-tolerance
>>> doesn’t mention you can set a custom class name for the restart-strategy.
>>>
>>> Thanks,
>>>
>>> -- Ken
>>>
>>>
>>> On Nov 22, 2018, at 8:18 AM, Till Rohrmann wrote:
>>>
>>> Hi Kasif,
>>>
>>> I think in this situation it is best if you defined your own custom
>>> RestartStrategy by specifying a class which has a `RestartStrategyFactory
>>> createFactory(Configuration configuration)` method as `restart-strategy:
>>> MyRestartStrategyFactoryFactory` in `flink-conf.yaml`.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Thu, Nov 22, 2018 at 7:18 AM Ali, Kasif wrote:
>>>
>>>> Hello,
>>>>
>>>> Looking at existing restart strategies they are kind of generic.
>>>> We have a requirement to restart the job only in case of specific
>>>> exceptions/issues.
>>>>
>>>> What would be the best way to have a restart strategy which is based
>>>> on a few rules, like looking at a particular type of exception or some
>>>> extra condition checks which are application specific?
>>>>
>>>> Just as background, one specific issue which prompted this requirement
>>>> is slots not getting released when the job finishes. In our applications,
>>>> we keep track of jobs submitted along with the amount of parallelism
>>>> allotted to each. Once a job finishes we assume that its slots are free
>>>> and try to submit the next set of jobs, which at times fails with the
>>>> error “not enough slots available”.
>>>>
>>>> So we think a job restart can solve this issue, but we want to
>>>> restart only if this particular situation is encountered.
>>>>
>>>> Please let us know if there are better ways to solve this problem other
>>>> than a restart strategy.
>>>>
>>>> Thanks,
>>>>
>>>> Kasif
>>>>
>>>
>>> --------------------------
>>> Ken Krugler
>>> http://www.scaleunlimited.com
>>> custom big data solutions & training
>>> Hadoop, Cascading, Cassandra & Solr
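
For anyone landing on this thread later: the rule Kasif describes (restart only when the failure is of a particular kind) can be sketched independently of Flink. The snippet below is only an illustration of that decision logic. SelectiveRestartPolicy, shouldRestart and the marker strings are hypothetical names, not part of any Flink API, and it assumes the failure cause is available as a Throwable, which, per Till's note above, the old RestartStrategy interface did not provide.

```java
import java.util.List;
import java.util.Locale;

/**
 * Hypothetical sketch (not a Flink class): decides whether to restart a job
 * based on the failure cause and a bounded number of attempts.
 */
final class SelectiveRestartPolicy {
    private static final int MAX_CAUSE_DEPTH = 32; // guards against cyclic cause chains

    private final List<String> restartableMarkers; // lower-case message fragments
    private final int maxAttempts;
    private int attempts = 0;

    SelectiveRestartPolicy(List<String> restartableMarkers, int maxAttempts) {
        this.restartableMarkers = restartableMarkers;
        this.maxAttempts = maxAttempts;
    }

    /** True if some exception in the cause chain matches a marker and attempts remain. */
    boolean shouldRestart(Throwable failure) {
        if (attempts >= maxAttempts) {
            return false;
        }
        Throwable t = failure;
        for (int depth = 0; t != null && depth < MAX_CAUSE_DEPTH; depth++, t = t.getCause()) {
            String message = t.getMessage();
            if (message == null) {
                continue;
            }
            String lower = message.toLowerCase(Locale.ROOT);
            for (String marker : restartableMarkers) {
                if (lower.contains(marker)) {
                    attempts++; // count only restarts that are actually granted
                    return true;
                }
            }
        }
        return false;
    }
}
```

With a policy built from the single marker "not enough slots", a failure whose cause chain carries that message is restarted until the attempt budget runs out, while unrelated failures are not. Matching on exception types instead of message text would follow the same shape.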

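
On the static createFactory method Ken ran into: a requirement like that usually means the class is located by name and the method is invoked reflectively. The sketch below shows that general pattern in plain Java. It is an assumption about the mechanism, not a copy of Flink's loader; StrategyFactory, MyStrategyFactory and ReflectiveFactoryLoader are hypothetical names, and a Map stands in for Flink's Configuration.

```java
import java.lang.reflect.Method;
import java.util.Map;

/** Hypothetical stand-in for a factory interface; not Flink's RestartStrategyFactory. */
interface StrategyFactory {
    String describe();
}

/** A user-supplied factory; the loader finds it only through the static method. */
final class MyStrategyFactory implements StrategyFactory {
    public static StrategyFactory createFactory(Map<String, String> config) {
        return new MyStrategyFactory();
    }

    @Override
    public String describe() {
        return "custom";
    }
}

final class ReflectiveFactoryLoader {
    /** Loads the class by name and calls its public static createFactory(Map). */
    static StrategyFactory load(String className, Map<String, String> config)
            throws ReflectiveOperationException {
        Class<?> clazz = Class.forName(className);
        // Throws NoSuchMethodException if the class lacks the expected method.
        Method create = clazz.getMethod("createFactory", Map.class);
        Object result = create.invoke(null, config); // null receiver: static call
        return (StrategyFactory) result;
    }
}
```

A class that does not declare the method fails with NoSuchMethodException at load time rather than at compile time, which would explain why the requirement is easy to miss without documentation.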