[jira] [Commented] (FLINK-10423) Forward RocksDB memory metrics to Flink metrics reporter

2018-09-25 Thread Monal Daxini (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16627866#comment-16627866
 ] 

Monal Daxini commented on FLINK-10423:
--------------------------------------

It would be great to have this ported to the 1.6 release as well.

Spoke with [~srichter] about this offline.

> Forward RocksDB memory metrics to Flink metrics reporter 
> --------------------------------------------------------
>
> Key: FLINK-10423
> URL: https://issues.apache.org/jira/browse/FLINK-10423
> Project: Flink
>  Issue Type: New Feature
>  Components: Metrics, State Backends, Checkpointing
>Reporter: Seth Wiesman
>Assignee: Seth Wiesman
>Priority: Major
>
> RocksDB contains a number of metrics at the column family level about current 
> memory usage, open memtables, etc. that would be useful to users wishing 
> greater insight into what RocksDB is doing. This work is inspired heavily by the 
> comments on this rocksdb issue thread 
> (https://github.com/facebook/rocksdb/issues/3216#issuecomment-348779233)
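A rough sketch of how such forwarding could look. Here `PropertySource` stands in for the RocksDB JNI handle's `getProperty` call, and a real implementation would register each value as a Gauge on the operator's MetricGroup rather than returning a snapshot map; the property keys are actual RocksDB property names, but the selection and the name normalization are illustrative assumptions:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: PropertySource stands in for org.rocksdb.RocksDB#getProperty;
// a real implementation would register a Gauge per property with Flink's
// MetricGroup instead of returning a snapshot map.
class RocksDbMetricForwarder {
    interface PropertySource {
        long getLongProperty(String name);
    }

    // Real RocksDB property names; the selection here is illustrative.
    static final String[] PROPERTIES = {
            "rocksdb.cur-size-all-mem-tables",
            "rocksdb.num-immutable-mem-table",
            "rocksdb.estimate-table-readers-mem",
    };

    /** Samples each property under a reporter-friendly metric name. */
    static Map<String, Long> sample(PropertySource db) {
        Map<String, Long> gauges = new LinkedHashMap<>();
        for (String property : PROPERTIES) {
            // Normalize '.' and '-' so downstream reporters don't split the name.
            String metricName = property.replace('.', '_').replace('-', '_');
            gauges.put(metricName, db.getLongProperty(property));
        }
        return gauges;
    }

    public static void main(String[] args) {
        // Stand-in source; a real one would query a live RocksDB instance.
        PropertySource fake = name -> name.length();
        System.out.println(sample(fake));
    }
}
```

Since the issue asks for column-family-level metrics, a real version would presumably repeat this per ColumnFamilyHandle and scope each gauge under the column family's name.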



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-10423) Forward RocksDB memory metrics to Flink metrics reporter

2018-09-25 Thread Monal Daxini (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16627866#comment-16627866
 ] 

Monal Daxini edited comment on FLINK-10423 at 9/25/18 8:15 PM:
---------------------------------------------------------------

It would be great to have this ported to the 1.6 release.

Spoke with [~srichter] about this offline.


was (Author: mdaxini):
It would be great to have this ported to the 1.6 release as well.

Spoke with [~srichter] about this offline.

> Forward RocksDB memory metrics to Flink metrics reporter 
> --------------------------------------------------------
>
> Key: FLINK-10423
> URL: https://issues.apache.org/jira/browse/FLINK-10423
> Project: Flink
>  Issue Type: New Feature
>  Components: Metrics, State Backends, Checkpointing
>Reporter: Seth Wiesman
>Assignee: Seth Wiesman
>Priority: Major
>
> RocksDB contains a number of metrics at the column family level about current 
> memory usage, open memtables, etc. that would be useful to users wishing 
> greater insight into what RocksDB is doing. This work is inspired heavily by the 
> comments on this rocksdb issue thread 
> (https://github.com/facebook/rocksdb/issues/3216#issuecomment-348779233)





[jira] [Commented] (FLINK-9061) add entropy to s3 path for better scalability

2018-06-07 Thread Monal Daxini (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16505447#comment-16505447
 ] 

Monal Daxini commented on FLINK-9061:
-------------------------------------

In addition to what [~stevenz3wu] and [~jgrier] suggest, it would be good to 
make the entropy generation pluggable. This way users can override the default 
entropy generation if they need to.
{code}
state.backend.fs.checkpointdir.injectEntropy.strategy=com.foo.MyStaticEntropyGenerator
{code}
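A hypothetical shape for such a pluggable strategy. The interface, the default implementation, and the config key above are all assumptions for illustration, not existing Flink API:

```java
// Hypothetical sketch: neither this interface nor the config key above is
// part of Flink's actual API; it only illustrates how a user-supplied class
// could replace the default entropy generation.
class PluggableEntropy {
    /** Strategy interface a user class would implement. */
    interface EntropyGenerator {
        String entropyFor(String checkpointPath);
    }

    /** A possible built-in default: a stable hex hash of the path. */
    public static class HashEntropyGenerator implements EntropyGenerator {
        @Override
        public String entropyFor(String checkpointPath) {
            return Integer.toHexString(checkpointPath.hashCode());
        }
    }

    /**
     * Resolves the configured class name (e.g. com.foo.MyStaticEntropyGenerator)
     * reflectively, falling back to the built-in default when none is set.
     */
    static EntropyGenerator load(String className) {
        if (className == null || className.isEmpty()) {
            return new HashEntropyGenerator();
        }
        try {
            return (EntropyGenerator) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                    "cannot load entropy generator: " + className, e);
        }
    }

    public static void main(String[] args) {
        EntropyGenerator g = load(null); // no override configured
        System.out.println(g.entropyFor("s3://bucket/flink/checkpoints"));
    }
}
```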

> add entropy to s3 path for better scalability
> ---------------------------------------------
>
> Key: FLINK-9061
> URL: https://issues.apache.org/jira/browse/FLINK-9061
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystem, State Backends, Checkpointing
>Affects Versions: 1.5.0, 1.4.2
>Reporter: Jamie Grier
>Assignee: Indrajit Roychoudhury
>Priority: Critical
>
> I think we need to modify the way we write checkpoints to S3 for high-scale 
> jobs (those with many total tasks).  The issue is that we are writing all the 
> checkpoint data under a common key prefix.  This is the worst case scenario 
> for S3 performance since the key is used as a partition key.
>  
> In the worst case checkpoints fail with a 500 status code coming back from S3 
> and an internal error type of TooBusyException.
>  
> One possible solution would be to add a hook in the Flink filesystem code 
> that allows me to "rewrite" paths.  For example say I have the checkpoint 
> directory set to:
>  
> s3://bucket/flink/checkpoints
>  
> I would hook that and rewrite that path to:
>  
> s3://bucket/[HASH]/flink/checkpoints, where HASH is the hash of the original 
> path
>  
> This would distribute the checkpoint write load around the S3 cluster evenly.
>  
> For reference: 
> https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/
>  
> Has anyone else hit this issue? Any other ideas for solutions? This is a 
> pretty serious problem for people trying to checkpoint to S3.
>  
> -Jamie
>  
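The path rewrite described in the quoted issue could look roughly like the following sketch. The hashing scheme (MD5 truncated to two bytes) and the helper name are illustrative assumptions, not what Flink actually shipped:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the proposed hook: prefix the object key with a hash of the
// original path so checkpoint writes spread across S3 partitions.
class S3PathEntropy {
    /** s3://bucket/flink/checkpoints -> s3://bucket/<hash>/flink/checkpoints */
    static String rewrite(String path) {
        int keyStart = path.indexOf('/', path.indexOf("://") + 3);
        String bucket = path.substring(0, keyStart);   // "s3://bucket"
        String key = path.substring(keyStart + 1);     // "flink/checkpoints"
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(path.getBytes(StandardCharsets.UTF_8));
            // Two digest bytes give plenty of spread; the hash is stable, so
            // the rewritten path is the same on every checkpoint and restore.
            String hash = String.format("%02x%02x",
                    digest[0] & 0xff, digest[1] & 0xff);
            return bucket + "/" + hash + "/" + key;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 unavailable", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rewrite("s3://bucket/flink/checkpoints"));
    }
}
```

Determinism matters here: restores must compute the same prefix as the original write, which is why the hash is derived from the configured path rather than from random entropy.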





[jira] [Commented] (FLINK-8571) Provide an enhanced KeyedStream implementation to use ForwardPartitioner

2018-02-07 Thread Monal Daxini (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16355877#comment-16355877
 ] 

Monal Daxini commented on FLINK-8571:
-------------------------------------

Thanks Stefan for the quick turnaround time!

Can this patch be applied directly to 1.4, and would this be available in the 
next 1.4.1 release as well?

> Provide an enhanced KeyedStream implementation to use ForwardPartitioner
> ------------------------------------------------------------------------
>
> Key: FLINK-8571
> URL: https://issues.apache.org/jira/browse/FLINK-8571
> Project: Flink
>  Issue Type: Improvement
>Reporter: Nagarjun Guraja
>Assignee: Stefan Richter
>Priority: Major
>
> This enhancement would help in modeling problems with pre-partitioned input 
> sources (e.g. Kafka with keyed topics). It would help make the job graph 
> embarrassingly parallel while leveraging the RocksDB state backend and the 
> fine-grained recovery semantics.





[jira] [Commented] (FLINK-8042) Retry individual failover-strategy for some time first before reverting to full job restart

2018-02-06 Thread Monal Daxini (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354124#comment-16354124
 ] 

Monal Daxini commented on FLINK-8042:
-------------------------------------

This bug is quite detrimental to workloads with large state and massive 
parallelism. It is a potential blocker for one of our use cases, because a full 
job restart takes a while to recover from.

Local state recovery (potentially targeted for 1.5) combined with running the 
job with one or two extra TaskManagers might mitigate this to an extent, but it 
does not solve the problem.

It would be good to address this sooner rather than later. Do we have a design 
doc that outlines the interface changes? If not, can we please start one?

 

> Retry individual failover-strategy for some time first before reverting to 
> full job restart
> ---------------------------------------------------------------------------
>
> Key: FLINK-8042
> URL: https://issues.apache.org/jira/browse/FLINK-8042
> Project: Flink
>  Issue Type: Bug
>  Components: ResourceManager, State Backends, Checkpointing
>Affects Versions: 1.3.2
>Reporter: Steven Zhen Wu
>Priority: Blocker
> Fix For: 1.5.0
>
>
> Let's say we kill a taskmanager node. When Flink attempts fine-grained 
> recovery and fails because the replacement taskmanager node didn't come back 
> in time, it reverts to a full job restart. 
> Stephan and Till were suggesting that Flink can/should retry fine-grained 
> recovery for some time before giving up and reverting to a full job restart.
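The suggested policy — retry fine-grained recovery a bounded number of times before escalating — could be sketched as below. The names and the attempt-count policy are illustrative, not Flink's actual failover-strategy API:

```java
import java.util.function.BooleanSupplier;

// Sketch of the suggested policy: retry fine-grained (regional) recovery a
// bounded number of times, and only then escalate to a full job restart.
class EscalatingFailover {
    enum Outcome { REGIONAL_RECOVERY_SUCCEEDED, FULL_JOB_RESTART }

    /**
     * attemptRegionalRecovery returns true on success (e.g. a replacement
     * TaskManager came back in time); maxRegionalAttempts bounds the retries.
     */
    static Outcome recover(BooleanSupplier attemptRegionalRecovery,
                           int maxRegionalAttempts) {
        for (int attempt = 1; attempt <= maxRegionalAttempts; attempt++) {
            if (attemptRegionalRecovery.getAsBoolean()) {
                return Outcome.REGIONAL_RECOVERY_SUCCEEDED;
            }
            // A real implementation would back off here before retrying.
        }
        return Outcome.FULL_JOB_RESTART;
    }

    public static void main(String[] args) {
        // Simulated TaskManager that comes back on the third attempt.
        int[] calls = {0};
        Outcome o = recover(() -> ++calls[0] >= 3, 5);
        System.out.println(o + " after " + calls[0] + " attempts");
    }
}
```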





[jira] [Commented] (FLINK-4596) RESTART_STRATEGY is not really pluggable

2016-09-22 Thread Monal Daxini (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15513734#comment-15513734
 ] 

Monal Daxini commented on FLINK-4596:
-------------------------------------

Hi [~StephanEwen] what are your thoughts on this?

> RESTART_STRATEGY is not really pluggable
> ----------------------------------------
>
> Key: FLINK-4596
> URL: https://issues.apache.org/jira/browse/FLINK-4596
> Project: Flink
>  Issue Type: Bug
>Reporter: Nagarjun Guraja
>
> The standalone cluster config accepts an implementation (class) as part of the 
> yaml config file, but that does not work either as a cluster-level restart 
> strategy or as a streaming-job-level restart strategy.
> CLUSTER-LEVEL CAUSE: createRestartStrategyFactory converts the configured 
> value of the strategy name to lowercase and searches for the class name using 
> the lowercased string.
> JOB-LEVEL CAUSE: Checkpointed streams have specific code to add a fixed-delay 
> restart configuration if no RestartConfiguration is specified in the job env. 
> Also, jobs cannot provide their own custom restart strategy implementation 
> and are constrained to pick one of the three restart strategies provided by 
> Flink.
> FIX: Do not lowercase the strategy config value, support a new restart 
> configuration that falls back to the cluster-level restart strategy, and 
> allow jobs to provide a custom implementation of the strategy class itself.
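The cluster-level fix could be as simple as matching built-in names case-insensitively while passing anything unrecognized through verbatim as a class name. This sketch is illustrative, not Flink's actual factory code:

```java
// Sketch of the proposed fix: lowercase only the comparison against built-in
// names, never the configured value itself, so a user class such as
// com.foo.MyStrategy keeps its case and can still be loaded reflectively.
class RestartStrategyResolver {
    enum Kind { FIXED_DELAY, FAILURE_RATE, NONE, CUSTOM_CLASS }

    static Kind resolve(String configured) {
        switch (configured.toLowerCase()) { // lowercase only for matching
            case "fixed-delay":  return Kind.FIXED_DELAY;
            case "failure-rate": return Kind.FAILURE_RATE;
            case "none":         return Kind.NONE;
            default:
                // The reported bug: the original code lowercased `configured`
                // before the class lookup, so com.foo.MyStrategy became
                // com.foo.mystrategy and could never load. Here the untouched
                // value would be passed to Class.forName(configured).
                return Kind.CUSTOM_CLASS;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("Fixed-Delay"));
        System.out.println(resolve("com.foo.MyStrategy"));
    }
}
```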



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)