Comment inline. On Thu, Feb 11, 2016 at 12:21 PM, Timothy Farkas <[email protected]> wrote:
> +1 for the idea. > > Gaurav, this could be done idempotently in the same way that dynamic > repartitioning is done idempotently. All the partitions are rolled back to > a common checkpoint and the new StreamCodec is applied starting then. The > statistics that the Stream Codec are given are the statistics for the > windows computed before the common checkpoint that the partitions are > rolled back to. > > In fact I think this feature could be added easily by avoiding buffer > server entirely and by allowing the Partitioner to redefine the StreamCodec > for the operator when define partitions is called. > Are you saying this in context of recovery or in general? > > Thanks, > Tim > > On Thu, Feb 11, 2016 at 12:07 PM, Amol Kekre <[email protected]> wrote: > > > Gaurav, > > It would not be idempotent per partition, but will be across all > partitions > > combined. In this case the user would have explicitly asked for such a > > pattern. > > > > Thks, > > Amol > > > > > > On Thu, Feb 11, 2016 at 12:04 PM, Gaurav Gupta <[email protected] > > > > wrote: > > > > > Pramod, > > > > > > How would it work with recovery? There could be cases where a tuple > went > > to > > > P1 and post recovery it can go to P2 > > > > > > Gaurav > > > > > > On Thu, Feb 11, 2016 at 11:56 AM, Pramod Immaneni < > > [email protected]> > > > wrote: > > > > > > > Hi, > > > > > > > > There are scenarios where the downstream partitions of an upstream > > > operator > > > > are generally not performing uniformly resulting in an overall > > > sub-optimal > > > > performance dictated by the slowest partitions. The reasons could be > > data > > > > related such as some partitions are receiving more data to process > than > > > the > > > > others or could be environment related such as some partitions are > > > running > > > > slower than others because they are on heavily loaded nodes. > > > > > > > > A solution based on currently available functionality in the engine > > would > > > > be to write a StreamCodec implementation to distribute data among the > > > > partitions such that each partition is receiving similar amount of > data > > > to > > > > process. We should consider adding StreamCodecs like these to the > > library > > > > but these however do not solve the problem when it is environment > > > related. > > > > > > > > For that a better and more comprehensive approach would be look at > how > > > data > > > > is being consumed by the downstream partitions from the BufferServer > > and > > > > use that information to make decisions on how to send future data. If > > > some > > > > partitions are behind others in consuming data then data can be > > directed > > > to > > > > the other partitions. One way to do this would be to relay this type > of > > > > statistical and positional information from BufferServer to the > > upstream > > > > publishers. The publishers can use this information in ways such as > > > making > > > > it available to StreamCodecs to affect destination of future data. > > > > > > > > What do you think. > > > > > > > > Thanks > > > > > > > > > >
