[ https://issues.apache.org/jira/browse/BEAM-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339797#comment-16339797 ]
Eugene Kirpichov commented on BEAM-2680: ---------------------------------------- Note: as a workaround, normally a user should be able to "shard" the input of Watch (e.g. a filepattern) so that each individual poll result is smaller. > Improve scalability of the Watch transform > ------------------------------------------ > > Key: BEAM-2680 > URL: https://issues.apache.org/jira/browse/BEAM-2680 > Project: Beam > Issue Type: Bug > Components: sdk-java-core > Reporter: Eugene Kirpichov > Assignee: Eugene Kirpichov > Priority: Major > > [https://github.com/apache/beam/pull/3565] introduces the Watch transform > [http://s.apache.org/beam-watch-transform]. > The implementation leaves several scalability-related TODOs: > 1) The state stores hashes and timestamps of outputs that have already been > output and should be omitted from future polls. We could garbage-collect this > state, e.g. dropping elements from "completed" and from addNewAsPending() if > their timestamp is more than X behind the watermark. > 2) When a poll returns a huge number of elements, we don't necessarily have > to add all of them into state.pending - instead we could add only N oldest > elements and ignore others, relying on future poll rounds to provide them, in > order to avoid blowing up the state. Combined with garbage collection of > GrowthState.completed, this would make the transform scalable to very large > poll results. -- This message was sent by Atlassian JIRA (v7.6.3#76005)