[ 
https://issues.apache.org/jira/browse/BEAM-5500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw updated BEAM-5500:
----------------------------------
    Labels: portability-flink  (was: )

> Portable python sdk worker leaks memory in streaming mode
> ---------------------------------------------------------
>
>                 Key: BEAM-5500
>                 URL: https://issues.apache.org/jira/browse/BEAM-5500
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-harness
>            Reporter: Micah Wylde
>            Assignee: Robert Bradshaw
>            Priority: Major
>              Labels: portability-flink
>         Attachments: chart.png
>
>
> When using the portable python sdk with flink in streaming mode, we see that 
> the python worker processes steadily increase memory usage until they are OOM 
> killed. This behavior is consistent across various kinds of streaming 
> pipelines, including those with fixed windows and global windows.
> A simple wordcount-like pipeline demonstrates the issue for us (note this is 
> run on the [Lyft beam fork|https://github.com/lyft/beam/], which provides 
> access to kinesis as a portable streaming source):
> {code:java}
> counts = (p
> | 'Kinesis' >> FlinkKinesisInput().with_stream('test-stream')
> | 'decode' >> beam.FlatMap(decode) # parses from json into python objs
> | 'pair_with_one' >> beam.Map(lambda x: (x["event_name"], 1))
> | 'window' >> beam.WindowInto(window.GlobalWindows(),
>                               trigger=AfterProcessingTime(15 * 1000),
>                               accumulation_mode=AccumulationMode.DISCARDING)
> | 'group' >> beam.GroupByKey()
> | 'count' >> beam.Map(count_ones)
> | beam.Map(lambda x: logging.warn("count: %s", str(x)) or x))
> {code}
> When run, we see a steady increase in memory usage in the sdk_worker process. 
> Using [heapy|http://guppy-pe.sourceforge.net/#Heapy] I've analyzed the memory 
> usage over time and found that it's largely dicts and strings (see attached 
> chart).
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to