[ https://issues.apache.org/jira/browse/BEAM-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Groh reassigned BEAM-3622:
---------------------------------

    Assignee: Charles Chen  (was: Thomas Groh)

> DirectRunner memory issue with Python SDK
> -----------------------------------------
>
>                 Key: BEAM-3622
>                 URL: https://issues.apache.org/jira/browse/BEAM-3622
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>            Reporter: yuri krnr
>            Assignee: Charles Chen
>            Priority: Major
>
> After running a pipeline for a while in streaming mode (reading from Pub/Sub
> and writing to BigQuery, Datastore and another Pub/Sub), I noticed that the
> memory usage of the process had grown drastically. Using guppy as a profiler,
> I got the following results.
>
> At start:
> {noformat}
> INFO *** MemoryReport Heap:
>  Partition of a set of 240208 objects. Total size = 34988840 bytes.
>  Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
>      0  88289  37  8696984  25    8696984  25 str
>      1  53333  22  4897352  14   13594336  39 tuple
>      2   5083   2  2790664   8   16385000  47 dict (no owner)
>      3   1939   1  1749656   5   18134656  52 type
>      4    699   0  1723272   5   19857928  57 dict of module
>      5  12337   5  1579136   5   21437064  61 types.CodeType
>      6  12403   5  1488360   4   22925424  66 function
>      7   1939   1  1452616   4   24378040  70 dict of type
>      8    677   0   709496   2   25087536  72 dict of 0x1e4d880
>      9  25603  11   614472   2   25702008  73 int
> <1103 more rows. Type e.g. '_.more' to view.>
> {noformat}
>
> After several hours of running:
> {noformat}
> INFO *** MemoryReport Heap:
>  Partition of a set of 1255662 objects. Total size = 315029632 bytes.
>  Index   Count   %      Size   % Cumulative  % Kind (class / dict of class)
>      0   95554   8  99755056  32   99755056  32 dict of apache_beam.runners.direct.bundle_factory._Bundle
>      1  117943   9  54193192  17  153948248  49 dict (no owner)
>      2  161068  13  27169296   9  181117544  57 unicode
>      3   94571   8  26479880   8  207597424  66 dict of apache_beam.pvalue.PBegin
>      4  126461  10  12715336   4  220312760  70 str
>      5   44374   4  12424720   4  232737480  74 dict of apitools.base.protorpclite.messages.FieldList
>      6   44374   4   6348624   2  239086104  76 apitools.base.protorpclite.messages.FieldList
>      7   95556   8   6115584   2  245201688  78 apache_beam.runners.direct.bundle_factory._Bundle
>      8   94571   8   6052544   2  251254232  80 apache_beam.pvalue.PBegin
>      9   57371   5   5218424   2  256472656  81 tuple
> <1187 more rows. Type e.g. '_.more' to view.>
> {noformat}
>
> I see that every bundle still sits in memory, along with all of its data.
> Why aren't they garbage-collected?
> What is the GC policy for the dataflow processes?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)