[ https://issues.apache.org/jira/browse/BEAM-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907646#comment-16907646 ]
yifan zou commented on BEAM-7866: --------------------------------- The PR got merged. I mark this ticket as resolved. > Python MongoDB IO performance and correctness issues > ---------------------------------------------------- > > Key: BEAM-7866 > URL: https://issues.apache.org/jira/browse/BEAM-7866 > Project: Beam > Issue Type: Bug > Components: sdk-py-core > Reporter: Eugene Kirpichov > Assignee: Yichi Zhang > Priority: Major > Fix For: 2.15.0 > > Time Spent: 12h > Remaining Estimate: 0h > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py > splits the query result by computing number of results in constructor, and > then in each reader re-executing the whole query and getting an index > sub-range of those results. > This is broken in several critical ways: > - The order of query results returned by find() is not necessarily > deterministic, so the idea of index ranges on it is meaningless: each shard > may basically get random, possibly overlapping subsets of the total results > - Even if you add order by `_id`, the database may be changing concurrently > to reading and splitting. E.g. if the database contained documents with ids > 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the > assumption that these shards would contain respectively 10 20 30, and 40 50), > and then suppose shard 10 20 30 is read and then document 25 is inserted - > then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and > document 25 is lost. > - Every shard re-executes the query and skips the first start_offset items, > which in total is quadratic complexity > - The query is first executed in the constructor in order to count results, > which 1) means the constructor can be super slow and 2) it won't work at all > if the database is unavailable at the time the pipeline is constructed (e.g. > if this is a template). > Unfortunately, none of these issues are caught by SourceTestUtils: this class > has extensive coverage with it, and the tests pass. This is because the tests > return the same results in the same order. I don't know how to catch this > automatically, and I don't know how to catch the performance issue > automatically, but these would all be important follow-up items after the > actual fix. > CC: [~chamikara] as reviewer. -- This message was sent by Atlassian JIRA (v7.6.14#76016)