[
https://issues.apache.org/jira/browse/BEAM-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anonymous updated BEAM-7866:
----------------------------
Status: Triage Needed (was: Resolved)
> Python MongoDB IO performance and correctness issues
> ----------------------------------------------------
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Eugene Kirpichov
> Assignee: Yichi Zhang
> Priority: P2
> Fix For: 2.15.0
>
> Time Spent: 12h
> Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
> splits the query result by computing number of results in constructor, and
> then in each reader re-executing the whole query and getting an index
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily
> deterministic, so the idea of index ranges on it is meaningless: each shard
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently
> to reading and splitting. E.g. if the database contained documents with ids
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the
> assumption that these shards would contain respectively 10 20 30, and 40 50),
> and then suppose shard 10 20 30 is read and then document 25 is inserted -
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items,
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results,
> which 1) means the constructor can be super slow and 2) it won't work at all
> if the database is unavailable at the time the pipeline is constructed (e.g.
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: this class
> has extensive coverage with it, and the tests pass. This is because the tests
> return the same results in the same order. I don't know how to catch this
> automatically, and I don't know how to catch the performance issue
> automatically, but these would all be important follow-up items after the
> actual fix.
> CC: [~chamikara] as reviewer.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)