[jira] [Updated] (BEAM-7866) Python MongoDB IO performance and correctness issues

Anonymous (Jira) Thu, 13 Apr 2023 03:58:07 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Anonymous updated BEAM-7866:
----------------------------
    Status: Triage Needed  (was: Resolved)

> Python MongoDB IO performance and correctness issues
> ----------------------------------------------------
>
>                 Key: BEAM-7866
>                 URL: https://issues.apache.org/jira/browse/BEAM-7866
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>            Reporter: Eugene Kirpichov
>            Assignee: Yichi Zhang
>            Priority: P2
>             Fix For: 2.15.0
>
>          Time Spent: 12h
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: this class 
> has extensive coverage with it, and the tests pass. This is because the tests 
> return the same results in the same order. I don't know how to catch this 
> automatically, and I don't know how to catch the performance issue 
> automatically, but these would all be important follow-up items after the 
> actual fix.
> CC: [~chamikara] as reviewer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (BEAM-7866) Python MongoDB IO performance and correctness issues

Reply via email to