[ https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92133&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92133 ]
ASF GitHub Bot logged work on BEAM-3484:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 18/Apr/18 15:21
            Start Date: 18/Apr/18 15:21
    Worklog Time Spent: 10m
      Work Description: aromanenko-dev opened a new pull request #5166: [BEAM-3484] Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166

When using DBInputFormat to fetch data from an RDBMS, Beam parallelises the read by using the LIMIT and OFFSET clauses of the SQL query, so that different workers fetch different ranges of records (one range per split). By default, an RDBMS does not guarantee a deterministic order of results, and the order can differ each time the same query runs. This can cause duplicated or missing rows in the final result. To guarantee a stable order and a proper split of results, the client must order them by one or more keys (either PRIMARY or UNIQUE). This can be done by setting a configuration option in the Hadoop configuration.

------------------------

Follow this checklist to help us incorporate your contribution quickly and easily:

 - [x] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
 - [x] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue.
 - [x] Write a pull request description that is detailed enough to understand:
   - [x] What the pull request does
   - [x] Why it does it
   - [x] How it does it
   - [x] Why this approach
 - [x] Each commit in the pull request should have a meaningful subject line and body.
 - [x] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
 - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

            Worklog Id:     (was: 92133)
            Time Spent: 10m
    Remaining Estimate: 0h

> HadoopInputFormatIO reads big datasets invalid
> ----------------------------------------------
>
>                 Key: BEAM-3484
>                 URL: https://issues.apache.org/jira/browse/BEAM-3484
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-hadoop
>    Affects Versions: 2.3.0, 2.4.0
>            Reporter: Łukasz Gajowy
>            Assignee: Alexey Romanenko
>            Priority: Minor
>             Fix For: 2.5.0
>
>         Attachments: result_sorted1000000, result_sorted600000
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> For big datasets, HadoopInputFormat sometimes skips or duplicates elements
> from the database in the resulting PCollection. This gives an incorrect read
> result.
> This occurred to me while developing HadoopInputFormatIOIT and running it on
> Dataflow. For datasets smaller than or equal to 600 000 database rows I
> wasn't able to reproduce the issue. The bug appeared only for bigger sets,
> e.g. 700 000 or 1 000 000 rows.
> Attachments:
>  - text file with the sorted HadoopInputFormat.read() result, saved using
> TextIO.write().to().withoutSharding(). If you look carefully you'll notice
> duplicates or missing values that should not happen
>  - same text file for 600 000 records, which has no duplicates or missing
> elements
>  - link to a PR with a HadoopInputFormatIO integration test that allows
> reproducing this issue. At the time of writing, this code is not merged yet.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
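
For readers following the pull request description above, here is a minimal sketch of why LIMIT/OFFSET splits need an explicit ORDER BY. It is not the code from the pull request and not Hadoop's implementation: the class `SplitQueries`, the method `build`, the table `test_table` and the column `id` are all hypothetical names chosen for illustration. In Hadoop itself the ordering is, as far as I know, supplied through the `DBConfiguration.INPUT_ORDER_BY_PROPERTY` key or the `orderBy` argument of `DBInputFormat.setInput`, but treat that as an assumption and check the Javadoc for your Hadoop version.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Beam or Hadoop) that illustrates LIMIT/OFFSET splitting. */
class SplitQueries {

  /**
   * Builds one query per split. Each split reads at most rowsPerSplit rows
   * starting at its own offset. The ORDER BY clause makes the row order stable,
   * so the ranges neither overlap nor leave gaps between workers.
   */
  static List<String> build(String baseQuery, String orderBy, long totalRows, long rowsPerSplit) {
    if (rowsPerSplit <= 0) {
      throw new IllegalArgumentException("rowsPerSplit must be positive");
    }
    List<String> queries = new ArrayList<>();
    for (long offset = 0; offset < totalRows; offset += rowsPerSplit) {
      // The last split may be shorter than the others.
      long limit = Math.min(rowsPerSplit, totalRows - offset);
      queries.add(baseQuery + " ORDER BY " + orderBy + " LIMIT " + limit + " OFFSET " + offset);
    }
    return queries;
  }

  public static void main(String[] args) {
    // Assumed example: 10 rows read by splits of 4 rows, ordered by a unique key.
    for (String query : build("SELECT id, name FROM test_table", "id ASC", 10, 4)) {
      System.out.println(query);
    }
  }
}
```

With 10 rows and 4 rows per split, this yields three queries at offsets 0, 4 and 8. Without the ORDER BY on a PRIMARY or UNIQUE key, each of those queries could see the rows in a different order, which is how the same row can land in two splits or in none.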