[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

ASF GitHub Bot (JIRA) Wed, 18 Apr 2018 08:22:17 -0700

     [ 
https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92133&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92133
 ]


ASF GitHub Bot logged work on BEAM-3484:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 18/Apr/18 15:21
            Start Date: 18/Apr/18 15:21
    Worklog Time Spent: 10m 
      Work Description: aromanenko-dev opened a new pull request #5166: 
[BEAM-3484] Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166
 
 
   When DBInputFormat is used to fetch data from an RDBMS, Beam parallelises the 
read by using the LIMIT and OFFSET clauses of the SQL query, so that different 
workers fetch different ranges of records (one range per split). By default, an 
RDBMS does not guarantee a stable order of results, so the same query can return 
rows in a different order each time it runs. Because every split runs its own 
query, this can lead to duplicated or missing rows in the final result.
   To guarantee a stable order and a correct split of results, the client must 
order them by one or more keys (either PRIMARY or UNIQUE). This can be done by 
setting a configuration option in the Hadoop configuration.
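
   The effect described above can be reproduced without a database. The sketch below is only a simulation under stated assumptions: the class `SplitOrderingSketch` and its methods are hypothetical, and a shuffle stands in for an RDBMS that is free to return rows in any order when no ORDER BY is given. It is not the code of this pull request. In Hadoop's `DBConfiguration` the ordering option is, to my knowledge, the key `mapreduce.jdbc.input.orderby`; treat that name as an assumption and check it against your Hadoop version.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.TreeSet;

/** Hypothetical simulation of LIMIT/OFFSET splits over a table with unstable row order. */
class SplitOrderingSketch {

    /** Stand-in for the table: row ids 0..rowCount-1. */
    static List<Integer> table(int rowCount) {
        List<Integer> rows = new ArrayList<>();
        for (int id = 0; id < rowCount; id++) {
            rows.add(id);
        }
        return rows;
    }

    /**
     * Emulates "SELECT id FROM t [ORDER BY id] LIMIT limit OFFSET offset".
     * Without ORDER BY the rows are shuffled on every call (an assumption that
     * mimics a database returning rows in an arbitrary order per query).
     */
    static List<Integer> query(List<Integer> table, boolean orderById,
                               int limit, int offset, Random rng) {
        List<Integer> rows = new ArrayList<>(table);
        if (orderById) {
            Collections.sort(rows);
        } else {
            Collections.shuffle(rows, rng);
        }
        int from = Math.min(offset, rows.size());
        int to = Math.min(offset + limit, rows.size());
        return new ArrayList<>(rows.subList(from, to));
    }

    /** Reads the table as numSplits independent LIMIT/OFFSET queries, one per worker. */
    static List<Integer> readAllSplits(List<Integer> table, int numSplits,
                                       boolean orderById, long seed) {
        Random rng = new Random(seed);
        int splitSize = (table.size() + numSplits - 1) / numSplits;
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            result.addAll(query(table, orderById, splitSize, i * splitSize, rng));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> table = table(1000);
        List<Integer> ordered = readAllSplits(table, 4, true, 42L);
        List<Integer> unordered = readAllSplits(table, 4, false, 42L);
        // Both reads return 1000 rows; only the ordered one is expected to cover every id.
        System.out.println("ordered distinct ids:   " + new TreeSet<>(ordered).size());
        System.out.println("unordered distinct ids: " + new TreeSet<>(unordered).size());
    }
}
```

   With ordering, the splits are disjoint ranges of one fixed sequence, so their union is exactly the table. Without it, each split is cut from a different sequence, so the row count still matches while some ids repeat and others never appear.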
   
   ------------------------
   
   Follow this checklist to help us incorporate your contribution quickly and 
easily:
   
    - [x] Make sure there is a [JIRA 
issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the 
change (usually before you start working on it).  Trivial changes like typos do 
not require a JIRA issue.  Your pull request should address just this issue, 
without pulling in other changes.
    - [x] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue.
    - [x] Write a pull request description that is detailed enough to 
understand:
      - [x] What the pull request does
      - [x] Why it does it
      - [x] How it does it
      - [x] Why this approach
    - [x] Each commit in the pull request should have a meaningful subject line 
and body.
    - [x] Run `mvn clean verify` to make sure basic checks pass. A more 
thorough check will be performed on your pull request automatically.
    - [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 92133)
            Time Spent: 10m
    Remaining Estimate: 0h

> HadoopInputFormatIO reads big datasets invalid
> ----------------------------------------------
>
>                 Key: BEAM-3484
>                 URL: https://issues.apache.org/jira/browse/BEAM-3484
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-hadoop
>    Affects Versions: 2.3.0, 2.4.0
>            Reporter: Łukasz Gajowy
>            Assignee: Alexey Romanenko
>            Priority: Minor
>             Fix For: 2.5.0
>
>         Attachments: result_sorted1000000, result_sorted600000
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> For big datasets, HadoopInputFormat sometimes skips or duplicates elements from 
> the database in the resulting PCollection, which gives an incorrect read result.
> This occurred while I was developing HadoopInputFormatIOIT and running it on 
> Dataflow. For datasets of 600 000 database rows or fewer I wasn't 
> able to reproduce the issue. The bug appeared only for bigger sets, e.g. 700 000 
> or 1 000 000 rows. 
> Attachments:
>  - a text file with the sorted HadoopInputFormat.read() result, saved using 
> TextIO.write().to().withoutSharding(). If you look carefully, you'll notice 
> duplicates and missing values that should not be there
>  - the same kind of text file for 600 000 records, with no duplicates or missing 
> elements
>  - a link to a PR with the HadoopInputFormatIO integration test that makes it 
> possible to reproduce this issue. At the time of writing, this code is not merged yet.
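
Looking for duplicates and gaps by eye in a sorted file of a million rows is error-prone, so a small check helps. The sketch below is only an illustration: the class `SortedResultCheck` is hypothetical, and it assumes the sorted result has already been parsed into ascending numeric ids, which the issue does not confirm about the attachment format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: finds duplicated and missing ids in an ascending list. */
class SortedResultCheck {

    /** Returns every id that occurs more than once, reported once each. */
    static List<Long> duplicates(List<Long> sortedIds) {
        List<Long> dups = new ArrayList<>();
        for (int i = 1; i < sortedIds.size(); i++) {
            long prev = sortedIds.get(i - 1);
            long cur = sortedIds.get(i);
            boolean alreadyReported =
                !dups.isEmpty() && dups.get(dups.size() - 1) == cur;
            if (cur == prev && !alreadyReported) {
                dups.add(cur);
            }
        }
        return dups;
    }

    /** Returns every id in [first, last] that does not occur in the list. */
    static List<Long> missing(List<Long> sortedIds, long first, long last) {
        List<Long> gaps = new ArrayList<>();
        long expected = first;
        for (long id : sortedIds) {
            // Everything between the next expected id and this id was skipped.
            while (expected < id && expected <= last) {
                gaps.add(expected++);
            }
            if (id >= expected) {
                expected = id + 1;
            }
        }
        // Ids after the last one seen are missing as well.
        while (expected <= last) {
            gaps.add(expected++);
        }
        return gaps;
    }

    public static void main(String[] args) {
        List<Long> ids = Arrays.asList(0L, 1L, 1L, 3L, 5L);
        System.out.println("duplicates: " + duplicates(ids));
        System.out.println("missing:    " + missing(ids, 0L, 6L));
    }
}
```

A correct read produces two empty lists. Any non-empty list points at rows that were read by more than one split or by none.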



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
