[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

ASF GitHub Bot (JIRA) Fri, 20 Apr 2018 01:12:11 -0700

     [ https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=93122&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-93122 ]


ASF GitHub Bot logged work on BEAM-3484:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 20/Apr/18 08:10
            Start Date: 20/Apr/18 08:10
    Worklog Time Spent: 10m 
      Work Description: aromanenko-dev commented on issue #5166: [BEAM-3484] 
Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-383019036
 
 
   @lgajowy Thank you for testing this! 
   @iemejia Thank you for the review and merging!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 93122)
    Time Spent: 2h 20m  (was: 2h 10m)

> HadoopInputFormatIO reads big datasets invalid
> ----------------------------------------------
>
>                 Key: BEAM-3484
>                 URL: https://issues.apache.org/jira/browse/BEAM-3484
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-hadoop
>    Affects Versions: 2.3.0, 2.4.0
>            Reporter: Łukasz Gajowy
>            Assignee: Alexey Romanenko
>            Priority: Minor
>             Fix For: 2.5.0
>
>         Attachments: result_sorted1000000, result_sorted600000
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> For big datasets, HadoopInputFormat sometimes skips or duplicates database 
> elements in the resulting PCollection, so the read result is incorrect.
> This occurred while I was developing HadoopInputFormatIOIT and running it on 
> Dataflow. For datasets of 600 000 database rows or fewer I wasn't able to 
> reproduce the issue. The bug appeared only for bigger sets, e.g. 700 000 or 
> 1 000 000 rows.
> Attachments:
>  - text file with the sorted HadoopInputFormat.read() result, saved using 
> TextIO.write().to().withoutSharding(). If you look carefully you'll notice 
> duplicates or missing values that should not occur
>  - the same kind of text file for 600 000 records, with no duplicates or 
> missing elements
>  - link to a PR with the HadoopInputFormatIO integration test that makes it 
> possible to reproduce this issue. At the time of writing, this code is not 
> yet merged.
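
The manual check described for the attached result files (looking for duplicates or missing values) could be automated roughly as follows. This is only a sketch, not part of the issue or the PR: it assumes each line of the result file holds exactly one record, and that the set of expected records is known in advance. The function name check_records and the sample records are hypothetical.

```python
from collections import Counter


def check_records(lines, expected):
    """Return (duplicates, missing) for records read from a result file.

    Assumption: one record per line; `expected` lists every record
    that should appear exactly once.
    """
    # Count each non-blank line, ignoring the trailing newline.
    counts = Counter(line.rstrip("\n") for line in lines if line.strip())
    # Records that appear more than once, with their occurrence counts.
    duplicates = {rec: n for rec, n in counts.items() if n > 1}
    # Expected records that never appear in the file.
    missing = sorted(set(expected) - set(counts))
    return duplicates, missing


if __name__ == "__main__":
    # Hypothetical sample standing in for a result file.
    sample = ["row0\n", "row1\n", "row1\n", "row3\n"]
    expected = [f"row{i}" for i in range(4)]
    duplicates, missing = check_records(sample, expected)
    print("duplicates:", duplicates)
    print("missing:", missing)
```

To run it against a real file, pass an open file handle as `lines`, for example `check_records(open("result_sorted1000000"), expected)`, with `expected` built from whatever rows the test database was loaded with.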



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
