[ https://issues.apache.org/jira/browse/BEAM-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ismaël Mejía updated BEAM-3484:
-------------------------------
    Fix Version/s: 2.5.0

> HadoopInputFormatIO reads big datasets invalid
> ----------------------------------------------
>
>                 Key: BEAM-3484
>                 URL: https://issues.apache.org/jira/browse/BEAM-3484
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-hadoop
>    Affects Versions: 2.3.0, 2.4.0
>            Reporter: Łukasz Gajowy
>            Assignee: Alexey Romanenko
>            Priority: Blocker
>             Fix For: 2.5.0
>
>         Attachments: result_sorted1000000, result_sorted600000
>
>
> For big datasets, HadoopInputFormat sometimes skips or duplicates elements
> from the database in the resulting PCollection, which gives an incorrect
> read result. This occurred to me while developing HadoopInputFormatIOIT and
> running it on Dataflow. For datasets smaller than or equal to 600 000
> database rows I wasn't able to reproduce the issue. The bug appeared only
> for bigger sets, e.g. 700 000 or 1 000 000 rows.
> Attachments:
>  - a text file with the sorted HadoopInputFormat.read() result, saved using
> TextIO.write().to().withoutSharding(). If you look carefully you'll notice
> duplicates or missing values that should not be there
>  - the same text file for 600 000 records, which has no duplicates or
> missing elements
>  - a link to a PR with the HadoopInputFormatIO integration test that allows
> reproducing this issue. At the moment of writing, this code is not merged
> yet.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
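
The manual check described in the issue (scanning a sorted result file for duplicates or missing values) could be automated roughly as follows. This is only a sketch and not part of the Beam code base or of the linked PR: the class name SortedResultCheck and its methods are hypothetical, and it assumes that the file holds one record per line, that the lines are already sorted, and that every database row should yield a distinct line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Beam: checks a sorted result file
// (one record per line) for duplicated and missing records.
class SortedResultCheck {

    /** Returns each value that occurs more than once; the input must be sorted. */
    public static List<String> findDuplicates(List<String> sortedLines) {
        List<String> duplicates = new ArrayList<>();
        for (int i = 1; i < sortedLines.size(); i++) {
            String current = sortedLines.get(i);
            boolean sameAsPrevious = current.equals(sortedLines.get(i - 1));
            boolean alreadyReported = !duplicates.isEmpty()
                    && duplicates.get(duplicates.size() - 1).equals(current);
            if (sameAsPrevious && !alreadyReported) {
                duplicates.add(current);
            }
        }
        return duplicates;
    }

    /**
     * Returns how many records are missing, assuming every database row
     * should produce a distinct line (an assumption of this sketch).
     */
    public static long countMissing(List<String> sortedLines, long expectedRows) {
        long distinct = sortedLines.isEmpty() ? 0 : 1;
        for (int i = 1; i < sortedLines.size(); i++) {
            if (!sortedLines.get(i).equals(sortedLines.get(i - 1))) {
                distinct++;
            }
        }
        return expectedRows - distinct;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the sorted result file, args[1]: expected row count
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        long expectedRows = Long.parseLong(args[1]);
        System.out.println("Duplicated values: " + findDuplicates(lines).size());
        System.out.println("Missing records:   " + countMissing(lines, expectedRows));
    }
}
```

Run against the attached result_sorted1000000 file with an expected count of 1000000, a non-zero value in either line of output would point to the skip/duplicate behaviour reported above.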