[ https://issues.apache.org/jira/browse/BEAM-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kenneth Knowles reassigned BEAM-3484:
-------------------------------------

    Assignee: Chamikara Jayalath  (was: Kenneth Knowles)

> HadoopInputFormatIO reads big datasets invalid
> ----------------------------------------------
>
>                 Key: BEAM-3484
>                 URL: https://issues.apache.org/jira/browse/BEAM-3484
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-extensions
>            Reporter: Łukasz Gajowy
>            Assignee: Chamikara Jayalath
>            Priority: Major
>         Attachments: result_sorted1000000, result_sorted600000
>
>
> For big datasets, HadoopInputFormat sometimes skips or duplicates elements
> from the database in the resulting PCollection, which makes the read result
> incorrect. I ran into this while developing HadoopInputFormatIOIT and running
> it on Dataflow. For datasets of 600 000 database rows or fewer I was not able
> to reproduce the issue. The bug appeared only for bigger sets, e.g. 700 000
> or 1 000 000 rows.
> Attachments:
>  - a text file with the sorted HadoopInputFormat.read() result, saved using
> TextIO.write().to().withoutSharding(). If you look carefully, you will notice
> duplicated and missing values, neither of which should occur
>  - the same text file for 600 000 records, which has no duplicates and no
> missing elements
>  - a link to a PR with the HadoopInputFormatIO integration test that makes it
> possible to reproduce this issue. At the time of writing, this code has not
> been merged yet.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
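
The manual inspection of the attached result files described in the issue could also be automated. The following is only a minimal sketch, not part of the issue or the PR: it assumes the result file holds one element per line and that a correct read yields exactly as many distinct lines as there are database rows. The function name `summarize_read_result` and the `expected_rows` parameter are hypothetical.

```python
from collections import Counter
from typing import Iterable


def summarize_read_result(lines: Iterable[str], expected_rows: int) -> dict:
    """Count duplicated and missing elements in a read result.

    Assumption: one element per line, and a correct read produces exactly
    `expected_rows` distinct lines.
    """
    # Ignore blank lines and strip the trailing newline of each element.
    counts = Counter(line.rstrip("\n") for line in lines if line.strip())
    total = sum(counts.values())
    return {
        "total": total,
        "distinct": len(counts),
        # Values that occur more than once, with their occurrence count.
        "duplicated_values": {v: n for v, n in counts.items() if n > 1},
        # Number of surplus copies across all duplicated values.
        "extra_copies": total - len(counts),
        # Distinct values that are absent relative to the expected row count.
        "missing_count": max(expected_rows - len(counts), 0),
    }


# Small in-memory example: "b" is duplicated and one expected value is absent.
sample = ["a", "b", "b", "d"]
report = summarize_read_result(sample, expected_rows=4)
print(report)
```

For an attachment such as result_sorted1000000, the open file object could be passed as `lines` with `expected_rows=1_000_000`. A non-empty `duplicated_values` or a non-zero `missing_count` would then point to the behaviour reported here.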