[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid
[ https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=93122&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-93122 ]

ASF GitHub Bot logged work on BEAM-3484:
Author: ASF GitHub Bot
Created on: 20/Apr/18 08:10
Start Date: 20/Apr/18 08:10
Worklog Time Spent: 10m

Work Description: aromanenko-dev commented on issue #5166: [BEAM-3484] Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-383019036

@lgajowy Thank you for testing this! @iemejia Thank you for the review and for merging!

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 93122)
Time Spent: 2h 20m (was: 2h 10m)

> HadoopInputFormatIO reads big datasets invalid
> --
>
> Key: BEAM-3484
> URL: https://issues.apache.org/jira/browse/BEAM-3484
> Project: Beam
> Issue Type: Bug
> Components: io-java-hadoop
> Affects Versions: 2.3.0, 2.4.0
> Reporter: Łukasz Gajowy
> Assignee: Alexey Romanenko
> Priority: Minor
> Fix For: 2.5.0
>
> Attachments: result_sorted100, result_sorted60
>
> Time Spent: 2h 20m
> Remaining Estimate: 0h
>
> For big datasets, HadoopInputFormat sometimes skips or duplicates elements from the database in the resulting PCollection, which gives an incorrect read result. This occurred to me while developing HadoopInputFormatIOIT and running it on Dataflow. For datasets of 600 000 database rows or fewer I wasn't able to reproduce the issue; the bug appeared only for bigger sets, e.g. 700 000 or 1 000 000.
>
> Attachments:
> - a text file with the sorted HadoopInputFormat.read() result, saved using TextIO.write().to().withoutSharding(). If you look carefully you'll notice duplicates or missing values that should not be there
> - the same text file for 600 000 records, which has no duplicates or missing elements
> - a link to a PR with a HadoopInputFormatIO integration test that allows reproducing this issue. At the moment of writing, this code is not merged yet.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid
[ https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92820&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92820 ]

ASF GitHub Bot logged work on BEAM-3484:
Author: ASF GitHub Bot
Created on: 19/Apr/18 20:05
Start Date: 19/Apr/18 20:05
Worklog Time Spent: 10m

Work Description: iemejia commented on issue #5166: [BEAM-3484] Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382863883

Thanks @aromanenko-dev for finding the root issue and fixing it. Also thanks @lgajowy for reporting it and for verifying that it works.

Issue Time Tracking
---
Worklog Id: (was: 92820)
Time Spent: 2h 10m (was: 2h)
[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid
[ https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92819&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92819 ]

ASF GitHub Bot logged work on BEAM-3484:
Author: ASF GitHub Bot
Created on: 19/Apr/18 20:04
Start Date: 19/Apr/18 20:04
Worklog Time Spent: 10m

Work Description: iemejia closed pull request #5166: [BEAM-3484] Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java b/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java
index b22d57caa67..0ffd402320d 100644
--- a/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java
+++ b/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java
@@ -62,6 +62,7 @@
 import org.apache.hadoop.mapreduce.RecordReader;
 import org.apache.hadoop.mapreduce.TaskAttemptContext;
 import org.apache.hadoop.mapreduce.TaskAttemptID;
+import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
 import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -163,6 +164,21 @@
  * .withValueTranslation(myOutputValueType);
  * }
  *
+ *
+ * IMPORTANT! In case of using {@code DBInputFormat} to read data from RDBMS, Beam parallelizes
+ * the process by using LIMIT and OFFSET clauses of SQL query to fetch different ranges of records
+ * (as a split) by different workers. To guarantee the same order and proper split of results you
+ * need to order them by one or more keys (either PRIMARY or UNIQUE). It can be done during
+ * configuration step, for example:
+ *
+ *
+ * {@code
+ * Configuration conf = new Configuration();
+ * conf.set(DBConfiguration.INPUT_TABLE_NAME_PROPERTY, tableName);
+ * conf.setStrings(DBConfiguration.INPUT_FIELD_NAMES_PROPERTY, "id", "name");
+ * conf.set(DBConfiguration.INPUT_ORDER_BY_PROPERTY, "id ASC");
+ * }
+ *
  */
 @Experimental(Experimental.Kind.SOURCE_SINK)
 public class HadoopInputFormatIO {
@@ -283,7 +299,9 @@
   /**
    * Validates that the mandatory configuration properties such as InputFormat class, InputFormat
-   * key and value classes are provided in the Hadoop configuration.
+   * key and value classes are provided in the Hadoop configuration. In case of using {@code
+   * DBInputFormat} you need to order results by one or more keys. It can be done by setting
+   * configuration option "mapreduce.jdbc.input.orderby".
    */
   private void validateConfiguration(Configuration configuration) {
     checkArgument(configuration != null, "configuration can not be null");
@@ -294,6 +312,13 @@ private void validateConfiguration(Configuration configuration) {
         configuration.get("key.class") != null, "configuration must contain \"key.class\"");
     checkArgument(
         configuration.get("value.class") != null, "configuration must contain \"value.class\"");
+    if (configuration.get("mapreduce.job.inputformat.class").endsWith("DBInputFormat")) {
+      checkArgument(
+          configuration.get(DBConfiguration.INPUT_ORDER_BY_PROPERTY) != null,
+          "Configuration must contain \""
+              + DBConfiguration.INPUT_ORDER_BY_PROPERTY
+              + "\" when using DBInputFormat");
+    }
   }

   /**
diff --git a/sdks/java/io/hadoop-input-format/src/test/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIOIT.java b/sdks/java/io/hadoop-input-format/src/test/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIOIT.java
index 58f3b0dafa0..e24dd68dd2c 100644
--- a/sdks/java/io/hadoop-input-format/src/test/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIOIT.java
+++ b/sdks/java/io/hadoop-input-format/src/test/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIOIT.java
@@ -110,6 +110,7 @@ private static void setupHadoopConfiguration(IOTestPipelineOptions options) {
     );
     conf.set(DBConfiguration.INPUT_TABLE_NAME_PROPERTY, tableName);
     conf.setStrings(DBConfiguration.INPUT_FIELD_NAMES_PROPERTY, "id", "name");
+    conf.set(DBConfiguration.INPUT_ORDER_BY_PROPERTY, "id ASC");
     conf.setClass(DBConfiguration.INPUT_CLASS_PROPERTY, TestRowDBWritable.class, DBWritable.class);
     conf.setClass("key.class", LongWritable.class, Object.class);
diff --git a/sdks/java/io/hadoop-input-for
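The validation this diff adds can be exercised in isolation. The sketch below is a hypothetical stand-in that uses a plain Map where the real code uses Hadoop's Configuration object; only the property names ("mapreduce.job.inputformat.class" and "mapreduce.jdbc.input.orderby") and the error message are taken from the patch itself:

```java
import java.util.HashMap;
import java.util.Map;

public class OrderByCheck {
    // Property names as stated in the patch (mirroring Hadoop's DBConfiguration constants).
    static final String INPUT_FORMAT_CLASS = "mapreduce.job.inputformat.class";
    static final String INPUT_ORDER_BY = "mapreduce.jdbc.input.orderby";

    /** Rejects a DBInputFormat configuration that lacks a deterministic ORDER BY key. */
    static void validate(Map<String, String> conf) {
        String inputFormat = conf.get(INPUT_FORMAT_CLASS);
        if (inputFormat != null
                && inputFormat.endsWith("DBInputFormat")
                && conf.get(INPUT_ORDER_BY) == null) {
            throw new IllegalArgumentException(
                "Configuration must contain \"" + INPUT_ORDER_BY + "\" when using DBInputFormat");
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(INPUT_FORMAT_CLASS, "org.apache.hadoop.mapreduce.lib.db.DBInputFormat");
        boolean rejected = false;
        try {
            validate(conf);           // no order-by key set: must fail
        } catch (IllegalArgumentException e) {
            rejected = true;
        }
        conf.put(INPUT_ORDER_BY, "id ASC");
        validate(conf);               // passes once an order-by key is set
        System.out.println("rejected without orderby: " + rejected);
    }
}
```

The point of failing fast here is that a missing ORDER BY does not make the read crash; it silently produces duplicated or missing rows, so the only safe place to catch it is configuration validation.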
[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid
[ https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92786&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92786 ]

ASF GitHub Bot logged work on BEAM-3484:
Author: ASF GitHub Bot
Created on: 19/Apr/18 18:46
Start Date: 19/Apr/18 18:46
Worklog Time Spent: 10m

Work Description: lgajowy commented on issue #5166: [BEAM-3484] Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382841237

I ran the IT on this branch for 5 000 000 records, twice. Both tests succeeded and I no longer see the issue. Thanks!

Issue Time Tracking
---
Worklog Id: (was: 92786)
Time Spent: 1h 50m (was: 1h 40m)
[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid
[ https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92659&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92659 ]

ASF GitHub Bot logged work on BEAM-3484:
Author: ASF GitHub Bot
Created on: 19/Apr/18 16:18
Start Date: 19/Apr/18 16:18
Worklog Time Spent: 10m

Work Description: aromanenko-dev commented on issue #5166: [BEAM-3484] Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382795598

Run Java PreCommit

Issue Time Tracking
---
Worklog Id: (was: 92659)
Time Spent: 1h 40m (was: 1.5h)
[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid
[ https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92640&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92640 ]

ASF GitHub Bot logged work on BEAM-3484:
Author: ASF GitHub Bot
Created on: 19/Apr/18 15:52
Start Date: 19/Apr/18 15:52
Worklog Time Spent: 10m

Work Description: lgajowy commented on issue #5166: [BEAM-3484] Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382787122

This looks cool. :) @iemejia I'll verify this as you asked.

Issue Time Tracking
---
Worklog Id: (was: 92640)
Time Spent: 1h 20m (was: 1h 10m)
[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid
[ https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92641&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92641 ]

ASF GitHub Bot logged work on BEAM-3484:
Author: ASF GitHub Bot
Created on: 19/Apr/18 15:52
Start Date: 19/Apr/18 15:52
Worklog Time Spent: 10m

Work Description: lgajowy commented on issue #5166: [BEAM-3484] Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382787122

@aromanenko-dev This looks cool. :) @iemejia I'll verify this as you asked.

Issue Time Tracking
---
Worklog Id: (was: 92641)
Time Spent: 1.5h (was: 1h 20m)
[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid
[ https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92636&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92636 ]

ASF GitHub Bot logged work on BEAM-3484:
Author: ASF GitHub Bot
Created on: 19/Apr/18 15:37
Start Date: 19/Apr/18 15:37
Worklog Time Spent: 10m

Work Description: aromanenko-dev commented on issue #5166: [BEAM-3484] Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382780585

Run Java Gradle PreCommit

Issue Time Tracking
---
Worklog Id: (was: 92636)
Time Spent: 1h 10m (was: 1h)
[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid
[ https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92632&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92632 ]

ASF GitHub Bot logged work on BEAM-3484:
Author: ASF GitHub Bot
Created on: 19/Apr/18 15:36
Start Date: 19/Apr/18 15:36
Worklog Time Spent: 10m

Work Description: aromanenko-dev commented on issue #5166: [BEAM-3484] Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382781601

Run Java PreCommit

Issue Time Tracking
---
Worklog Id: (was: 92632)
Time Spent: 1h (was: 50m)
[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid
[ https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92627&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92627 ]

ASF GitHub Bot logged work on BEAM-3484:
Author: ASF GitHub Bot
Created on: 19/Apr/18 15:33
Start Date: 19/Apr/18 15:33
Worklog Time Spent: 10m

Work Description: aromanenko-dev commented on issue #5166: [BEAM-3484] Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382780585

Run Java Gradle PreCommit

Issue Time Tracking
---
Worklog Id: (was: 92627)
Time Spent: 50m (was: 40m)
[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid
[ https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92562&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92562 ]

ASF GitHub Bot logged work on BEAM-3484:
Author: ASF GitHub Bot
Created on: 19/Apr/18 13:44
Start Date: 19/Apr/18 13:44
Worklog Time Spent: 10m

Work Description: aromanenko-dev commented on issue #5166: [BEAM-3484] Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382741185

@iemejia Done

Issue Time Tracking
---
Worklog Id: (was: 92562)
Time Spent: 40m (was: 0.5h)
[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid
[ https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92146&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92146 ]

ASF GitHub Bot logged work on BEAM-3484:
Author: ASF GitHub Bot
Created on: 18/Apr/18 15:45
Start Date: 18/Apr/18 15:45
Worklog Time Spent: 10m

Work Description: iemejia commented on issue #5166: [BEAM-3484] Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382433558

@lgajowy can you please help us validate whether the fix proposed by @aromanenko-dev resolves the issue? In our local tests it does, but it's better to double-check with you since you reported it.

Issue Time Tracking
---
Worklog Id: (was: 92146)
Time Spent: 0.5h (was: 20m)
[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid
[ https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92143&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92143 ]

ASF GitHub Bot logged work on BEAM-3484:
Author: ASF GitHub Bot
Created on: 18/Apr/18 15:33
Start Date: 18/Apr/18 15:33
Worklog Time Spent: 10m

Work Description: aromanenko-dev commented on issue #5166: [BEAM-3484] Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382429406

R: @iemejia @chamikaramj

Issue Time Tracking
---
Worklog Id: (was: 92143)
Time Spent: 20m (was: 10m)
[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid
[ https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92133&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92133 ]

ASF GitHub Bot logged work on BEAM-3484:
Author: ASF GitHub Bot
Created on: 18/Apr/18 15:21
Start Date: 18/Apr/18 15:21
Worklog Time Spent: 10m

Work Description: aromanenko-dev opened a new pull request #5166: [BEAM-3484] Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166

When using DBInputFormat to fetch data from an RDBMS, Beam parallelizes the process by using the LIMIT and OFFSET clauses of the SQL query, so that different workers fetch different ranges of records (one range per split). By default, an RDBMS does not guarantee a predictable order of results: the same query can return rows in a different order every time, which can cause duplicated or missing rows in the final result. To guarantee a stable order and a proper split of results, the client must order them by one or more keys (either PRIMARY or UNIQUE). This can be done by setting a configuration option in the Hadoop configuration.

Follow this checklist to help us incorporate your contribution quickly and easily:
- [x] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue.
- [x] Write a pull request description that is detailed enough to understand:
  - [x] What the pull request does
  - [x] Why it does it
  - [x] How it does it
  - [x] Why this approach
- [x] Each commit in the pull request should have a meaningful subject line and body.
- [x] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
- [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

Issue Time Tracking
---
Worklog Id: (was: 92133)
Time Spent: 10m
Remaining Estimate: 0h
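The LIMIT/OFFSET splitting described in the pull request can be sketched with plain strings. Everything below (table name, column list, helper names) is illustrative; the real per-split SQL is generated inside Hadoop's DBInputFormat machinery, not by user code:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitQueries {
    /** Builds one LIMIT/OFFSET query per split over totalRows rows. */
    static List<String> splitQueries(String table, String orderBy, long totalRows, int splits) {
        List<String> queries = new ArrayList<>();
        long chunk = totalRows / splits;
        for (int i = 0; i < splits; i++) {
            long offset = i * chunk;
            long limit = (i == splits - 1) ? totalRows - offset : chunk;
            // Without the ORDER BY clause, the database may return rows in a
            // different order for each query, so the ranges read by different
            // workers can overlap (duplicates) or leave gaps (missing rows).
            queries.add("SELECT id, name FROM " + table
                + " ORDER BY " + orderBy
                + " LIMIT " + limit + " OFFSET " + offset);
        }
        return queries;
    }

    public static void main(String[] args) {
        // Hypothetical table and size matching the dataset where the bug appeared.
        for (String q : splitQueries("beam_test", "id ASC", 700_000, 4)) {
            System.out.println(q);
        }
    }
}
```

Run four such queries without the ORDER BY clause and the union of the results is not guaranteed to equal the table, which is exactly the duplicate/missing-row symptom reported in BEAM-3484; with an ORDER BY on a PRIMARY or UNIQUE key, the four ranges partition the table deterministically.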