[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

2018-04-20 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=93122&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-93122
 ]

ASF GitHub Bot logged work on BEAM-3484:


Author: ASF GitHub Bot
Created on: 20/Apr/18 08:10
Start Date: 20/Apr/18 08:10
Worklog Time Spent: 10m 
  Work Description: aromanenko-dev commented on issue #5166: [BEAM-3484] 
Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-383019036
 
 
   @lgajowy Thank you for testing this! 
   @iemejia Thank you for the review and merging!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 93122)
Time Spent: 2h 20m  (was: 2h 10m)

> HadoopInputFormatIO reads big datasets invalid
> --
>
> Key: BEAM-3484
> URL: https://issues.apache.org/jira/browse/BEAM-3484
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-hadoop
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Łukasz Gajowy
>Assignee: Alexey Romanenko
>Priority: Minor
> Fix For: 2.5.0
>
> Attachments: result_sorted100, result_sorted60
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> For big datasets, HadoopInputFormat sometimes skips or duplicates elements from the 
> database in the resulting PCollection, giving an incorrect read result.
> This occurred while developing HadoopInputFormatIOIT and running it on 
> Dataflow. For datasets smaller than or equal to 600 000 database rows I wasn't 
> able to reproduce the issue. The bug appeared only for bigger sets, e.g. 700 000 or 
> 1 000 000. 
> Attachments:
>  - a text file with the sorted HadoopInputFormat.read() result saved using 
> TextIO.write().to().withoutSharding(). If you look carefully, you'll notice 
> duplicates or missing values that should not happen
>  - the same text file for 600 000 records, without any duplicates or missing 
> elements
>  - a link to a PR with a HadoopInputFormatIO integration test that allows 
> reproducing this issue. At the moment of writing, this code is not merged yet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

2018-04-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92820&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92820
 ]

ASF GitHub Bot logged work on BEAM-3484:


Author: ASF GitHub Bot
Created on: 19/Apr/18 20:05
Start Date: 19/Apr/18 20:05
Worklog Time Spent: 10m 
  Work Description: iemejia commented on issue #5166: [BEAM-3484] Fix split 
issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382863883
 
 
   Thanks @aromanenko-dev for finding the root issue and fixing it. Also thanks 
@lgajowy for reporting it and for verifying that it works.




Issue Time Tracking
---

Worklog Id: (was: 92820)
Time Spent: 2h 10m  (was: 2h)



[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

2018-04-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92819&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92819
 ]

ASF GitHub Bot logged work on BEAM-3484:


Author: ASF GitHub Bot
Created on: 19/Apr/18 20:04
Start Date: 19/Apr/18 20:04
Worklog Time Spent: 10m 
  Work Description: iemejia closed pull request #5166: [BEAM-3484] Fix 
split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java b/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java
index b22d57caa67..0ffd402320d 100644
--- a/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java
+++ b/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java
@@ -62,6 +62,7 @@
 import org.apache.hadoop.mapreduce.RecordReader;
 import org.apache.hadoop.mapreduce.TaskAttemptContext;
 import org.apache.hadoop.mapreduce.TaskAttemptID;
+import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
 import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -163,6 +164,21 @@
  *  .withValueTranslation(myOutputValueType);
  * }
  * 
+ *
+ * IMPORTANT! In case of using {@code DBInputFormat} to read data from RDBMS, Beam parallelizes
+ * the process by using LIMIT and OFFSET clauses of SQL query to fetch different ranges of records
+ * (as a split) by different workers. To guarantee the same order and proper split of results you
+ * need to order them by one or more keys (either PRIMARY or UNIQUE). It can be done during
+ * configuration step, for example:
+ *
+ * 
+ * {@code
+ * Configuration conf = new Configuration();
+ * conf.set(DBConfiguration.INPUT_TABLE_NAME_PROPERTY, tableName);
+ * conf.setStrings(DBConfiguration.INPUT_FIELD_NAMES_PROPERTY, "id", "name");
+ * conf.set(DBConfiguration.INPUT_ORDER_BY_PROPERTY, "id ASC");
+ * }
+ * 
  */
 @Experimental(Experimental.Kind.SOURCE_SINK)
 public class HadoopInputFormatIO {
@@ -283,7 +299,9 @@
 
 /**
  * Validates that the mandatory configuration properties such as InputFormat class, InputFormat
- * key and value classes are provided in the Hadoop configuration.
+ * key and value classes are provided in the Hadoop configuration. In case of using {@code
+ * DBInputFormat} you need to order results by one or more keys. It can be done by setting
+ * configuration option "mapreduce.jdbc.input.orderby".
  */
 private void validateConfiguration(Configuration configuration) {
   checkArgument(configuration != null, "configuration can not be null");
@@ -294,6 +312,13 @@ private void validateConfiguration(Configuration configuration) {
       configuration.get("key.class") != null, "configuration must contain \"key.class\"");
   checkArgument(
       configuration.get("value.class") != null, "configuration must contain \"value.class\"");
+  if (configuration.get("mapreduce.job.inputformat.class").endsWith("DBInputFormat")) {
+    checkArgument(
+        configuration.get(DBConfiguration.INPUT_ORDER_BY_PROPERTY) != null,
+        "Configuration must contain \""
+            + DBConfiguration.INPUT_ORDER_BY_PROPERTY
+            + "\" when using DBInputFormat");
+  }
 }
 
 /**
diff --git a/sdks/java/io/hadoop-input-format/src/test/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIOIT.java b/sdks/java/io/hadoop-input-format/src/test/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIOIT.java
index 58f3b0dafa0..e24dd68dd2c 100644
--- a/sdks/java/io/hadoop-input-format/src/test/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIOIT.java
+++ b/sdks/java/io/hadoop-input-format/src/test/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIOIT.java
@@ -110,6 +110,7 @@ private static void setupHadoopConfiguration(IOTestPipelineOptions options) {
     );
 conf.set(DBConfiguration.INPUT_TABLE_NAME_PROPERTY, tableName);
 conf.setStrings(DBConfiguration.INPUT_FIELD_NAMES_PROPERTY, "id", "name");
+conf.set(DBConfiguration.INPUT_ORDER_BY_PROPERTY, "id ASC");
 conf.setClass(DBConfiguration.INPUT_CLASS_PROPERTY, TestRowDBWritable.class, DBWritable.class);

 conf.setClass("key.class", LongWritable.class, Object.class);
diff --git a/sdks/java/io/hadoop-input-for
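The validation added in the hunk above can be illustrated with a minimal, self-contained sketch. This is an assumption-laden simplification: a plain `Map` stands in for Hadoop's `Configuration`, the property-name constants mirror (but are not) Hadoop's `DBConfiguration` fields, and the class name is illustrative, not Beam's API:

```java
import java.util.HashMap;
import java.util.Map;

public class OrderByValidationSketch {
    // Property names mirroring Hadoop's DBConfiguration constants (assumed values).
    static final String INPUT_FORMAT_CLASS = "mapreduce.job.inputformat.class";
    static final String INPUT_ORDER_BY = "mapreduce.jdbc.input.orderby";

    /** Rejects configurations that use DBInputFormat without an ORDER BY clause. */
    static void validate(Map<String, String> conf) {
        String inputFormat = conf.get(INPUT_FORMAT_CLASS);
        if (inputFormat != null && inputFormat.endsWith("DBInputFormat")
                && conf.get(INPUT_ORDER_BY) == null) {
            throw new IllegalArgumentException(
                "Configuration must contain \"" + INPUT_ORDER_BY + "\" when using DBInputFormat");
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(INPUT_FORMAT_CLASS, "org.apache.hadoop.mapreduce.lib.db.DBInputFormat");
        boolean rejected = false;
        try {
            validate(conf); // no orderby property set yet, so this throws
        } catch (IllegalArgumentException e) {
            rejected = true;
        }
        conf.put(INPUT_ORDER_BY, "id ASC");
        validate(conf); // passes once an ordering key is configured
        System.out.println(rejected);
    }
}
```

The check is deliberately fail-fast: a missing ordering key only manifests as silent skips/duplicates on large datasets, so surfacing it at configuration time is cheaper than debugging a wrong read.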

[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

2018-04-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92786&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92786
 ]

ASF GitHub Bot logged work on BEAM-3484:


Author: ASF GitHub Bot
Created on: 19/Apr/18 18:46
Start Date: 19/Apr/18 18:46
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on issue #5166: [BEAM-3484] Fix split 
issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382841237
 
 
   I ran the IT on this branch for 5_000_000 records, twice. Both tests succeeded 
and I no longer see the issue. Thanks!




Issue Time Tracking
---

Worklog Id: (was: 92786)
Time Spent: 1h 50m  (was: 1h 40m)



[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

2018-04-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92659&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92659
 ]

ASF GitHub Bot logged work on BEAM-3484:


Author: ASF GitHub Bot
Created on: 19/Apr/18 16:18
Start Date: 19/Apr/18 16:18
Worklog Time Spent: 10m 
  Work Description: aromanenko-dev commented on issue #5166: [BEAM-3484] 
Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382795598
 
 
   Run Java PreCommit




Issue Time Tracking
---

Worklog Id: (was: 92659)
Time Spent: 1h 40m  (was: 1.5h)



[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

2018-04-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92640&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92640
 ]

ASF GitHub Bot logged work on BEAM-3484:


Author: ASF GitHub Bot
Created on: 19/Apr/18 15:52
Start Date: 19/Apr/18 15:52
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on issue #5166: [BEAM-3484] Fix split 
issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382787122
 
 
   This looks cool. :)
   
   @iemejia I'll verify this as you asked. 




Issue Time Tracking
---

Worklog Id: (was: 92640)
Time Spent: 1h 20m  (was: 1h 10m)



[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

2018-04-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92641&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92641
 ]

ASF GitHub Bot logged work on BEAM-3484:


Author: ASF GitHub Bot
Created on: 19/Apr/18 15:52
Start Date: 19/Apr/18 15:52
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on issue #5166: [BEAM-3484] Fix split 
issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382787122
 
 
   @aromanenko-dev This looks cool. :)
   
   @iemejia I'll verify this as you asked. 




Issue Time Tracking
---

Worklog Id: (was: 92641)
Time Spent: 1.5h  (was: 1h 20m)



[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

2018-04-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92636&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92636
 ]

ASF GitHub Bot logged work on BEAM-3484:


Author: ASF GitHub Bot
Created on: 19/Apr/18 15:37
Start Date: 19/Apr/18 15:37
Worklog Time Spent: 10m 
  Work Description: aromanenko-dev commented on issue #5166: [BEAM-3484] 
Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382780585
 
 
   Run Java Gradle PreCommit




Issue Time Tracking
---

Worklog Id: (was: 92636)
Time Spent: 1h 10m  (was: 1h)



[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

2018-04-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92632&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92632
 ]

ASF GitHub Bot logged work on BEAM-3484:


Author: ASF GitHub Bot
Created on: 19/Apr/18 15:36
Start Date: 19/Apr/18 15:36
Worklog Time Spent: 10m 
  Work Description: aromanenko-dev commented on issue #5166: [BEAM-3484] 
Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382781601
 
 
   Run Java PreCommit




Issue Time Tracking
---

Worklog Id: (was: 92632)
Time Spent: 1h  (was: 50m)



[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

2018-04-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92627&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92627
 ]

ASF GitHub Bot logged work on BEAM-3484:


Author: ASF GitHub Bot
Created on: 19/Apr/18 15:33
Start Date: 19/Apr/18 15:33
Worklog Time Spent: 10m 
  Work Description: aromanenko-dev commented on issue #5166: [BEAM-3484] 
Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382780585
 
 
   Run Java Gradle PreCommit




Issue Time Tracking
---

Worklog Id: (was: 92627)
Time Spent: 50m  (was: 40m)



[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

2018-04-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92562&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92562
 ]

ASF GitHub Bot logged work on BEAM-3484:


Author: ASF GitHub Bot
Created on: 19/Apr/18 13:44
Start Date: 19/Apr/18 13:44
Worklog Time Spent: 10m 
  Work Description: aromanenko-dev commented on issue #5166: [BEAM-3484] 
Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382741185
 
 
   @iemejia Done




Issue Time Tracking
---

Worklog Id: (was: 92562)
Time Spent: 40m  (was: 0.5h)



[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

2018-04-18 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92146&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92146
 ]

ASF GitHub Bot logged work on BEAM-3484:


Author: ASF GitHub Bot
Created on: 18/Apr/18 15:45
Start Date: 18/Apr/18 15:45
Worklog Time Spent: 10m 
  Work Description: iemejia commented on issue #5166: [BEAM-3484] Fix split 
issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382433558
 
 
   @lgajowy can you please help us validate whether the proposed fix by 
@aromanenko-dev fixes the issue? In our local tests it does, but it's better to 
double-check with you since you reported it.




Issue Time Tracking
---

Worklog Id: (was: 92146)
Time Spent: 0.5h  (was: 20m)



[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

2018-04-18 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92143&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92143
 ]

ASF GitHub Bot logged work on BEAM-3484:


Author: ASF GitHub Bot
Created on: 18/Apr/18 15:33
Start Date: 18/Apr/18 15:33
Worklog Time Spent: 10m 
  Work Description: aromanenko-dev commented on issue #5166: [BEAM-3484] 
Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166#issuecomment-382429406
 
 
   R: @iemejia @chamikaramj 




Issue Time Tracking
---

Worklog Id: (was: 92143)
Time Spent: 20m  (was: 10m)

> HadoopInputFormatIO reads big datasets invalid
> --
>
> Key: BEAM-3484
> URL: https://issues.apache.org/jira/browse/BEAM-3484
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-hadoop
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Łukasz Gajowy
>Assignee: Alexey Romanenko
>Priority: Minor
> Fix For: 2.5.0
>
> Attachments: result_sorted100, result_sorted60
>
>  Time Spent: 20m
>  Remaining Estimate: 0h





[jira] [Work logged] (BEAM-3484) HadoopInputFormatIO reads big datasets invalid

2018-04-18 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3484?focusedWorklogId=92133&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-92133
 ]

ASF GitHub Bot logged work on BEAM-3484:


Author: ASF GitHub Bot
Created on: 18/Apr/18 15:21
Start Date: 18/Apr/18 15:21
Worklog Time Spent: 10m 
  Work Description: aromanenko-dev opened a new pull request #5166: 
[BEAM-3484] Fix split issue in HadoopInputFormatIOIT
URL: https://github.com/apache/beam/pull/5166
 
 
   When using DBInputFormat to fetch data from an RDBMS, Beam parallelises the 
process by using the LIMIT and OFFSET clauses of the SQL query, so that 
different workers fetch different ranges of records (one range per split). By 
default, an RDBMS doesn't guarantee a predictable order of results, and the 
order can differ between executions of the same query; this can cause 
duplicated or missing rows in the final result.
   To guarantee a stable order, and therefore a proper split of the results, 
the client must order them by one or more keys (either PRIMARY or UNIQUE). 
This can be done by setting a configuration option in the Hadoop configuration.
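   The ordering requirement described above can be illustrated outside Beam. 
The following is a minimal sketch (plain Python with an in-memory SQLite table 
standing in for the RDBMS; the table name, row count, and split logic are 
illustrative, not Beam's actual code) showing that LIMIT/OFFSET splits 
reassemble the full dataset exactly when every range query orders by a unique 
key:

```python
import sqlite3

# Hypothetical in-memory table standing in for the database that
# HadoopInputFormatIO would read; names and sizes are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, "row-%d" % i) for i in range(1000)])

def read_split(offset, limit):
    # Ordering by the PRIMARY KEY makes LIMIT/OFFSET pagination
    # deterministic; without the ORDER BY the database is free to
    # return rows in any order, so consecutive splits may overlap
    # (duplicates) or leave gaps (missing rows).
    return conn.execute(
        "SELECT id FROM t ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset)).fetchall()

split_size = 250
ids = []
for offset in range(0, 1000, split_size):
    ids.extend(r[0] for r in read_split(offset, split_size))

# Every row appears exactly once: no duplicates, no gaps.
assert sorted(ids) == list(range(1000))
assert len(ids) == len(set(ids))
```

   With DBInputFormat, this ordering is (to my understanding) controlled via 
the `mapreduce.jdbc.input.orderby` Hadoop configuration property 
(`DBConfiguration.INPUT_ORDER_BY_PROPERTY`); dropping the ORDER BY in the 
sketch above would leave the database free to return each range in an 
arbitrary order, which is exactly how the duplicates and gaps arise.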
   
   
   
   Follow this checklist to help us incorporate your contribution quickly and 
easily:
   
- [x] Make sure there is a [JIRA 
issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the 
change (usually before you start working on it).  Trivial changes like typos do 
not require a JIRA issue.  Your pull request should address just this issue, 
without pulling in other changes.
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue.
- [x] Write a pull request description that is detailed enough to 
understand:
  - [x] What the pull request does
  - [x] Why it does it
  - [x] How it does it
  - [x] Why this approach
- [x] Each commit in the pull request should have a meaningful subject line 
and body.
- [x] Run `mvn clean verify` to make sure basic checks pass. A more 
thorough check will be performed on your pull request automatically.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   




Issue Time Tracking
---

Worklog Id: (was: 92133)
Time Spent: 10m
Remaining Estimate: 0h

> HadoopInputFormatIO reads big datasets invalid
> --
>
> Key: BEAM-3484
> URL: https://issues.apache.org/jira/browse/BEAM-3484
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-hadoop
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Łukasz Gajowy
>Assignee: Alexey Romanenko
>Priority: Minor
> Fix For: 2.5.0
>
> Attachments: result_sorted100, result_sorted60
>
>  Time Spent: 10m
>  Remaining Estimate: 0h


